Utilities
Load Script
The following script just creates jobs, which is useful for testing your settings. There is no error checking, and you will have to modify the -q option to fit your system.
#!/bin/sh
# First argument is the number of jobs
# Second argument is the seconds to sleep
QSUB_OPTIONS=" \
 -q "'*&!boinc'" \
 -cwd \
 -j y \
 -V \
 -N load \
 -o `pwd`/load.out \
"
for i in `seq $1`
do
    qsub $QSUB_OPTIONS /sjr/beodata/local/bin/vanilla_job.sh
    sleep $2
done
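For example, assuming the script above is saved as load.sh, the following would submit 20 test jobs at two-second intervals:

./load.sh 20 2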
Using Ganglia as Load Sensor
See Using Ganglia As Load Sensor for guidance on using Ganglia as a load sensor for Grid Engine.
Modding qstat
The output from qstat -ext, while complete, is overly verbose (and over 200 characters wide) in many cases. In SGE 5.3, the following will strip out the Department, Deadline, PE Master, and Array Task columns, which are frequently unused. It makes a good alias (named "eqstat", or something similar):
qstat -ext | cut -c 1-33,39-45,66-92,110-191
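Since the column ranges are fiddly to retype, it is easiest kept as an alias (the "eqstat" name is just the suggestion from above), e.g. in ~/.bashrc:

alias eqstat='qstat -ext | cut -c 1-33,39-45,66-92,110-191'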
Modding qstat Redux
A longer script, but more condensed output from Andy Schwierskott, pulled from the SGE mailing list:
#!/bin/sh echo "JobId P S Project User Tot-Tkt ovrts otckt dtckt ftckt stckt shr" echo "---------------------------------------------------------------------------------------" qstat -ext -s rs | grep -v job-ID | sed /-------------/d | \ gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \ $1, $2, $7, $5, $4, $13, $14, $15, $16, $17, $18, $19) }' echo "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -" qstat -ext -s p | grep -v job-ID | sed /-------------/d | \ gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \ $1, $2, $7, $5, $4, $10, $11, $12, $13, $14, $15, $16 ); }'
User Management
If you have lots of users and groups to add at once, and your userlists map to unix groups, qconf can help automate this a bit. This awk snippet looks for /etc/group entries starting with "grp" and, for each one, prints a qconf command that adds the group's members to an access list named after the group.
awk -F: '/^grp/{print "qconf -au ",$4,$1}' /etc/group
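For example, given an /etc/group line like grpchem:x:1234:alice,bob, the snippet prints "qconf -au  alice,bob grpchem". Once the generated commands look right, they can be piped straight to a shell (a usage sketch, assuming your group names really do start with "grp"):

awk -F: '/^grp/{print "qconf -au ",$4,$1}' /etc/group | sh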
qcd
If your users have the same shared file space as the cluster, this alias will change directory to the current working directory of the passed-in job ID.
It has to be an alias (or sourced) in order to affect the current shell.
I use tcsh, which is why there is all of the bizarre backslashing.
Add the line below to the system shell startup file or ~/.tcshrc
# tcsh: alias qcd cd\ \`qstat\ -j\ \!\*\|awk\ \'/^cwd/\{print\ \$2\}\'\`
For bash/ksh/zsh, a shell function works well:
function qcd { cd `qstat -j $1 | awk '/^cwd/{print $2}'`; }
This could be revisited to make it a little nicer with the newer XML output format.
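A minimal sketch of that idea, assuming qstat's -xml option (SGE 6.x) and that the detailed job listing reports the working directory in a JB_cwd element - both assumptions worth verifying against your own qstat output:

# Hypothetical XML variant; check the element name in `qstat -xml -j <id>` first
function qcd { cd $(qstat -xml -j $1 | sed -n 's:.*<JB_cwd>\(.*\)</JB_cwd>.*:\1:p'); }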
qavailable_processors
This script just sums up the processors that aren't performing a task. You will have to change the qstat options to suit your configuration.
#!/bin/sh
qstat -g c -l arch=lx24-amd64 -q all.q | awk 'NR > 2 {sum = sum + $4} END {print sum}'
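For reference, 'NR > 2' skips the two header lines that qstat -g c prints, and $4 is the AVAIL column in the layout sketched below (hypothetical numbers; some SGE versions insert a RES column before AVAIL, so check your own header and adjust the column number):

CLUSTER QUEUE     CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
all.q               0.52     12     20     32      0      0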
qtail
The following script will 'tail' the standard output file of the passed-in job ID.
#!/bin/sh
if [ $# -ne 1 ]; then
    echo "Usage:"
    echo "    $0 <sge_job_number>"
    exit
fi
out_path=`qstat -j $1 | grep ^stdout_path_list | awk '{print $2}'`
if [ "X$out_path" = "X" ]; then
    exit
fi
if [ "$out_path" = "/dev/null" ]; then
    echo "Standard output is not available because it was directed to /dev/null"
    exit
fi
tail -f $out_path
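Assuming the script above is saved as qtail somewhere on your PATH, following the output of job 12345 is then just:

qtail 12345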
rj: qsub wrapper script
I created a wrapper script for qsub so that my users do not need to know the complexity of the cluster configuration or job submission. I called the wrapper script 'rj' (for "run jobs") in order to distinguish it from the collection of 'q*' apps that come with Grid Engine.
It has the following features:
- Detects whether the application is checkpointable (with Condor)
- Detects whether the application is MPI (currently known to work with Open MPI and possibly MPICH), and then checks that it was given the option specifying the number of processors
- Sets the maximum number of parallel processors to what is available so that the job can be dispatched immediately
Realize that this is very specific to my setup, and you will need to go through it and edit it to match what you want it to do. It also has a lot of history that should probably be re-written. I am moving toward allowing 'rj' to accept all arguments that 'qsub' accepts and pass them through, defaulting or forcing those options that I need to change.
#!/bin/bash

# Defaults
WHOAMI=`whoami`
EMAIL=${WHOAMI}@sjrwmd.com
EXTRA_OPTIONS=" \
 -q "'*&!boinc'" \
 -cwd \
 -j y \
 -M ${EMAIL} \
 -V \
"
PE=mpi
NP_OPTION=0
P_OPTION=0
I_OPTION=0

function usage(){
echo "
rj [-np NUM] [-i INPUT_FILE] [-mb] EXECUTABLE
    # EXECUTABLE runs on appropriate node.

rj -np NUM MPI_EXECUTABLE
    # Runs the MPI parallel MPI_EXECUTABLE on NUM nodes.
    # NUM can be a single number or a range, for example 2-8, would run the
    # MPI_EXECUTABLE on at least 2 processors up to 8 processors.

rj -i INPUT_FILE EXECUTABLE
    # Runs EXECUTABLE with interactive input supplied by INPUT_FILE.

rj [-mb] EXECUTABLE
    # -b Waits for EXECUTABLE to finish (blocks) before returning.
    # -m Sends e-mail at end of job."
exit 1
}

# Detect the project option -P and the -i option
# Didn't try getopt - suspect it wouldn't work since the command can have options.
while [ $# -ne 0 ]
do
    case $1 in
        -P)
            EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
            shift
            shift
            P_OPTION=1
            ;;
        -i)
            EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
            shift
            shift
            I_OPTION=1
            ;;
        -np)
            MAXNP=$2
            max_num_slaves=`qstat -g c -l arch=lx24-amd64 | awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}'`
            if [ $MAXNP -gt $max_num_slaves ]; then
                MAXNP=$max_num_slaves
                echo "Number of slaves set to maximum allowable = ${max_num_slaves}."
            fi
            EXTRA_OPTIONS=" $EXTRA_OPTIONS -pe $PE 1-$MAXNP "
            shift
            shift
            NP_OPTION=1
            ;;
        -b)
            EXTRA_OPTIONS=" $EXTRA_OPTIONS -sync y "
            shift
            ;;
        -m)
            EXTRA_OPTIONS=" $EXTRA_OPTIONS -m eas "
            shift
            ;;
        -h|--help)
            usage
            ;;
        *)
            break
            ;;
    esac
done

if [ $# -eq 0 ]; then
    usage
fi

full_path=`which $1 2> /dev/null`
if [ "x"${full_path} == "x" ]; then
    echo "Executable $1 not found in any directory in $PATH"
    usage
fi

static=`file -L ${full_path} | grep --count --max-count=1 statically`
condor=`nm ${full_path} 2>&1 | grep --count --max-count=1 Condor`
mpi=`nm ${full_path} 2>&1 | grep --count --max-count=1 MPI_`
if [ ${mpi} -eq 0 ]; then
    mpi=`ldd ${full_path} 2>&1 | grep --count --max-count=1 libmpi`
fi
opteron=`file -L ${full_path} | grep --count --max-count=1 'AMD x86-64'`

APPEND_VAR="$NP_OPTION$P_OPTION$I_OPTION$static$condor$mpi$opteron"

. ${SGE_ROOT}/default/common/settings.sh

# JOBNAME has to be set AFTER options are processed
JOBNAME=`basename $1`
EXTRA_OPTIONS="${EXTRA_OPTIONS} -N ${JOBNAME} -o "`pwd`"/${JOBNAME}.out "

case $APPEND_VAR in
    1????0?)
        echo ""
        echo "You specified multiple processors but the executable is not MPI."
        echo ""
        usage
        ;;
    0????1?)
        echo ""
        echo "The executable is MPI, please pass the rj script the -np NUM option."
        echo ""
        usage
        ;;
    1????1?)
        qsub ${EXTRA_OPTIONS} \
            /sjr/beodata/local/bin/parallel_job.sh $*
        ;;
    ???11??)
        qsub ${EXTRA_OPTIONS} \
            -ckpt condor_ckpt \
            /sjr/beodata/local/bin/ckpt_job.sh $*
        ;;
    ??????1)
        qsub ${EXTRA_OPTIONS} \
            -l arch=lx24-amd64 \
            /sjr/beodata/local/bin/vanilla_job.sh $*
        ;;
    ???00??)
        qsub ${EXTRA_OPTIONS} \
            /sjr/beodata/local/bin/vanilla_job.sh $*
        ;;
    ???10??)
        echo ""
        echo "The executable is static, but without Condor."
        echo "Checkpointing is only available with Condor."
        echo "Use condor_compile to add the checkpointing libraries."
        echo ""
        qsub ${EXTRA_OPTIONS} \
            /sjr/beodata/local/bin/vanilla_job.sh $*
        ;;
    *)
        usage
        ;;
esac
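The dispatch logic keys off APPEND_VAR, a seven-character flag string built near the end of the script; reading the case patterns position by position makes them less cryptic:

# Position:  1          2         3         4       5       6    7
# Meaning:   NP_OPTION  P_OPTION  I_OPTION  static  condor  mpi  opteron
# e.g. "1????0?" matches "-np was given but the executable is not MPI",
# and "???11??" matches "statically linked with the Condor libraries".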
Required supporting scripts for 'rj':
vanilla_job.sh
#!/bin/bash
echo ${JOB_ID} > sge_job_id
: > $SGE_STDOUT_PATH
#$ -S /bin/bash
$*
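The ': > $SGE_STDOUT_PATH' line truncates the job's previous output file so that each run (and the qtail script above) starts from a clean file, and '#$ -S /bin/bash' is an embedded qsub directive selecting the shell the job runs under.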
parallel_job.sh
#!/bin/bash
echo ${JOB_ID} > sge_job_id
: > $SGE_STDOUT_PATH
#$ -S /bin/bash
mpirun -np $NSLOTS $*
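Grid Engine sets $NSLOTS to the number of slots actually granted by the parallel environment, so mpirun starts the right number of processes even when a range such as 2-8 was requested.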
ckpt_job.sh
#!/bin/bash
#$ -S /bin/bash
cd $PWD
echo ${JOB_ID} > sge_job_id
: > $SGE_STDOUT_PATH
CHECKPOINT_DIR="/sjr/beodata/tmp/ckpt"
OUTPUT="${CHECKPOINT_DIR}/${JOB_ID}.log"
echo "-------------------------------------" >> $OUTPUT
if [ ${RESTARTED} -eq 0 ]; then
    echo " Starting job #${JOB_ID}" >> $OUTPUT
    $1 -_condor_ckpt ${CHECKPOINT_DIR}/${JOB_ID}.ckpt $@ &
else
    echo " Re-Starting job #${JOB_ID}" >> $OUTPUT
    $1 -_condor_restart ${CHECKPOINT_DIR}/${JOB_ID}.ckpt &
fi
PROC_PID=$!
echo $! > ${CHECKPOINT_DIR}/${JOB_ID}.pid
echo "-------------------------------------" >> $OUTPUT
echo " Date:     `date`" >> $OUTPUT
echo " Job PID:  ${PROC_PID}" >> $OUTPUT
echo " Hostname: `hostname`" >> $OUTPUT
echo "-------------------------------------" >> $OUTPUT
wait
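Grid Engine sets the RESTARTED environment variable to 1 for a job that has been restarted, which is what selects the restart branch above. The -_condor_ckpt and -_condor_restart flags are interpreted by the Condor-linked executable itself, telling it where to write and read its checkpoint image.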
User Job Stats
I created this script to automate some monthly reporting I was required to do. Its features are:
- Display figures for the week, month or year to date, or for the specified number of days
- Display total cluster usage, globally and for each user
- Display average cluster usage, for each user and the usage of the 'average' job across the whole cluster
It has a help function with usage examples, but if anyone needs pointers feel free to drop me a line.
NB: To determine which users have actually logged on to the cluster, it gets a list of home directories from the path specified by 'HOME_DIR_LOCATION'. You may need to alter the value of this variable to suit your system for the script to work.
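If home directories are not all gathered under one path, one possible alternative (an untested sketch, assuming your users have been registered with qconf -auser) is to take the user list from Grid Engine's own user database instead:

# Hypothetical replacement for the get_userlist body
USER_LIST=`qconf -suserl`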
#!/bin/bash
#
#==========================================================================
# Name:                User SGE Job Statistics
# Author:              Chris Bingham
# Date:                11.11.2008
# Language:            Bash
# External References: ls, qacct, grep, cut, expr, getent, tr, date, qconf
#
# This script will use the SGE command 'qacct' to search back through all the
# job records for the specified number of days, and return a table of
# statistics for each user's usage of the cluster, sorted highest to lowest.
#==========================================================================

# Global Variables
HOME_DIR_LOCATION="/home"

# Store the first two arguments, all other arguments will be discarded
DAYS=$1
TOTAL=$2

function display_help() {
    # Display a help message
    echo "---User SGE Job Statistics---"
    echo "This script will use the SGE command 'qacct' to search back through all the"
    echo "job records for the specified number of days, and return a table of statistics"
    echo "for each user's usage of the cluster, sorted highest to lowest."
    echo ""
    echo "--Usage--"
    echo -e " user_job_stats.sh [DAYS|OPTION] [total]"
    echo -e " Where 'DAYS' is a number of days to gather statistics for or 'OPTION' is one of the following;"
    echo -e " \tweek\t\t\tGather statistics for the week so far"
    echo -e " \tmonth\t\t\tGather statistics for the month to date"
    echo -e " \tyear\t\t\tGather statistics for the year to date"
    echo -e " \thelp\t\t\tDisplay this message"
    echo -e " Specifying 'total' at the end of the line will generate total usage statistics rather than average usage statistics"
    echo ""
    echo "--Usage Examples--"
    echo -e " user_job_stats.sh 10\t\tGet average statistics for the last 10 days"
    echo -e " user_job_stats.sh 10 total\tGet total statistics for the last 10 days"
    echo -e " user_job_stats.sh month\tGet average statistics for the month to date"
    echo -e " user_job_stats.sh week total\tGet total statistics for the week to date"
    echo ""
    echo "--Definitions--"
    echo -e " CPU Time\t\t\tThe amount of time for which jobs were using CPU resources, measured using SGE's 'cpu' metric from 'qacct'"
    echo -e " Wallclock Time\t\tThe amount of time for which jobs were running."
    echo -e " % of Wallclock Time\t\tThe percentage of the cluster's overall total available wallclock time used."
    echo -e " \t\t\t\tTotal available wallclock time is calculated as: DAYS * Number of CPUs in Cluster"
}

function averages_or_totals() {
    # Determine if the user requested totals or averages
    if [ "$TOTAL" = "total" ] ; then
        gen_totals
    elif [ -z "$TOTAL" ] ; then
        gen_averages
    else
        display_help
    fi
}

function gen_human_readable_time() {
    # Convert a time in seconds into a more human-friendly scale
    # (hours, days etc instead of seconds)
    # If the time span is less than 1 hour, convert to minutes
    if [ "$TIME" -lt "3600" ] ; then
        TIME=`echo "scale=2; $TIME/60" | bc`
        TIME="$TIME minutes"
    # If the time span is less than 1 day, convert to hours
    elif [ "$TIME" -lt "86400" ] ; then
        TIME=`echo "scale=2; $TIME/3600" | bc`
        TIME="$TIME hours"
    # If the time span is less than 1 week, convert to days
    elif [ "$TIME" -lt "604800" ] ; then
        TIME=`echo "scale=2; $TIME/86400" | bc`
        TIME="$TIME days"
    # If the time span is less than 1 year, convert to weeks
    elif [ "$TIME" -lt "31449600" ] ; then
        TIME=`echo "scale=2; $TIME/604800" | bc`
        TIME="$TIME weeks"
    # If the time span is 1 year or more, convert to years
    else
        TIME=`echo "scale=2; $TIME/31449600" | bc`
        TIME="$TIME years"
    fi
}

function get_userlist() {
    # Get a list of usernames by listing the home directories
    USER_LIST=`ls $HOME_DIR_LOCATION`
}

function calc_cpu_wallclock() {
    # Calculate the total amount of available cluster time during the
    # specified number of days
    # Get a count of the number of execution hosts from SGE
    CPU_COUNT=`qconf -sep | grep -i sum | tr -d [:alpha:][:space:]`
    # Calculate the total amount of wallclock for all nodes in seconds
    CPU_WALLCLOCK=$(($CPU_COUNT * $DAYS * 24 * 60 * 60))
    # Calculate 1% of the total cluster wallclock (for later use)
    CPU_WALLCLOCK_100=`expr $CPU_WALLCLOCK / 100`
}

function gen_totals() {
    # Call 'get_userlist' to get a user name list
    get_userlist
    # Get the total amount of available cluster time during the specified number of days
    calc_cpu_wallclock
    # Reset the total counters to zero
    TOTAL_PERCENT_CPU_WALLCLOCK="0"
    TOTAL_CPU_WALLCLOCK="0"
    TOTAL_CLUSTER_CPUTIME="0"
    TOTAL_JOB_COUNT="0"
    # For each username found, do the following;
    for i in $USER_LIST ; do
        # Use 'qacct' and 'grep' to count the total number of jobs they've submitted
        USER_USER_JOB_COUNT=`qacct -o "$i" -j -d $DAYS | grep "jobname" -c`
        TOTAL_JOB_COUNT=`expr $TOTAL_JOB_COUNT + $USER_USER_JOB_COUNT`
        # If the user has submitted no jobs, record their utilisation as zero,
        # else, use 'qacct', 'grep' and 'cut' to get the wallclock and CPU time
        # counters for all of their jobs
        if [ "$USER_USER_JOB_COUNT" = "0" ] ; then
            USER_TOTAL_WALLCLOCK="0"
            USER_TOTAL_CPUTIME="0"
            USER_PERCENT_CPU_WALLCLOCK="0"
        else
            USER_TOTAL_WALLCLOCK=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 2`
            USER_TOTAL_CPUTIME=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 5`
            USER_PERCENT_CPU_WALLCLOCK=`echo "scale=2; $USER_TOTAL_WALLCLOCK/$CPU_WALLCLOCK_100" | bc`
        fi
        if [ -z "$USER_TOTAL_WALLCLOCK" ] ; then
            USER_TOTAL_WALLCLOCK="0"
            USER_PERCENT_CPU_WALLCLOCK="0"
        fi
        if [ -z "$USER_TOTAL_CPUTIME" ] ; then
            USER_TOTAL_CPUTIME="0"
            USER_PERCENT_CPU_WALLCLOCK="0"
        fi
        # Add this user's percentage of cluster time consumed to the total
        TOTAL_PERCENT_CPU_WALLCLOCK=`echo "scale=2; $TOTAL_PERCENT_CPU_WALLCLOCK+$USER_PERCENT_CPU_WALLCLOCK" | bc`
        TOTAL_CPU_WALLCLOCK=`echo "scale=2; $TOTAL_CPU_WALLCLOCK+$USER_TOTAL_WALLCLOCK" | bc`
        TOTAL_CLUSTER_CPUTIME=`echo "scale=2; $TOTAL_CLUSTER_CPUTIME+$USER_TOTAL_CPUTIME" | bc`
        # Convert the user's username into their actual name
        USERNAME=`getent passwd "$i" | cut -d ":" -f 5`
        # If the passwd file didn't contain the user's actual name, use their username
        if [ -z "$USERNAME" ] ; then
            USERNAME="$i"
        fi
        # Start the output string
        OUT="$OUT$USER_PERCENT_CPU_WALLCLOCK \t\t\t"
        # If the total wallclock and CPU time are less than a million, add an extra tab
        # after them to improve the readability of the output table
        if [ "$USER_TOTAL_WALLCLOCK" -ge "1000000" ] ; then
            OUT="$OUT$USER_TOTAL_WALLCLOCK \t\t"
        else
            OUT="$OUT$USER_TOTAL_WALLCLOCK \t\t\t"
        fi
        if [ "$USER_TOTAL_CPUTIME" -ge "1000000" ] ; then
            OUT="$OUT$USER_TOTAL_CPUTIME \t$USER_USER_JOB_COUNT \t\t\t$USERNAME \n"
        else
            OUT="$OUT$USER_TOTAL_CPUTIME \t\t$USER_USER_JOB_COUNT \t\t\t$USERNAME \n"
        fi
        # Clean up all variables for the next loop
        USER_USER_JOB_COUNT=""
        USER_TOTAL_WALLCLOCK=""
        USER_TOTAL_CPUTIME=""
        USER_PERCENT_CPU_WALLCLOCK=""
        USERNAME=""
    done
    # Output the results table, performing a reverse-numerical sort on the results
    echo -e "% of Wallclock Time\tTotal Wallclock Time\tTotal CPU Time\tTotal Number of Jobs\tUser Name"
    echo "-------------------------------------------------------------------------------------------------"
    echo -e $OUT | sort -nr
    echo "-------------------------------------------------------------------------------------------------"
    # Calculate the percentage of time the cluster was idle for
    CLUSTER_IDLE=`echo "scale=2; 100-$TOTAL_PERCENT_CPU_WALLCLOCK" | bc`
    # Convert the total wallclock time into a human-readable form
    TIME="$TOTAL_CPU_WALLCLOCK"
    gen_human_readable_time
    TOTAL_CPU_WALLCLOCK="$TIME"
    # Convert the total CPU time into a human-readable form
    TIME="$TOTAL_CLUSTER_CPUTIME"
    gen_human_readable_time
    TOTAL_CLUSTER_CPUTIME="$TIME"
    # Output overall cluster utilisation statistics
    echo "$TOTAL_JOB_COUNT jobs used $TOTAL_CPU_WALLCLOCK of Wallclock Time and $TOTAL_CLUSTER_CPUTIME of CPU Time during the last $DAYS days, and the cluster was $CLUSTER_IDLE% idle."
}

function gen_averages() {
    # Call 'get_userlist' to get a user name list
    get_userlist
    # Initialise the overall totals ('expr' fails on empty strings)
    TOTAL_JOB_COUNT="0"
    TOTAL_JOB_WALLCLOCK="0"
    TOTAL_JOB_CPUTIME="0"
    # For each username found, do the following;
    for i in $USER_LIST ; do
        # Use 'qacct' and 'grep' to count the total number of jobs they've submitted
        USER_JOB_COUNT=`qacct -o "$i" -j -d $DAYS | grep "jobname" -c`
        # Add the user's job count to the overall total
        TOTAL_JOB_COUNT=`expr $TOTAL_JOB_COUNT + $USER_JOB_COUNT`
        # If the user has submitted no jobs, record their utilisation as zero,
        # else, use 'qacct', 'grep' and 'cut' to get an array of the wallclock and CPU time
        # counters for all of their jobs
        if [ "$USER_JOB_COUNT" = "0" ] ; then
            USER_JOB_WALLCLOCKS="0"
            USER_JOB_CPUTIMES="0"
        else
            USER_JOB_WALLCLOCKS=`qacct -o "$i" -j -d $DAYS | grep "wallclock " | cut -d " " -f 2`
            USER_JOB_CPUTIMES=`qacct -o "$i" -j -d $DAYS | grep "cpu " | cut -c 14-100`
        fi
        # Set 'COUNT' and the user's running total to zero
        COUNT="0"
        USER_TOTAL_WALLCLOCK="0"
        # If the user's utilisation isn't zero, calculate their average wallclock time
        if [ "$USER_JOB_WALLCLOCKS" != "0" ] ; then
            for a in $USER_JOB_WALLCLOCKS ; do
                USER_TOTAL_WALLCLOCK=`expr $USER_TOTAL_WALLCLOCK + $a`
                COUNT=`expr $COUNT + 1`
            done
            # Add the user's total wallclock time to the overall total
            TOTAL_JOB_WALLCLOCK=`expr $TOTAL_JOB_WALLCLOCK + $USER_TOTAL_WALLCLOCK`
            USER_AVG_JOB_WALLCLOCK=`expr $USER_TOTAL_WALLCLOCK / $COUNT`
        else
            USER_AVG_JOB_WALLCLOCK="0"
        fi
        # Set 'COUNT' and the user's running total to zero
        COUNT="0"
        USER_TOTAL_CPUTIME="0"
        # If the user's utilisation isn't zero, calculate their average CPU time
        if [ "$USER_JOB_CPUTIMES" != "0" ] ; then
            for b in $USER_JOB_CPUTIMES ; do
                USER_TOTAL_CPUTIME=`expr $USER_TOTAL_CPUTIME + $b`
                COUNT=`expr $COUNT + 1`
            done
            # Add the user's total CPU time to the overall total
            TOTAL_JOB_CPUTIME=`expr $TOTAL_JOB_CPUTIME + $USER_TOTAL_CPUTIME`
            USER_AVG_JOB_CPUTIME=`expr $USER_TOTAL_CPUTIME / $COUNT`
        else
            USER_AVG_JOB_CPUTIME="0"
        fi
        # Convert the user's username into their actual name
        USERNAME=`getent passwd "$i" | cut -d ":" -f 5`
        # If the passwd file didn't contain the user's actual name, use their username
        if [ -z "$USERNAME" ] ; then
            USERNAME="$i"
        fi
        # If the average wallclock and CPU time are less than a million, add an extra tab
        # after them to improve the readability of the output table
        if [ "$USER_AVG_JOB_WALLCLOCK" -ge "1000000" ] ; then
            OUT="$OUT$USER_AVG_JOB_WALLCLOCK \t\t\t"
        else
            OUT="$OUT$USER_AVG_JOB_WALLCLOCK \t\t\t\t"
        fi
        if [ "$USER_AVG_JOB_CPUTIME" -ge "1000000" ] ; then
            OUT="$OUT$USER_AVG_JOB_CPUTIME \t\t$USER_JOB_COUNT \t\t\t$USERNAME \n"
        else
            OUT="$OUT$USER_AVG_JOB_CPUTIME \t\t\t$USER_JOB_COUNT \t\t\t$USERNAME \n"
        fi
        # Clean up all variables for the next loop
        USER_JOB_COUNT=""
        USER_JOB_WALLCLOCKS=""
        USER_JOB_CPUTIMES=""
        COUNT=""
        USER_TOTAL_WALLCLOCK=""
        USER_TOTAL_CPUTIME=""
        USER_AVG_JOB_WALLCLOCK=""
        USER_AVG_JOB_CPUTIME=""
        USERNAME=""
    done
    # Output the results table, performing a reverse-numerical sort on the results
    echo -e "Average Wallclock Time/Job\tAverage CPU Time/Job\tTotal Number of Jobs\tUser Name"
    echo "-------------------------------------------------------------------------------------------"
    echo -e $OUT | sort -nr
    echo "-------------------------------------------------------------------------------------------"
    # If the total job count is 0, record all overall averages as 0, otherwise calculate them
    if [ "$TOTAL_JOB_COUNT" = "0" ] ; then
        TOTAL_AVG_JOB_WALLCLOCK="0"
        TOTAL_AVG_JOB_CPUTIME="0"
    else
        TOTAL_AVG_JOB_WALLCLOCK=`expr $TOTAL_JOB_WALLCLOCK / $TOTAL_JOB_COUNT`
        TOTAL_AVG_JOB_CPUTIME=`expr $TOTAL_JOB_CPUTIME / $TOTAL_JOB_COUNT`
    fi
    # Calculate average job idle time
    TOTAL_AVG_JOB_IDLE=`echo "scale=2; 100-(($TOTAL_AVG_JOB_CPUTIME/$TOTAL_AVG_JOB_WALLCLOCK)*100)" | bc`
    # Convert the average wallclock time into a human-readable form
    TIME="$TOTAL_AVG_JOB_WALLCLOCK"
    gen_human_readable_time
    TOTAL_AVG_JOB_WALLCLOCK="$TIME"
    # Convert the average CPU time into a human-readable form
    TIME="$TOTAL_AVG_JOB_CPUTIME"
    gen_human_readable_time
    TOTAL_AVG_JOB_CPUTIME="$TIME"
    # Output the overall average job statistics
    echo "The average job during the last $DAYS days took $TOTAL_AVG_JOB_WALLCLOCK to complete, consumed $TOTAL_AVG_JOB_CPUTIME of CPU Time and was idle for $TOTAL_AVG_JOB_IDLE% of the time."
}

# Check if the first argument was null, and display help and exit if so
if [ -z "$DAYS" ] ; then
    display_help
    exit
else
    # Else, select from the following options
    case "$DAYS" in
        "week")
            # If 'week' is specified, determine how many days into the week we are, starting from Monday
            DAYS=`date +%u`
            averages_or_totals
            ;;
        "month")
            # If 'month' is specified, determine how many days into the month we are
            DAYS=`date +%d`
            averages_or_totals
            ;;
        "year")
            # If 'year' is specified, determine how many days into the year we are
            DAYS=`date +%j`
            averages_or_totals
            ;;
        (*[0-9])
            # Otherwise, if the input contains numbers, trim out all non-numeric characters and continue
            DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
            averages_or_totals
            ;;
        "help")
            # If 'help' is specified, display the help message and exit
            display_help
            exit
            ;;
        *)
            # If the input is anything else, display the help message and exit
            display_help
            exit
            ;;
    esac
fi
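For example (these invocations come straight from the built-in help), total statistics for the month to date and average statistics for the last 10 days:

./user_job_stats.sh month total
./user_job_stats.sh 10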
qtime
This script was written to allow people to get an idea of how long their job might have to wait to be executed at any given time. It iterates through all completed jobs from the given time period (week, month or year to date, or a number of days) and calculates the minimum, maximum and average wait times.
On one system I've worked on the average wait time over a month was used as a measurement of system performance in the SLA, with this script providing the figures.
Again, it has a help function with usage examples, but if anyone needs pointers feel free to drop me a line.
#!/bin/bash
#
#==========================================================================
# Name:                Average Queuing Time
# Author:              Chris Bingham
# Date:                12.02.2009
# Language:            Bash
# External References: qacct, grep, cut, bc
#
# This script will calculate the average time jobs have had to spend
# queuing before being run over the specified time period
#==========================================================================

# Store the first argument
DAYS="$1"

function display_help() {
    # Display a help message
    echo "---Average Queuing Time---"
    echo "This script will calculate the average time jobs have had to spend"
    echo "queuing before being run over the specified time period"
    echo ""
    echo "--Usage--"
    echo -e " qtime.sh [DAYS|OPTION]"
    echo -e " Where 'DAYS' is a number of days to calculate the average for or 'OPTION' is one of the following;"
    echo -e " \tweek\t\t\tCalculate the average for the week so far"
    echo -e " \tmonth\t\t\tCalculate the average for the month to date"
    echo -e " \tyear\t\t\tCalculate the average for the year to date"
    echo -e " \thelp\t\t\tDisplay this message"
}

function gen_human_readable_time() {
    # Convert a time in seconds into a more human-friendly scale
    # (hours, days etc instead of seconds)
    # If the time span is less than 1 minute, leave it in seconds
    if [ "$TIME_INT" -lt "60" ] ; then
        TIME="$TIME seconds"
    # If the time span is less than 1 hour, convert to minutes
    elif [ "$TIME_INT" -lt "3600" ] ; then
        TIME=`echo "scale=2; $TIME/60" | bc`
        TIME="$TIME minutes"
    # If the time span is less than 1 day, convert to hours
    elif [ "$TIME_INT" -lt "86400" ] ; then
        TIME=`echo "scale=2; $TIME/3600" | bc`
        TIME="$TIME hours"
    # If the time span is less than 1 week, convert to days
    elif [ "$TIME_INT" -lt "604800" ] ; then
        TIME=`echo "scale=2; $TIME/86400" | bc`
        TIME="$TIME days"
    # If the time span is less than 1 year, convert to weeks
    elif [ "$TIME_INT" -lt "31449600" ] ; then
        TIME=`echo "scale=2; $TIME/604800" | bc`
        TIME="$TIME weeks"
    # If the time span is 1 year or more, convert to years
    else
        TIME=`echo "scale=2; $TIME/31449600" | bc`
        TIME="$TIME years"
    fi
}

function calc_avg() {
    # Set the field separator for array creation to a new line
    OLDIFS=$IFS
    IFS=$'\n'
    # Get information from SGE using the 'qacct' command, storing submit and start times in arrays
    USER_JOB_COUNT=`qacct -j -d $DAYS | grep "jobname" -c`
    USER_SUBMIT_TIMES=($(qacct -j -d $DAYS | grep "qsub_time" | cut -d " " -f 5-9))
    USER_START_TIMES=($(qacct -j -d $DAYS | grep "start_time" | cut -d " " -f 4-9))
    # Get the length of one of the arrays
    USER_SUBMIT_TIMES_COUNT=${#USER_SUBMIT_TIMES[@]}
    # Reset the field separator to its previous value
    IFS=$OLDIFS
    # Create variables to store min and max wait times
    MIN_WAIT_TIME=""
    MAX_WAIT_TIME=""
    # Determine if any jobs have been completed over the specified time period
    if [ "$USER_SUBMIT_TIMES_COUNT" -gt "0" ] ; then
        # If yes, then calculate the average wait time
        # For each element in the arrays, do the following;
        for (( i=0; i<${USER_SUBMIT_TIMES_COUNT}; i++ )) ; do
            # Convert the submit and start time to seconds since the epoch
            SUBMIT_SECONDS=`date -d "${USER_SUBMIT_TIMES[$i]}" +%s`
            START_SECONDS=`date -d "${USER_START_TIMES[$i]}" +%s`
            # Calculate how long the job was queuing, and add this to the total queuing time
            WAIT_TIME=$(($START_SECONDS-$SUBMIT_SECONDS))
            TOTAL_WAIT_TIME=$((TOTAL_WAIT_TIME+$WAIT_TIME))
            if [ -z "$MIN_WAIT_TIME" ] ; then
                MIN_WAIT_TIME=$WAIT_TIME
                MAX_WAIT_TIME=$WAIT_TIME
            else
                if [ "$MIN_WAIT_TIME" -gt "$WAIT_TIME" ] ; then
                    MIN_WAIT_TIME=$WAIT_TIME
                fi
                if [ "$MAX_WAIT_TIME" -lt "$WAIT_TIME" ] ; then
                    MAX_WAIT_TIME=$WAIT_TIME
                fi
            fi
            # Reset all variables for the next iteration of the loop
            WAIT_TIME=""
            SUBMIT_SECONDS=""
            START_SECONDS=""
        done
        # Calculate the average queuing time as both an integer and floating point number
        AVG_WAIT_TIME=`echo "scale=2; $TOTAL_WAIT_TIME/$USER_SUBMIT_TIMES_COUNT" | bc`
        AVG_WAIT_TIME_INT=$(($TOTAL_WAIT_TIME/$USER_SUBMIT_TIMES_COUNT))
        TIME_INT=$AVG_WAIT_TIME_INT
        TIME=$AVG_WAIT_TIME
        gen_human_readable_time
        AVG_WAIT_TIME=$TIME
        TIME_INT=$MIN_WAIT_TIME
        TIME=$MIN_WAIT_TIME
        gen_human_readable_time
        MIN_WAIT_TIME=$TIME
        TIME_INT=$MAX_WAIT_TIME
        TIME=$MAX_WAIT_TIME
        gen_human_readable_time
        MAX_WAIT_TIME=$TIME
        # Display the average queuing time
        echo ""
        echo "During the last $DAYS days, jobs had to queue (wait to be run) for;"
        echo -e "\tOn average:\t$AVG_WAIT_TIME"
        echo -e "\tAt least:\t$MIN_WAIT_TIME"
        echo -e "\tAt most:\t$MAX_WAIT_TIME"
        echo ""
    else
        # If no, then display the wait time as 0 seconds
        echo ""
        echo "During the last $DAYS days, jobs had to queue (wait to be run) for;"
        echo -e "\tOn average:\t0 seconds"
        echo -e "\tAt least:\t0 seconds"
        echo -e "\tAt most:\t0 seconds"
        echo ""
    fi
}

# Check first argument
case "$DAYS" in
    "week")
        # If 'week' is specified, determine how many days into the week we are, starting from Monday
        DAYS=`date +%u`
        calc_avg
        ;;
    "month")
        # If 'month' is specified, determine how many days into the month we are
        DAYS=`date +%d`
        calc_avg
        ;;
    "year")
        # If 'year' is specified, determine how many days into the year we are
        DAYS=`date +%j`
        calc_avg
        ;;
    (*[0-9])
        # If the input contains numbers, trim out all non-numeric characters and continue
        DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
        calc_avg
        ;;
    "help")
        # If 'help' is specified, display the help message and exit
        display_help
        exit
        ;;
    *)
        # If the input is anything else, display the help message and exit
        display_help
        exit
        ;;
esac
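A sample run, with hypothetical figures shown only to illustrate the output format:

$ ./qtime.sh month

During the last 17 days, jobs had to queue (wait to be run) for;
        On average:     2.50 hours
        At least:       1.20 minutes
        At most:        1.10 days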
Queue Job Count
This one's quite simple - it determines what queues exist on the cluster, then counts up how many jobs have been submitted to each one over the specified period (week, month or year to date, or a number of days).
#!/bin/bash
#
#==========================================================================
# Name:                SGE Queue Job Count
# Author:              Chris Bingham
# Date:                28.11.2008
# Language:            Bash
# External References: qconf, qacct, grep, date, tr
#
# This script will use the SGE command 'qconf' to get a list of queues
# configured on the cluster, then use 'qacct' to search back through job
# records for each queue for the specified number of days, and return a
# table of job counts for each queue, sorted highest to lowest
#==========================================================================

# Store the first argument, all other arguments will be discarded
DAYS=$1

# Use 'qconf' to get a list of queues, and convert it to an array
QUEUE_LIST=`qconf -sql`
QUEUE_LIST=`echo $QUEUE_LIST | tr -t ' ' " "`

# Create a variable to store the total job count
TOTAL_JOB_COUNT="0"

function get_job_count() {
    # For each queue found, do the following;
    for q in `echo -e $QUEUE_LIST` ; do
        # Use 'qacct' and 'grep' to get a count of the number of jobs for the
        # specified time period
        QUEUE_JOB_COUNT=`qacct -d $DAYS -q $q -j | grep "qname $q" -c`
        # Add this to the total job count
        TOTAL_JOB_COUNT=$(($TOTAL_JOB_COUNT+$QUEUE_JOB_COUNT))
        # Store the results for later output
        OUT="$OUT$QUEUE_JOB_COUNT\t\t$q\n"
    done
    # Output the results table, performing a reverse-numerical sort on the results
    echo -e "Job Count\tQueue"
    echo "--------------------------"
    echo -e $OUT | sort -nr
    echo "Total Job Count: $TOTAL_JOB_COUNT"
}

function display_help() {
    # Display a help message
    echo "---SGE Queue Job Count---"
    echo "This script will use the SGE command 'qacct' to search back through all the"
    echo "job records for the specified number of days, and return a table of the"
    echo "job counts for each queue configured on the system."
    echo ""
    echo "Usage: q_job_count.sh [DAYS|OPTION]"
    echo "Where 'DAYS' is a number of days to gather statistics for or 'OPTION' is one of the following;"
    echo -e "\tweek\tGather statistics for the week so far"
    echo -e "\tmonth\tGather statistics for the month to date"
    echo -e "\tyear\tGather statistics for the year to date"
    echo -e "\thelp\tDisplay this message"
}

# Check if the first argument was null, and display help and exit if so
if [ -z "$DAYS" ] ; then
    display_help
    exit
else
    # Else, select from the following options
    case "$DAYS" in
        "week")
            # If 'week' is specified, determine how many days into the week we are, starting from Monday
            DAYS=`date +%u`
            get_job_count
            ;;
        "month")
            # If 'month' is specified, determine how many days into the month we are
            DAYS=`date +%d`
            get_job_count
            ;;
        "year")
            # If 'year' is specified, determine how many days into the year we are
            DAYS=`date +%j`
            get_job_count
            ;;
        (*[0-9])
            # Otherwise, if the input contains numbers, trim out all non-numeric characters and continue
            DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
            get_job_count
            ;;
        "help")
            # If 'help' is specified, display the help message and exit
            display_help
            exit
            ;;
        *)
            # If the input is anything else, display the help message and exit
            display_help
            exit
            ;;
    esac
fi
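A sample run, again with hypothetical figures:

$ ./q_job_count.sh week
Job Count       Queue
--------------------------
342             all.q
17              test.q
Total Job Count: 359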
Change Queue State
A very short script that will either enable or disable all queue instances on the host it's run on. I've found it useful for quickly knocking out the queues on nodes that are being taken down for maintenance.
#!/bin/bash
#
#==========================================================================
# Name:                Change SGE Queue Instance States
# Author:              Chris Bingham
# Date:                28.11.2008
# Language:            Bash
# External References: qselect, qmod, grep, tr
#
# This script will, based on the argument supplied, either enable or disable
# all queue instances on the current host
#==========================================================================

# Store the first argument, all other arguments will be discarded
ACTION=$1

# Convert any uppercase letters to lowercase, for the case statement below
ACTION=`echo $ACTION | tr -t [:upper:] [:lower:]`

# Determine what action to take based on the argument supplied
case "$ACTION" in
    "enable")
        # If the argument was 'enable', use 'qselect' to select all queue
        # instances on the current host and then enable them using 'qmod'
        qmod -e `qselect -q "*@$HOSTNAME"`
        ;;
    "disable")
        # If the argument was 'disable', use 'qselect' to select all queue
        # instances on the current host and then disable them using 'qmod'
        qmod -d `qselect -q "*@$HOSTNAME"`
        ;;
    *)
        # If the argument was anything else, display an error message
        echo "Invalid option: please enter either 'enable' or 'disable'"
        ;;
esac
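Typical usage on a node about to go down, assuming the script is saved as change_queue_state.sh:

./change_queue_state.sh disable
# ... perform the maintenance work, then:
./change_queue_state.sh enable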
filter-accounting
A small Perl script to filter the accounting file by the end_time of the jobs. Mostly useful for splitting up an accounting file by years, for example.
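The original script is not reproduced here, but the idea is simple. A minimal awk sketch of the same filter, assuming the standard colon-separated accounting file format where end_time (in epoch seconds) is the 11th field (see accounting(5)), would keep only jobs that ended during 2008:

# Hypothetical one-liner; 1199145600/1230768000 are the epoch bounds of 2008 (UTC)
awk -F: '$11 >= 1199145600 && $11 < 1230768000' $SGE_ROOT/default/common/accounting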
Other Miscellaneous
Some others, partially overlapping with some above are listed under http://www.nw-grid.ac.uk/LivScripts.