Using Ganglia As Load Sensor

From GridWiki

Many Grid Engine users run the Ganglia cluster monitoring system on their cluster or grid. The script at the bottom of this page is a load sensor that converts the XML output of Ganglia's gmond daemon into the format Grid Engine expects from a load sensor.
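
To make that conversion concrete, the sketch below runs the same sed expressions used by the script at the bottom of the page over a small hand-written sample instead of a live gmond connection. The sample XML is heavily abridged, its host names, addresses, and values are made up, and the function names are only for illustration.

```shell
#!/bin/sh
# Hand-written, abridged sample of gmond XML (an assumption, not real output).
sample_gmond_xml() {
  cat <<'EOF'
<GANGLIA_XML VERSION="3.0.0" SOURCE="gmond">
<CLUSTER NAME="example" OWNER="unspecified">
<HOST NAME="node01" IP="10.0.0.1" REPORTED="0">
<METRIC NAME="os_release" VAL="2.6.9-11.ELsmp" TYPE="string" UNITS=""/>
<METRIC NAME="load_one" VAL="2.02" TYPE="float" UNITS=""/>
</HOST>
<HOST NAME="node02" IP="10.0.0.2" REPORTED="0">
<METRIC NAME="load_one" VAL="0.10" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>
EOF
}

# Convert gmond XML on stdin into host:name:value lines for host $1.
gmond_xml_to_load() {
  host=$1
  # 1. drop everything outside this host's <HOST>...</HOST> block
  # 2. replace the start of each METRIC element with "host:"
  # 3. drop the opening and closing HOST lines themselves
  # 4. turn the NAME/VAL attribute boundary into ":"
  # 5. strip the remaining attributes
  sed \
    -e "/HOST.*${host}/,/HOST>/ ! d" \
    -e "s/^.*<METRIC NAME=\"/${host}:/g" \
    -e '/HOST/ d' \
    -e 's/" VAL="/:/g' \
    -e 's/" TYPE.*>//g'
}

sample_gmond_xml | gmond_xml_to_load node01
```

For the sample above this prints node01:os_release:2.6.9-11.ELsmp and node01:load_one:2.02, one per line; the node02 metric is dropped because it belongs to a different host.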

Most of the load parameters reported by gmond duplicate values that Grid Engine already provides, though under different names. The main advantage of this load sensor is Ganglia's extensibility: custom metrics published with its gmetric program can be passed on to Grid Engine as well.
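
As an illustration of that extensibility, here is a minimal sketch of a script that publishes a custom metric with gmetric. The metric name scratch_free_mb, the use of /tmp as the scratch area, and the helper function names are assumptions made for this example; only the --name, --value, --type, and --units options of gmetric are relied on.

```shell
#!/bin/sh
# Sketch: publish free space under /tmp as a custom Ganglia metric.
# "scratch_free_mb" and the choice of /tmp are assumptions for illustration.

# Set GMETRIC=echo to print the arguments instead of sending anything.
GMETRIC=${GMETRIC:-gmetric}

# Print the free space of the filesystem holding $1, in whole megabytes.
free_mb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# $1 = metric name, $2 = value, $3 = gmetric type, $4 = units
publish_metric() {
  "$GMETRIC" --name="$1" --value="$2" --type="$3" --units="$4"
}

if command -v "$GMETRIC" >/dev/null 2>&1; then
  publish_metric scratch_free_mb "$(free_mb /tmp)" uint32 MB
else
  echo "gmetric not found; nothing published" >&2
fi
```

A script like this would typically be run periodically on each node, for example from cron. For Grid Engine to use the value, a complex with the same name would presumably still have to be defined, in the same way as the os_release example on this page.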

  • Save the script at the bottom of the page to a directory accessible to your slave nodes.
  • Assign it as the load sensor script. To use it on all nodes, replace node01 below with global.
qconf -mconf node01
 ...
 load_sensor    /path/to/gmond_load_sensor.sh
 ...
  • Add an item to the complex configuration. The definition below is an example that reports os_release.
qconf -mc
 ...
 #name               shortcut      type        relop requestable consumable default  urgency
 #-------------------------------------------------------------------------------------------
 os_release          os_release    CSTRING     ==    YES         NO         NONE     0
 ...
  • Test that the new value is reported.
qhost -F os_release
 ...
 node01                  lx24-amd64      2  2.02    2.0M  254.7M    2.0M  144.0K
    Host Resource(s):      hl:os_release=2.6.9-11.ELsmp
 ...
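
When many hosts report the value, the qhost output is easier to check once it is reduced to one line per host. The following is a minimal sketch that assumes the output layout shown in the test step above, with host names starting in the first column; the function name hosts_by_release is made up for illustration.

```shell
#!/bin/sh
# Reduce "qhost -F os_release" output on stdin to "host release" pairs.
hosts_by_release() {
  awk '
    # An unindented line starts a new host record; remember its name.
    /^[^ \t]/ { host = $1; next }
    # An indented resource line carries the value for the current host.
    /hl:os_release=/ { sub(/.*hl:os_release=/, ""); print host, $0 }
  '
}

# Sample taken from the test step; on a live cluster use instead:
#   qhost -F os_release | hosts_by_release
hosts_by_release <<'EOF'
node01                  lx24-amd64      2  2.02    2.0M  254.7M    2.0M  144.0K
    Host Resource(s):      hl:os_release=2.6.9-11.ELsmp
EOF
```

For the sample this prints node01 2.6.9-11.ELsmp. Hosts that do not report os_release simply produce no line, which makes a missing load sensor easy to spot.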

Because the added complex is requestable, you can include it in a job submission, for example: qsub -l os_release=2.6.9-11.ELsmp myjob.sh

gmond_load_sensor.sh

#!/bin/sh
#
# (c) 2007 Tim Cera

# GPLv2

# Example load sensor script that uses gmond information from the Ganglia
# project.

#
# Be careful: Load sensor scripts are started with root permissions.
# In an admin_user system euid=0 and uid=admin_user
#

# Change to the port configured in /etc/gmond.conf.  Ganglia default is 8649.
ganglia_port=8649

telnet=`which telnet`

PATH=/bin:/usr/bin

ARCH=`$SGE_ROOT/util/arch`
HOST=`$SGE_ROOT/utilbin/$ARCH/gethostname -name`

end=false
while [ $end = false ]; do

  # ----------------------------------------
  # wait for an input
  #
  read input
  result=$?
  if [ $result != 0 ]; then
     end=true
     break
  fi

  if [ "$input" = "quit" ]; then
     end=true
     break
  fi

  # ----------------------------------------
  # send mark for begin of load report
  echo "begin"
  xml2load=`$telnet localhost $ganglia_port 2> /dev/null | \
            sed \
                -e "/HOST.*${HOST}/,/HOST>/ ! d" \
                -e "s/^.*<METRIC NAME=\"/${HOST}:/g" \
                -e '/HOST/ d' \
                -e 's/" VAL="/:/g' \
                -e 's/" TYPE.*>//g' \
                `
  # To map Ganglia metric names onto Grid Engine complex names, add further
  # sed expressions to the command above, for example:
  #             -e 's/load_one/load_short/g' \
  # Alternatively, add a complex named 'load_one'.

  # ----------------------------------------
  # send load information
  for line in $xml2load
  do
    echo $line
  done

  # ----------------------------------------
  # send mark for end of load report
  echo "end"

done
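
Before registering the sensor with qconf, it can help to check its output by hand. The main loop above shows the exchange: each line read on standard input triggers one report framed by begin and end, and quit stops the sensor. The sketch below checks that framing for a single captured report; check_report is a hypothetical helper, not part of Grid Engine, and the sample report is made up.

```shell
#!/bin/sh
# Succeed if stdin holds one load report: "begin", then zero or more
# host:name:value lines, then "end".
check_report() {
  awk -F: '
    NR == 1 && $0 != "begin"        { bad = 1 }  # first line must be "begin"
    NR > 1 && $0 != "end" && NF < 3 { bad = 1 }  # data lines need three fields
    { last = $0 }
    END { if (NR < 2 || last != "end" || bad) exit 1 }
  '
}

# Made-up sample report; see the note below for checking a real one.
if printf 'begin\nnode01:load_one:2.02\nend\n' | check_report; then
  echo "report framing looks valid"
fi
```

To check a real report, one option is to run the sensor in a terminal with its output copied to a file (for example by piping it through tee), press Enter once, type quit, and then feed the file to check_report.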