GridEngine XML

From GridWiki

HTML Schema documentation for qstat XML schemas

Grid Engine 6.x distributions include a "util/resources/schemas/qstat/" directory that currently contains the following files:

* qstat.xsd
* message.xsd
* detailed_job_info.xsd

These are the best references currently available for understanding SGE's XML output in depth. They are, however, somewhat cryptic to read. Passing the .xsd files through an XML Schema documentation generator produces more human-readable output. The generated files can be found here:

 * http://gridengine.info/files/qstat.xsd.html
 * http://gridengine.info/files/message.xsd.html
 * http://gridengine.info/files/detailed_job_info.xsd.html

Documenting binary bitmask math for JAT_state values

Mainly covered here so far: http://gridengine.info/articles/2005/11/03/gridengine-xml-translating-jat_state-values-into-useful-information
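
As a rough illustration of the bitmask math, the sketch below splits a JAT_state value into its individual bits. The function name decode_jat_state and the table ASSUMED_JAT_STATE_BITS are hypothetical, and the bit names are recalled from the Grid Engine source headers rather than taken from this page, so check them against your own SGE version before relying on them.

```python
# Bit names are an ASSUMPTION recalled from the Grid Engine source headers;
# verify them against the headers shipped with your SGE version.
ASSUMED_JAT_STATE_BITS = {
    0x0010: "JHELD",
    0x0040: "JQUEUED",
    0x0080: "JRUNNING",
    0x0100: "JSUSPENDED",
    0x0200: "JTRANSFERING",
    0x0400: "JDELETED",
    0x0800: "JWAITING",
}


def decode_jat_state(value):
    """Split a JAT_state integer into its set bits, lowest bit first."""
    flags = []
    bit = 1
    while bit <= value:
        if value & bit:
            # Bits missing from the table are reported as hex, not guessed at.
            flags.append(ASSUMED_JAT_STATE_BITS.get(bit, hex(bit)))
        bit <<= 1
    return flags


if __name__ == "__main__":
    # The two JAT_state values that appear in the XML example on this page.
    for state in (2112, 128):
        print(state, decode_jat_state(state))
```

Under those assumed names, 2112 (0x840) would read as a queued, waiting job and 128 (0x80) as a running one, which fits the template and task values in the XML example on this page.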


Documenting "qstat -xml" job status output

I am creating this page to document some of the confusing issues I'm facing regarding qstat XML output. The main issue is the large number of tersely named elements and attributes, some of which are named rather similarly.


SGE 6.0u7 (and later)

Non-XML Example: normal 'simple.sh' job

$ qstat -j 20
==============================================================
job_number:                 20
exec_file:                  job_scripts/20
submission_time:            Fri Dec 23 10:27:25 2005
owner:                      dag
uid:                        501
group:                      dag
gid:                        501
sge_o_home:                 /Users/dag
sge_o_log_name:             dag
sge_o_path:                 /opt/sge60s2/bin/darwin:/sw/bin:/sw/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /Users/dag/sgetest
sge_o_host:                 chrisdag-wireless
account:                    sge
cwd:                        /Users/dag/sgetest
path_aliases:               /tmp_mnt/ * * /
mail_list:                  dag@chrisdag-wireless.private.sonsorol.net
notify:                     FALSE
job_name:                   simple.sh
jobshare:                   0
shell_list:                 /bin/sh
env_list:                   
script_file:                ./simple.sh
usage    1:                 cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
scheduling info:            queue instance "alarm.q@chrisdag.local" dropped because it is temporarily not available
                            queue instance "all.q@chrisdag.local" dropped because it is temporarily not available
                            queue instance "disabled.q@chrisdag.local" dropped because it is temporarily not available
                            queue instance "alarm.q@chrisdag-wireless.private.sonsorol.net" dropped because it is overloaded: np_load_avg=1.244238 (= 0.844238 + 0.50 * 0.800000 with nproc=1) >= 0.3
                            queue instance "disabled.q@chrisdag-wireless.private.sonsorol.net" dropped because it is disabled

XML Example: normal 'simple.sh' job

Outstanding questions:

  1. What does each of these state/status elements mean, and which one is used to generate the "status" code when "qstat -f" is run without XML output enabled? What is the difference between JB_ja_template and JB_ja_tasks? When does one matter more than the other? How do job arrays and parallel tasks affect these values?
    • /qmaster_response/JB_ja_template/ulong_sublist/JAT_status = 0
    • /qmaster_response/JB_ja_template/ulong_sublist/JAT_state = 2112
    • /qmaster_response/JB_ja_tasks/ulong_sublist/JAT_status = 128
    • /qmaster_response/JB_ja_tasks/ulong_sublist/JAT_state = 128
  2. Why does plaintext "qstat" output show "sge_o_home" while the XML output shows "__SGE_PREFIX__O_HOME"? Should these be consistent? Answer: __SGE_PREFIX__O_HOME is the internal representation, and sge_o_home is what the pretty printer in qstat writes. The XML generator in qstat simply turns the internal representation into XML elements and content, so the internal name comes out as is.
  3. What format is <JAT_start_time>1135356727</JAT_start_time> in? Unix epoch seconds?
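
The element paths from question 1 and the environment entries from question 2 can be read with Python's standard xml.etree.ElementTree module. This is only a sketch: SAMPLE is a trimmed copy of the "qstat -j 20 -xml" output on this page, and the function names task_states and submit_environment are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the "qstat -j 20 -xml" output shown on this page.
SAMPLE = """<?xml version='1.0'?>
<detailed_job_info xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <djob_info>
    <qmaster_response>
      <JB_job_number>20</JB_job_number>
      <JB_env_list>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_HOME</VA_variable>
          <VA_value>/Users/dag</VA_value>
        </job_sublist>
      </JB_env_list>
      <JB_ja_template>
        <ulong_sublist>
          <JAT_status>0</JAT_status>
          <JAT_state>2112</JAT_state>
        </ulong_sublist>
      </JB_ja_template>
      <JB_ja_tasks>
        <ulong_sublist>
          <JAT_status>128</JAT_status>
          <JAT_state>128</JAT_state>
        </ulong_sublist>
      </JB_ja_tasks>
    </qmaster_response>
  </djob_info>
</detailed_job_info>
"""


def task_states(xml_text):
    """Return (JAT_status, JAT_state) pairs for the template and the tasks."""
    job = ET.fromstring(xml_text).find("djob_info/qmaster_response")
    result = {}
    for section in ("JB_ja_template", "JB_ja_tasks"):
        result[section] = [
            (int(task.findtext("JAT_status")), int(task.findtext("JAT_state")))
            for task in job.findall(section + "/ulong_sublist")
        ]
    return result


def submit_environment(xml_text):
    """Return the JB_env_list entries as a dict keyed by the internal name."""
    root = ET.fromstring(xml_text)
    path = "djob_info/qmaster_response/JB_env_list/job_sublist"
    return {
        item.findtext("VA_variable"): item.findtext("VA_value")
        for item in root.iterfind(path)
    }


if __name__ == "__main__":
    print(task_states(SAMPLE))
    print(submit_environment(SAMPLE))
```

The internal names do not always map mechanically onto the plaintext labels: in the outputs on this page, __SGE_PREFIX__O_LOGNAME is printed as sge_o_log_name, so a simple prefix substitution would not reproduce every label.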

NB: In newer versions (e.g., 6.1u2), JAT_start_time is reported in ISO 8601 format.
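
A parser that has to cope with both behaviours could look roughly like the sketch below. It assumes all-digit values are Unix epoch seconds, that "0" means the field is unset, and that anything else is an ISO 8601 string accepted by datetime.fromisoformat. The name parse_sge_time is hypothetical. The epoch reading fits the example on this page: JB_submission_time 1135351645 is 15:27:25 UTC on 23 Dec 2005, which matches the plaintext "Fri Dec 23 10:27:25 2005" if the submitting host was at UTC-5.

```python
from datetime import datetime, timezone


def parse_sge_time(text):
    """Return a datetime for a qstat XML time field, or None if it is "0".

    ASSUMPTIONS: all-digit values are Unix epoch seconds, "0" means unset,
    and any other value is an ISO 8601 string.
    """
    text = text.strip()
    if text.isdigit():
        seconds = int(text)
        if seconds == 0:
            return None
        # Epoch values are returned as timezone-aware UTC datetimes.
        return datetime.fromtimestamp(seconds, tz=timezone.utc)
    # ISO strings without an offset come back as naive datetimes.
    return datetime.fromisoformat(text)


if __name__ == "__main__":
    # JB_submission_time and JAT_start_time from the XML example on this page.
    for raw in ("1135351645", "1135356727", "0"):
        print(raw, parse_sge_time(raw))
```

Epoch values are returned in UTC on purpose, so the result does not depend on the time zone of the machine doing the parsing.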

$ qstat -j 20 -xml
<?xml version='1.0'?>
<detailed_job_info  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <djob_info>
    <qmaster_response>
      <JB_job_number>20</JB_job_number>
      <JB_job_name>simple.sh</JB_job_name>
      <JB_version>0</JB_version>
      <JB_session>JAPI_SSK</JB_session>
      <JB_department>defaultdepartment</JB_department>
      <JB_exec_file>job_scripts/20</JB_exec_file>
      <JB_script_file>./simple.sh</JB_script_file>
      <JB_script_size>0</JB_script_size>
      <JB_submission_time>1135351645</JB_submission_time>
      <JB_execution_time>0</JB_execution_time>
      <JB_deadline>0</JB_deadline>
      <JB_owner>dag</JB_owner>
      <JB_uid>501</JB_uid>
      <JB_group>dag</JB_group>
      <JB_gid>501</JB_gid>
      <JB_account>sge</JB_account>
      <JB_cwd>/Users/dag/sgetest</JB_cwd>
      <JB_notify>false</JB_notify>
      <JB_type>0</JB_type>
      <JB_reserve>false</JB_reserve>
      <JB_priority>1024</JB_priority>
      <JB_jobshare>0</JB_jobshare>
      <JB_shell_list>
        <path_list>
          <PN_path>/bin/sh</PN_path>
          <PN_host></PN_host>
          <PN_file_host></PN_file_host>
          <PN_file_staging>false</PN_file_staging>
        </path_list>
      </JB_shell_list>
      <JB_verify>0</JB_verify>
      <JB_env_list>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_HOME</VA_variable>
          <VA_value>/Users/dag</VA_value>
        </job_sublist>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_LOGNAME</VA_variable>
          <VA_value>dag</VA_value>
        </job_sublist>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_PATH</VA_variable>
          <VA_value>/opt/sge60s2/bin/darwin:/sw/bin:/sw/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin</VA_value>
        </job_sublist>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_SHELL</VA_variable>
          <VA_value>/bin/bash</VA_value>
        </job_sublist>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_HOST</VA_variable>
          <VA_value>chrisdag-wireless</VA_value>
        </job_sublist>
        <job_sublist>
          <VA_variable>__SGE_PREFIX__O_WORKDIR</VA_variable>
          <VA_value>/Users/dag/sgetest</VA_value>
        </job_sublist>
      </JB_env_list>
      <JB_checkpoint_attr>0</JB_checkpoint_attr>
      <JB_checkpoint_object></JB_checkpoint_object>
      <JB_checkpoint_interval>0</JB_checkpoint_interval>
      <JB_restart>0</JB_restart>
      <JB_merge_stderr>false</JB_merge_stderr>
      <JB_mail_options>0</JB_mail_options>
      <JB_mail_list>
        <element>
          <MR_user>dag</MR_user>
          <MR_host>chrisdag-wireless.private.sonsorol.net</MR_host>
        </element>
      </JB_mail_list>
      <JB_ja_structure>
        <task_id_range>
          <RN_min>1</RN_min>
          <RN_max>1</RN_max>
          <RN_step>1</RN_step>
        </task_id_range>
      </JB_ja_structure>
      <JB_ja_template>
        <ulong_sublist>
          <JAT_task_number>1</JAT_task_number>
          <JAT_status>0</JAT_status>
          <JAT_start_time>0</JAT_start_time>
          <JAT_end_time>0</JAT_end_time>
          <JAT_hold>0</JAT_hold>
          <JAT_job_restarted>0</JAT_job_restarted>
          <JAT_state>2112</JAT_state>
          <JAT_pvm_ckpt_pid>0</JAT_pvm_ckpt_pid>
          <JAT_pending_signal>0</JAT_pending_signal>
          <JAT_pending_signal_delivery_time>0</JAT_pending_signal_delivery_time>
          <JAT_pid>0</JAT_pid>
          <JAT_fshare>0</JAT_fshare>
          <JAT_tix>0.000000</JAT_tix>
          <JAT_oticket>0.000000</JAT_oticket>
          <JAT_fticket>0.000000</JAT_fticket>
          <JAT_sticket>0.000000</JAT_sticket>
          <JAT_share>0.000000</JAT_share>
          <JAT_suitable>0</JAT_suitable>
          <JAT_pe_object></JAT_pe_object>
          <JAT_next_pe_task_id>0</JAT_next_pe_task_id>
          <JAT_stop_initiate_time>0</JAT_stop_initiate_time>
          <JAT_prio>0.555000</JAT_prio>
          <JAT_ntix>0.500000</JAT_ntix>
        </ulong_sublist>
      </JB_ja_template>
      <JB_ja_tasks>
        <ulong_sublist>
          <JAT_task_number>1</JAT_task_number>
          <JAT_status>128</JAT_status>
          <JAT_start_time>1135356727</JAT_start_time>
          <JAT_end_time>0</JAT_end_time>
          <JAT_hold>0</JAT_hold>
          <JAT_job_restarted>0</JAT_job_restarted>
          <JAT_granted_destin_identifier_list>
            <element>
              <JG_qname>alarm.q@chrisdag-wireless.private.sonsorol.net</JG_qname>
              <JG_qversion>0</JG_qversion>
              <JG_qhostname>chrisdag-wireless.private.sonsorol.net</JG_qhostname>
              <JG_slots>1</JG_slots>
              <JG_queue></JG_queue>
              <JG_tag_slave_job>0</JG_tag_slave_job>
              <JG_task_id_range>0</JG_task_id_range>
              <JG_ticket>0.000000</JG_ticket>
              <JG_oticket>0.000000</JG_oticket>
              <JG_fticket>0.000000</JG_fticket>
              <JG_sticket>0.000000</JG_sticket>
              <JG_jcoticket>0.000000</JG_jcoticket>
              <JG_jcfticket>0.000000</JG_jcfticket>
            </element>
          </JAT_granted_destin_identifier_list>
          <JAT_master_queue>alarm.q@chrisdag-wireless.private.sonsorol.net</JAT_master_queue>
          <JAT_state>128</JAT_state>
          <JAT_pvm_ckpt_pid>0</JAT_pvm_ckpt_pid>
          <JAT_pending_signal>0</JAT_pending_signal>
          <JAT_pending_signal_delivery_time>0</JAT_pending_signal_delivery_time>
          <JAT_pid>0</JAT_pid>
          <JAT_usage_list>
            <element>
              <UA_name>cpu</UA_name>
              <UA_value>0.000000</UA_value>
            </element>
            <element>
              <UA_name>mem</UA_name>
              <UA_value>0.000000</UA_value>
            </element>
            <element>
              <UA_name>io</UA_name>
              <UA_value>0.000000</UA_value>
            </element>
            <element>
              <UA_name>iow</UA_name>
              <UA_value>0.000000</UA_value>
            </element>
          </JAT_usage_list>
          <JAT_scaled_usage_list>
            <scaled>
              <UA_name>cpu</UA_name>
              <UA_value>0.000000</UA_value>
            </scaled>
            <scaled>
              <UA_name>mem</UA_name>
              <UA_value>0.000000</UA_value>
            </scaled>
            <scaled>
              <UA_name>io</UA_name>
              <UA_value>0.000000</UA_value>
            </scaled>
            <scaled>
              <UA_name>iow</UA_name>
              <UA_value>0.000000</UA_value>
            </scaled>
          </JAT_scaled_usage_list>
          <JAT_fshare>0</JAT_fshare>
          <JAT_tix>0.000000</JAT_tix>
          <JAT_oticket>0.000000</JAT_oticket>
          <JAT_fticket>0.000000</JAT_fticket>
          <JAT_sticket>0.000000</JAT_sticket>
          <JAT_share>0.000000</JAT_share>
          <JAT_suitable>0</JAT_suitable>
          <JAT_pe_object></JAT_pe_object>
          <JAT_next_pe_task_id>1</JAT_next_pe_task_id>
          <JAT_stop_initiate_time>0</JAT_stop_initiate_time>
          <JAT_prio>0.555000</JAT_prio>
          <JAT_ntix>0.500000</JAT_ntix>
        </ulong_sublist>
      </JB_ja_tasks>
      <JB_host></JB_host>
      <JB_verify_suitable_queues>0</JB_verify_suitable_queues>
      <JB_nrunning>0</JB_nrunning>
      <JB_soft_wallclock_gmt>0</JB_soft_wallclock_gmt>
      <JB_hard_wallclock_gmt>0</JB_hard_wallclock_gmt>
      <JB_override_tickets>0</JB_override_tickets>
      <JB_path_aliases>
        <PathAliases>
          <PA_origin>/tmp_mnt/</PA_origin>
          <PA_submit_host>*</PA_submit_host>
          <PA_exec_host>*</PA_exec_host>
          <PA_translation>/</PA_translation>
        </PathAliases>
      </JB_path_aliases>
      <JB_urg>1000.000000</JB_urg>
      <JB_nurg>0.500000</JB_nurg>
      <JB_nppri>0.500000</JB_nppri>
      <JB_rrcontr>1000.000000</JB_rrcontr>
      <JB_dlcontr>0.000000</JB_dlcontr>
      <JB_wtcontr>0.000000</JB_wtcontr>
    </qmaster_response>
  </djob_info>
  <messages>
    <qmaster_response>
      <SME_global_message_list>
        <element>
          <MES_message_number>40</MES_message_number>
          <MES_message>queue instance