StephansBlog

From GridWiki

N1GE 6 - Scheduler Hacks: "least used" / "fill up" configuration

I also want to use this blog to talk about some "hacks" around the N1 Grid Engine 6 software and its scheduler. The scheduler in the Grid Engine project is in theory a very comfortable tool. It makes the decision for the user where to run the jobs. The user only has to specify a couple of constraints which have to be met by the execution host. However, there are some configuration settings which are not very intuitive. The one I want to talk about today is the configuration of "least used host first" and "fill up host".

I will not go into detail of the Grid Engine terminology. I also assume that the reader has a basic understanding of the N1 Grid Engine 6 software.

The default setting for the scheduler is to distribute jobs load based. It looks at every host and assigns a job to the host with the least load. This is not always desired. There are use cases where the scheduler should distribute the jobs equally over all available hosts and only assign multiple jobs to one host when all hosts are in use, or, conversely, fill up one host first before assigning jobs to the next host.

I think that the equal distribution is the more common use case. For example, it is useful when over-subscribing the hosts in the grid. "Over-subscription" means that one host will execute more jobs than it has CPUs. The setting "use least used host first" ensures that all available CPUs are used first, before Grid Engine starts to over-subscribe a host.

In my example I will set up a grid with two hosts ("host_A", "host_B") and one cluster queue "all.q". The hosts are referenced in the host group "@allhosts".

qconf -sq all.q will show (reduced to the important details):

 qname     all.q
 hostlist  @allhosts
 slots     1,[host_A=4],[host_B=4]

We see that each host can run 4 jobs at the same time. To prepare the hosts for "least used host first" or "fill up" we have to configure:

qconf -me host_A and set complex_values slots=4:

 hostname        host_A
 load_scaling    NONE
 complex_values  slots=4

We do the same for host_B (qconf -me host_B).

This setting is needed because the scheduler distributes jobs to the hosts based on load values. A load value can come from an external script, which reports values (such as load_avg, mem_use, ...), or from a consumable. We use a consumable for our configuration. To give the scheduler access to the value, we need to define it for each host, as we just did. The N1 Grid Engine 6 software will now count the running jobs not only on the queue level but also on the host level. If the sum of slots over all queue instances on a given host is greater than the value defined for that host, the scheduler limits the number of running jobs on that host to the value from the host configuration. If the queue instances on that host together have fewer slots than defined for the host, the number of running jobs will not exceed the number of slots in the queue instances.
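
The interplay of the two limits can be sketched as a toy model (not SGE itself); the slot counts below are the hypothetical ones from this example:

```python
# Toy model of host-level slot limiting (not SGE itself).
# The host's 'complex_values slots=N' caps the total number of running
# jobs across all queue instances on that host, while each queue
# instance's own slot count still applies individually.

def effective_host_capacity(host_slots, queue_instance_slots):
    """host_slots: value from 'complex_values slots=N' on the host.
    queue_instance_slots: list of slot counts, one per queue instance
    on that host."""
    return min(host_slots, sum(queue_instance_slots))

# host_A: complex_values slots=4, one queue instance with 4 slots
print(effective_host_capacity(4, [4]))      # 4
# two queue instances with 4 slots each, still capped at 4 by the host
print(effective_host_capacity(4, [4, 4]))   # 4
# fewer queue slots than the host allows: the queue limit wins
print(effective_host_capacity(4, [2]))      # 2
```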

After we did the preparation, we need to tell the scheduler to use the least used host first or to fill it up.

To enable "use least used host first" we configure: qconf -msconf and set "queue_sort_method load" and "load_formula -slots".

 algorithm                  default
 schedule_interval          0:2:0
 maxujobs                   0
 queue_sort_method          load
 job_load_adjustments       NONE
 load_adjustment_decay_time 0:0:0
 load_formula               -slots
 schedd_job_info            true
 flush_submit_sec           1
 flush_finish_sec           1

To enable "fill up host" we configure: 'qconf -msconf' and set "queue_sort_method load" and "load_formula slots".

 algorithm                  default
 schedule_interval          0:2:0
 maxujobs                   0
 queue_sort_method          load
 job_load_adjustments       NONE
 load_adjustment_decay_time 0:0:0
 load_formula               slots
 schedd_job_info            true
 flush_submit_sec           1
 flush_finish_sec           1
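
Why the sign of the load formula flips the behavior can be shown with a small toy model (not SGE itself). I assume here that the slots load value reported for a host is its remaining free slots, and that the scheduler always picks the host with the lowest formula value first:

```python
# Toy model of queue_sort_method load with the slots consumable as the
# load formula (not SGE itself). The scheduler picks the host with the
# lowest formula value first.

def pick_host(free_slots, formula_sign):
    """free_slots: dict host -> remaining free slots.
    formula_sign: -1 models 'load_formula -slots' (least used first),
    +1 models 'load_formula slots' (fill up)."""
    candidates = {h: s for h, s in free_slots.items() if s > 0}
    return min(candidates, key=lambda h: formula_sign * candidates[h])

free = {"host_A": 4, "host_B": 2}   # host_B already runs 2 jobs
print(pick_host(free, -1))  # least used first -> host_A
print(pick_host(free, +1))  # fill up -> host_B
```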

If performance is critical for you, I am not sure that I can recommend this configuration. If that is the case, please use profiling to validate the performance impact.

Well, having set up the scheduler this way, one might wonder how this setting works together with the parallel environment (PE) allocation rule. The default behavior is that whatever is specified in the PE overrides the scheduler configuration. Only if "pe_slots" is set as the allocation rule is the scheduler configuration used.


N1GE 6 - Scheduler Hacks: Comment on the qmaster <-> scheduler protocol

If one is using any of the ticket policies one will most likely see something similar to:

 04/12/2005 09:16:28|qmaster|xxx|E| orders user/project version (16468) is not uptodate (16469) for user/project "PRJ147"


in the qmaster messages file ($SGE_CELL/spool/qmaster/messages). I would like to explain how this can happen and why it is not necessarily a bug when these messages are logged.

The scheduler is implemented as an event client. This means that it will receive an event whenever an object in the qmaster is added, removed, or modified. These events are usually delivered to the event clients right away or with a delay that the event client can specify. In the case of the scheduler, it is every scheduling interval (default 15s). The event delivery does not only update the data in the scheduler, it also triggers a scheduling run. Depending on the number of jobs and their complexity, it can take a while before a scheduling run has finished. With a couple of 10k jobs in the system it might take longer than the scheduling interval. In this case, a second event client configuration setting is activated. It specifies what the event master should do when events are not acknowledged or the client is busy. In the case of the scheduler, no events are sent while the event client is marked as busy. This means that the scheduling data will not be updated during a scheduling run.

It can happen that an administrator modifies an object during a scheduling run. This leads to the error message we saw in the beginning. After every scheduling run, the scheduler sends a package of orders to the qmaster. While the qmaster executes the orders, it validates them and ensures that the affected objects did not change. If such a change is detected, the order is ignored and we see an error message that the order failed.
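
The race can be sketched as a toy version-check (a model of the idea, not the real protocol or code):

```python
# Toy model of the order validation race (not the real protocol): the
# scheduler works on a snapshot of object versions; while it is busy,
# no events are delivered, so a concurrent modification bumps the
# qmaster-side version and the resulting order is rejected.

class Qmaster:
    def __init__(self):
        self.versions = {"PRJ147": 16468}

    def modify(self, obj):
        # e.g. an admin runs qconf -mprj during a scheduling run
        self.versions[obj] += 1

    def execute_order(self, obj, seen_version):
        if self.versions[obj] != seen_version:
            return (f"orders user/project version ({seen_version}) is not "
                    f'uptodate ({self.versions[obj]}) for user/project "{obj}"')
        return "ok"

qm = Qmaster()
snapshot = dict(qm.versions)   # scheduler's data at the start of the run
qm.modify("PRJ147")            # change happens while the scheduler is busy
print(qm.execute_order("PRJ147", snapshot["PRJ147"]))
```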

Commands which might lead to the error message:

  • qconf -mq // modify a queue
  • qmod // change a queue
  • qconf -clearusage
  • qconf -mprj // modify a project

and others.

Due to bugs in the event master, these error messages were logged quite frequently in older versions (N1GE 6.0 FCS, u1, u2, and u3). However, if nobody changed anything and these error messages are still logged, one might have found a bug.


N1GE 6 - Scheduler Hacks: job execution priority

The nice level of a job can be set in different ways. The simple way is to leave the reprioritization feature turned off (it is the default setting) and set the nice level via the queue configuration.

% qconf -mq all.q

  priority  0

All jobs running in the queue instance will run with the defined nice level. One can now easily configure different cluster queues (such as low, medium, and high priority) with different nice levels.

This is the easy way and allows the user to decide how important the job is by submitting it to a specific cluster queue.

This approach is not always fine grained enough. Sometimes it is important to rank the jobs based on the scheduling priority. A high priority job should not only be scheduled as fast as possible but also run at a lower nice level than low priority jobs. The importance ranking for the scheduling decision is done via the ticket policy and other policies. But only the ticket policy has a direct impact on the job's nice level when "reprioritize" is enabled. There are two places to enable and control job reprioritization:

% qconf -mconf

  reprioritize  0

% qconf -msconf

  reprioritize_interval  0:0:0

One could assume, that one can also influence the reprioritization via:

% qconf -mconf <host_name>

but, even though the setting is accepted, it does not have an effect. The "reprioritize" flag enables/disables the feature. If it is set to true, the execd will monitor the usage of each job that it is running. It knows the number of tickets for each job and will ensure that the usage ratio between the jobs matches the ticket ratio between them. Every job gets initial start tickets. The scheduler will most certainly change them while the job is running. Therefore we have the reprioritize_interval, which updates the jobs on the execd side and ensures that the ratio between the usages reflects the ratio between the tickets via the nice level. Since it takes some time to adjust the jobs' usage via the nice level, the tickets should not be sent too often. The recommendation is 2 minutes for the reprioritize_interval.

If the reprioritize_interval is set to 0:0:0, the reprioritization feature is disabled (the same as setting reprioritize to 0). It also works the other way around: setting a reprioritize_interval enables the feature by setting reprioritize to 1.
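
A toy linear mapping from ticket share to nice level illustrates the direction of the effect. This is not the real PTF algorithm; the endpoint values -10 and 19 are assumptions read off the top output in the example below, and the actual limits are defined in source/daemons/execd/ptf.h:

```python
# Toy mapping from a job's ticket share to a nice level (not the real
# PTF code). Endpoints are assumptions: the highest-ticket job gets
# PTF_MAX_PRIORITY, a job with no tickets gets PTF_MIN_PRIORITY.

PTF_MAX_PRIORITY = -10   # assumed best (most negative) nice level
PTF_MIN_PRIORITY = 19    # assumed worst nice level

def nice_level(tickets, max_tickets):
    share = tickets / max_tickets if max_tickets else 0.0
    return round(PTF_MIN_PRIORITY + share * (PTF_MAX_PRIORITY - PTF_MIN_PRIORITY))

print(nice_level(25000, 25000))  # -10: top-ticket job
print(nice_level(0, 25000))      # 19: job without tickets
```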

A sample setup with two projects:

 PRJ10 100 functional shares
 PRJ1   10 functional shares

qstat shows:

 JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt   
 --------------------------------------------------
 223670 1.50000   qw    PRJ10   25000       0       0   25000       0
 223671 0.59091   qw     PRJ1    2272       0       0    2272       0
 223672 0.50000    r       NA       0       0       0       0       0
 223673 0.50000    r       NA       0       0       0       0       0
 223674 0.50000    r       NA       0       0       0       0       0

Top output (note the changes in the nice level):

1)

   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 11137 sg144703   1  40  -10 4240K 3352K cpu/3    2:35 21.93% work
 11139 sg144703   1  37   -9 4240K 3352K cpu/2    2:31 21.10% work
 11749 sg144703   1   0   17 4240K 3352K cpu/0    1:07 17.66% work
 11743 sg144703   1   0   12 4240K 3352K run      0:54 17.20% work
 11751 sg144703   1   0   19 4240K 3352K run      1:04 14.00% work

2)

   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 11137 sg144703   1  30  -10 4240K 3352K cpu/1    3:23 23.92% work
 11139 sg144703   1  27   -9 4240K 3352K cpu/2    3:19 23.41% work
 11743 sg144703   1   0   19 4240K 3352K run      1:21 16.28% work
 11751 sg144703   1   0   18 4240K 3352K cpu/3    1:36 15.28% work
 11749 sg144703   1   8   17 4240K 3352K run      1:37 12.66% work

3)

 11137 sg144703   1  30  -10 4240K 3352K run      4:30 24.02% work
 11139 sg144703   1  27   -9 4240K 3352K cpu/1    4:26 23.92% work
 11751 sg144703   1   0   19 4240K 3352K cpu/3    2:25 16.32% work
 11749 sg144703   1   0   15 4240K 3352K run      2:17 13.70% work
 11743 sg144703   1   0   17 4240K 3352K run      1:56 11.83% work

And the qstat usage output:

 job-id project          department state cpu        mem       io    tckts ovrts otckt ftckt stckt
 ----------------------------------------------------------------------
 223670 PRJ10            defaultdep r     0:00:04:41 1.13824 0.00000 90909     0     0 90909     0
 223671 PRJ1             defaultdep r     0:00:04:37 1.11933 0.00000  9090     0     0  9090     0
 223672 NA               defaultdep r     0:00:02:04 0.50110 0.00000     0     0     0     0     0
 223673 NA               defaultdep r     0:00:02:25 0.58774 0.00000     0     0     0     0     0
 223674 NA               defaultdep r     0:00:02:29 0.60243 0.00000     0     0     0     0     0

The machine used for this test had 4 processors, and there were always enough CPUs for the PRJ10 and PRJ1 jobs. Therefore they have more or less the same usage. The other jobs have to share the remaining resources and are way behind. The min/max values for the nice level are defined in the source file source/daemons/execd/ptf.h.

A different job mix results in different nice levels:

qstat output:

 JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt  
 ----------------------------------------------------
 223675 1.50000    r    PRJ10   30303       0       0   30303       0
 223676 1.50000    r    PRJ10   30303       0       0   30303       0
 223677 1.50000    r    PRJ10   30303       0       0   30303       0
 223678 0.80000    r     PRJ1    9090       0       0    9090       0
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

top:

1)

 21625 sg144703   1  40  -10 4240K 3352K cpu/1    1:32 20.61% work
 21589 sg144703   1  47   -9 4240K 3352K run      1:34 20.50% work
 21590 sg144703   1  37   -9 4240K 3352K cpu/0    1:34 20.30% work
 21633 sg144703   1   0   16 4240K 3352K run      1:21 18.31% work

The nice range used might be a bit extreme. There are two switches to specify the range: the settings PTF_MIN_PRIORITY and PTF_MAX_PRIORITY allow one to control the nice range. They can be set via:

% qconf -mconf

 execd_params       PTF_MIN_PRIORITY=19, PTF_MAX_PRIORITY=0


N1GE 6 - Scheduler Hacks: The ticket policy hierarchy

N1GE 6 supports fair-share scheduling through its ticket policy. The ticket policy consists of three different parts:

  • the Share Tree policy
  • the functional policy
  • the override policy

The share tree and the functional policy have a fixed amount of tickets, which is distributed over all jobs in the system. The override policy is open-ended. If "share override tickets" or "share functional tickets" is enabled, the ticket amount of a job depends on its submission time. The ticket amount in the share tree always depends on the submission time as well as the past usage. Of course, all three of them also depend on the assignments from the configuration.

The statement that a job's ticket amount depends on the submission time is a bit simplistic. An additional parameter comes into play through the ticket policy hierarchy. It specifies the order in which the different ticket policies are computed. The default is OFS, which means:

  1. O = override ticket policy
  2. F = functional ticket policy
  3. S = share tree policy

The real dependencies for the final tickets are:

compute override tickets:

  • assigned tickets by the configuration
  • job submission time

compute functional tickets:

  • assigned tickets by the configuration
  • assigned override tickets
  • job submission time

compute share tree tickets:

  • assigned tickets by the configuration
  • usage
  • assigned override tickets
  • assigned functional tickets
  • job submission time

As you can see, the previously computed tickets have an effect on the following ticket policy. The next example demonstrates this.

Setup:

% qconf -msconf

  ticket_policy_hierarchy           OFS
  weight_tickets_functional         100000
  weight_tickets_share              0
  weight_ticket                     1.000000
  weight_waiting_time               0.000000
  weight_deadline                   3600000.000000
  weight_urgency                    0.000000
  weight_priority                   0.000000

% qconf -aprj PRJ1

  name PRJ1
  oticket 10
  fshare 0
  acl NONE
  xacl NONE

% qconf -muser <NAME>

  fshare 100

Jobs:

7 jobs without a project:

% qsub $SGE_ROOT/examples/jobs/sleeper.sh

2 PRJ1 jobs:

% qsub -P PRJ1 $SGE_ROOT/examples/jobs/sleeper.sh

qstat output:

 JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt   shr
 --------------------------------------------------------
 223690 1.00000   qw     PRJ1   25010       0      10   25000       0    0.35
 223691 0.50000   qw     PRJ1   12505       0       5   12500       0    0.18
 223683 0.33320   qw       NA    8333       0       0    8333       0    0.12
 223684 0.24990   qw       NA    6250       0       0    6250       0    0.09
 223685 0.19992   qw       NA    5000       0       0    5000       0    0.07
 223686 0.16660   qw       NA    4166       0       0    4166       0    0.06
 223687 0.14280   qw       NA    3571       0       0    3571       0    0.05
 223688 0.12495   qw       NA    3125       0       0    3125       0    0.04
 223689 0.11107   qw       NA    2777       0       0    2777       0    0.04
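
The ftckt column appears to follow a simple 1/n pattern: the n-th job in ticket order holds 1/n of the first job's functional tickets, truncated to an integer. A quick check of that reading against the table (my interpretation of the numbers, not code from Grid Engine):

```python
# Reproduce the ftckt column of the qstat output above, assuming the
# n-th job (in ticket order) gets 1/n of the first job's functional
# tickets, with integer truncation.

first = 25000
ftckt = [first // n for n in range(1, 10)]
print(ftckt)
# [25000, 12500, 8333, 6250, 5000, 4166, 3571, 3125, 2777]
```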

Setup change:

I now modify the policy hierarchy:

% qconf -msconf

  ticket_policy_hierarchy           FSO

and the qstat output changes to:

 JobId     P      S   Project  Tot-Tkt   ovrts   otckt  ftckt   stckt   shr
 --------------------------------------------------------
 223683 1.00000   qw       NA   25000       0       0   25000       0    0.35
 223684 0.50000   qw       NA   12500       0       0   12500       0    0.18
 223685 0.33333   qw       NA    8333       0       0    8333       0    0.12
 223686 0.25000   qw       NA    6250       0       0    6250       0    0.09
 223687 0.20000   qw       NA    5000       0       0    5000       0    0.07
 223688 0.16667   qw       NA    4166       0       0    4166       0    0.06
 223689 0.14286   qw       NA    3571       0       0    3571       0    0.05
 223690 0.12540   qw     PRJ1    3135       0      10    3125       0    0.04
 223691 0.11131   qw     PRJ1    2782       0       5    2777       0    0.04


N1GE 6 - Scheduler: Department-Based policy setup

In his latest howto, Chris Dagdigian writes about configuring the different policies of the N1GE 6 system to enable department-based scheduling. He gives a good summary of the available policies and of how to set up the share tree policy. His howto can be found at: http://bioteam.net/dag/sge6-funct-share-dept.html.


N1GE 6 - Migration From LSF to Sun N1 GE Software

Kirk Patton (Transmeta) describes in his paper how and why he replaced LSF with N1GE 6. This is a nice success story and gives a good comparison between the two systems.

http://www.sun.com/bigadmin/features/articles/n1ge_migration.html


N1GE 6 - health monitoring

Software such as our Grid Engine can be a critical component in a production environment. Its perfect functioning has the highest priority. However, there are cases in which the grid goes down or one of its components is not available. When this happens, the administrator or the software has to react right away. N1GE 6 provides two ways to monitor the correct functioning of its components:

  • the heartbeat file at: <CELL>/common/heartbeat
  • qping.

Qping was enhanced quite a bit with the different update releases. The u4 update contains a fully functional version, and that is the version I reference in this blog.

1) Heartbeat file:

The heartbeat file contains a simple number that is increased at a fixed interval. If that number does not change for a couple of minutes, the qmaster has most likely stopped its execution.

2) qping:

Qping provides a more comprehensive way of monitoring the grid. It can be used to monitor the qmaster and the execd daemons. Depending on the parameters it is invoked with, one gets a heartbeat replacement or profound information about the status of the daemon. I will give a short introduction to qping; for more information consult the qping(1) man page. The monitoring part of the qping command can be executed from every machine by every user.

Heartbeat file replacement:

Command:

qping <MASTER_HOST> $SGE_QMASTER_PORT qmaster 1
qping <EXECD_HOST> $SGE_EXECD_PORT execd 1

Output:

 07/14/2005 14:38:19 endpoint scrabe.workgroup/qmaster/1 at port 7171 is up since 194 seconds

The output format is:

<DATE>

Extensive health information:

Command:

qping -f <MASTER_HOST> $SGE_QMASTER_PORT qmaster 1
qping <EXECD_HOST> $SGE_EXECD_PORT execd 1

Output:

 07/14/2005 14:38:10:
 SIRM version:             0.1
 SIRM message id:          2
 start time:               07/14/2005 14:35:05 (1121344505)
 run time [s]:             185
 messages in read buffer:  0
 messages in write buffer: 0
 nr. of connected clients: 3
 status:                   0
 info:                     TET: R (4.71) | EDT: R (0.71) | SIGT: R (184.61) | MT(1): R (6.17) | MT(2): R (4.62) | OK

The important information, which we did not get in the other output, is the per-thread monitoring and the number of messages in the read buffer. The per-thread information allows one to have more fine-grained monitoring and to detect deadlocks in the master. The number of messages in the read buffer can be used as an indicator for an overloaded qmaster. The qping in updates 4 and 5 only shows one MT thread even though 2 are used. This will be changed, as one can see in the output above.

The other functions of qping belong to the debug and analysis domain and are definitely worth playing with.


N1GE 6 - Scheduler Hacks: Sorting queues

I just received a question asking how to use the queue sequence numbers and what to do with them. I will give a short overview in this blog and hope to provide enough pointers for one's own experiments. Based on the documentation, the scheduler can sort the queue instances in two ways:

  • load based (from the hosts)
  • sequence number based (from the queues)

The load-based sorting is configured by default, including load adjustments. During the scheduling cycle, the load adjustments are added to the host which will run the job. This ensures that one gets a kind of round-robin job distribution. The load adjustment wears off over time and will be replaced by the real value within the host load report interval. The important configuration values for the queue sorting are (scheduler configuration, qconf -msconf):

 queue_sort_method                 load
 job_load_adjustments              np_load_avg=0.50
 load_adjustment_decay_time        0:7:30
 load_formula                      np_load_avg

This setting uses the load for sorting; it adds 0.5 to the load of a host for each job started there, and that adjustment decays over 7.5 minutes.

Hint: If a host has more than 1 slot, the load adjustment can lead to not all slots on that host being used, because the next job might overload the host. qstat -j <job_id> will show the reasons why a job was not dispatched, including the hosts which will not be used due to load adjustments. If np_load_avg is used for the load adjustments and the load formula, the number of processors in the machine is taken into account.

Example (using job_load_adjustments np_load_avg=1.5). As one can see, not all slots are used.

% qstat -f

 queuename                      qtype used/tot. load_avg arch          states
 ----------------------------------------------------------------------------
 all.q@host1                     BIP   1/5       0.03     lx24-amd64
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 8
 ----------------------------------------------------------------------------
 all.q@host2                    BIP   3/5       0.78     sol-sparc64
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 5
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 7
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 11
 ----------------------------------------------------------------------------
 all.q@host3                   BIP   2/5       0.28     sol-sparc64
     103 0.55500 job        sg144703     t     07/21/2005 09:10:04     1 6
     103 0.55500 job        sg144703     t     07/21/2005 09:10:04     1 12
 ----------------------------------------------------------------------------
 all.q@host4                    BIP   1/5       0.16     sol-x86
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 10
 ----------------------------------------------------------------------------
 all.q@host5                    BIP   0/5       0.01     sol-x86
 ----------------------------------------------------------------------------
 test.q@host1                    BIP   1/5       0.03     lx24-amd64
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 2
 ----------------------------------------------------------------------------
 test.q@host2                   BIP   0/5       0.78     sol-sparc64   D
 ----------------------------------------------------------------------------
 test.q@host3                   BIP   2/5       0.28     sol-sparc64
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 3
     103 0.55500 job        sg144703     t     07/21/2005 09:10:04     1 9
 ----------------------------------------------------------------------------
 test.q@host4                    BIP   1/5       0.16     sol-x86
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 4
 ----------------------------------------------------------------------------
 test.q@host5                    BIP   1/5       0.01     sol-x86
     103 0.55500 job        sg144703     r     07/21/2005 09:10:04     1 1
 
 ############################################################################
  PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
 ############################################################################
     103 0.00000 job        sg144703     qw    07/21/2005 09:10:02     1 13-20:1

% qstat -j 103

 scheduling info:
   queue instance "test.q@ori" dropped because it is overloaded: np_load_avg=2.511719 (= 0.011719 + 2.50 * 1.000000 with nproc=1) >= 1.75
   queue instance "all.q@ori" dropped because it is overloaded: np_load_avg=2.511719 (= 0.011719 + 2.50 * 1.000000 with nproc=1) >= 2.05
   queue instance "all.q@carc" dropped because it is overloaded: np_load_avg=2.515000 (= 0.015000 + 2.50 * 2.000000 with nproc=1) >= 2.05
   queue instance "test.q@carc" dropped because it is overloaded: np_load_avg=2.515000 (= 0.015000 + 2.50 * 2.000000 with nproc=1) >= 1.75
   queue instance "test.q@gimli" dropped because it is overloaded: np_load_avg=1.945312 (= 0.070312 + 2.50 * 3.000000 with nproc=1) >= 1.75
   queue instance "all.q@nori" dropped because it is overloaded: np_load_avg=2.580078 (= 0.080078 + 2.50 * 2.000000 with nproc=1) >= 2.05
   queue instance "test.q@nori" dropped because it is overloaded: np_load_avg=2.580078 (= 0.080078 + 2.50 * 2.000000 with nproc=1) >= 1.75
   queue instance "all.q@es-ergb01-01" dropped because it is overloaded: np_load_avg=2.070312 (= 0.195312 + 2.50 * 3.000000 with nproc=1) >= 2.05
   queue instance "all.q@gimli" dropped because it is overloaded: np_load_avg=2.570312 (= 0.070312 + 2.50 * 4.000000 with nproc=1) >= 2.05
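
The first line of this output can be checked by hand (the later lines do not multiply out literally, presumably because the adjustments have already partially decayed). A sketch of the arithmetic, not the scheduler itself:

```python
# Check the first "dropped" line of the scheduling info above:
# adjusted load = raw np_load_avg + job_load_adjustments value per
# adjusted job; the queue instance is dropped when this meets or
# exceeds its load threshold.

raw = 0.011719             # current np_load_avg of the host
adjustment = 2.50          # job_load_adjustments value in this output
jobs_adjusted = 1          # one recently dispatched job on this host
threshold = 1.75           # load threshold of test.q in this output

adjusted = raw + adjustment * jobs_adjusted
print(round(adjusted, 6))      # 2.511719
print(adjusted >= threshold)   # True -> queue instance dropped
```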

As we can see, this configuration can be a very powerful tool for setting up rather complicated environments. However, there are cases where one would like to ensure that a certain queue is used before another queue. (I am using "queue" here to refer to cluster queues and queue instances together.) In these cases, one can assign a sequence number to the queues via qconf -mq <cluster queue name>:

 seq_no                0

This sequence number is used, when the scheduler configuration is changed to:

 queue_sort_method                 seqno

After this change, queue instances with a low seq_no will be chosen first. If there are multiple queue instances with the same sequence number, the configured load value is used to determine which queue instance to pick. This means that if all queue instances have the same seq_no and the scheduler is configured to use the seq_no for sorting, it is ultimately using the load from the hosts.
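
This ordering can be sketched as a simple two-key sort (a toy model, not SGE; the queue names and load values are taken from the example below):

```python
# Toy ordering under queue_sort_method seqno (not SGE itself): sort
# queue instances by seq_no first and break ties with the host load.

queue_instances = [
    ("all.q@host1",  2, 0.26),   # (name, seq_no, host load)
    ("test.q@host5", 0, 0.01),
    ("test.q@host4", 0, 0.08),
    ("all.q@host5",  2, 0.01),
]

order = sorted(queue_instances, key=lambda q: (q[1], q[2]))
print([name for name, _, _ in order])
# ['test.q@host5', 'test.q@host4', 'all.q@host5', 'all.q@host1']
```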

Example:

"test.q" has a sequence number of 0

"all.q" has a sequence number of 2

 queuename                      qtype used/tot. load_avg arch          states
 ----------------------------------------------------------------------------
 test.q@host1                   BIP   2/5       0.26     lx24-amd64
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 4
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 8
 ----------------------------------------------------------------------------
 test.q@host2           BIP   0/5       0.58     sol-sparc64   D
 ----------------------------------------------------------------------------
 test.q@host3                   BIP   4/5       0.44     sol-sparc64
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 3
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 5
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 7
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 9
 ----------------------------------------------------------------------------
 test.q@host4                   BIP   2/5       0.08     sol-x86
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 2
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 6
 ----------------------------------------------------------------------------
 test.q@host5                   BIP   2/5       0.01     sol-x86
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 1
     108 0.55500 job        sg144703     r     07/21/2005 09:24:44     1 10
 ----------------------------------------------------------------------------
 all.q@host1                    BIP   0/5       0.26     lx24-amd64
 ----------------------------------------------------------------------------
 all.q@host2                   BIP   0/5       0.58     sol-sparc64
 ----------------------------------------------------------------------------
 all.q@host3                    BIP   0/5       0.44     sol-sparc64
 ----------------------------------------------------------------------------
 all.q@host4                    BIP   0/5       0.08     sol-x86
 ----------------------------------------------------------------------------
 all.q@host5                     BIP   0/5       0.01     sol-x86

As one can see, only test.q was used, and within test.q the load values had an effect.


N1GE 6 - Scheduler Hacks: consumables as load threshold

This feature is new in N1GE 6 and can be quite powerful if used correctly. One could model similar behavior earlier with custom load sensors and load adjustments, but this is now much more reactive, built in, and easier to use. One can also find similarities to subordinate queues. However, this solution is much more generic: it takes any consumable, while subordinate queues are fixed on slots.

To use this feature one has to create a consumable, assign it to a host or the global host, and pick a queue which should go into alarm state when a certain amount is consumed. Assuming that one assigns 4 licenses to a host, and the queue instance should go into alarm state when 2 licenses are consumed, one would set:

% qconf -mc

 licence lic int <= YES YES 0 0

% qconf -me test_host

 complex_values lic = 4

% qconf -mq test.q

 load_thresholds lic = 3

That is all that is needed. If two jobs requesting lic are now running in all.q on test_host, test.q will go into alarm state and no further jobs will be started in that queue.

As one can see, this allows very complicated scenarios and a new way of handling job priorities and subordinate queues independently of the number of used slots.

Important: An important difference is that the consumable which is used as a load_threshold can be defined on the queue level. This is not possible with load values.

I am sure it will take a while of playing with this feature before useful scenarios become obvious.


N1GE 6 - We want you!

We want you! We need you as testers!

We just finished the first part of the next N1GE 6 update and would like you to take a look at what we have achieved so far. In addition to the usual bug fixes, the next update addresses some performance issues as well. Knowing that software will always be too slow, we made a cut where we currently are, took a look at it, and thought that it is time to ask for your opinion before we continue. The open source announcement gives you an overview of the improved parts in Grid Engine and the link to where you can download it.

In addition to the mentioned changes we included two additional work packages:

1) reworked qstat XML:

We changed some of the XML output and a lot in the schema to make sure that they go hand in hand. The nice part about reworking the schemas is that you can use JAXB to generate Java classes out of them. With this change it will be very easy to write a Java program which works on the qstat output. In combination with the DRMAA Java interface, one now has an "API" at hand which allows writing grid-enabled applications without too much effort.

Most of the changes only make sure that the XML output matches the schema files. One change might break already existing parsers: the dates are now printed in the XML dateTime format and no longer in a human-readable format. This was required to support JAXB parsers.
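For readers who do not want to go the JAXB route, the output can also be parsed with a few lines of Python (a sketch; the element names job_list, JB_job_number, JB_name and state follow the qstat schema, but should be verified against the schema files shipped with your release):

```python
# Sketch: parsing `qstat -xml`-style output with the Python standard
# library instead of JAXB. The sample below is a hand-written fragment
# mimicking the qstat schema, not real qstat output.
import xml.etree.ElementTree as ET

sample = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>42</JB_job_number>
      <JB_name>sleeper.sh</JB_name>
      <state>r</state>
    </job_list>
  </queue_info>
</job_info>"""

root = ET.fromstring(sample)
for job in root.iter("job_list"):
    # findtext returns the text content of the first matching child
    print(job.findtext("JB_job_number"),
          job.findtext("JB_name"),
          job.findtext("state"))
```

The same loop works on the output of `qstat -xml` piped into the script, as long as the element names match the installed schema.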

2) qmaster monitoring:

To measure the performance enhancements in the qmaster, we implemented a monitoring facility in the qmaster. It collects information on the different requests and how long it took to process them. The statistics are generated for each thread and can be printed into the messages files or retrieved via qping. More information on the generated statistics can be found here, and the man page explains how to enable it.

We are looking forward to your feedback on this snapshot.

Download: Grid Engine 6.0 Scalability Update 2 Snapshot 1


N1GE 6 - A couple lines on halftime and usage

The usage stored in the sharetree is controlled by its halftime, which can be configured via the halftime parameter in the scheduler configuration. This parameter specifies the halftime of the usage in hours; 1 hour is the smallest halftime one can set. A setting of 0 turns the decay of the usage off, which means that it will always grow.

A shortcoming of this setting is that all three usage types (io, mem, cpu) are treated equally. The halflife_decay_list is the way to overcome this shortcoming: it allows setting different decay rates for io, mem and cpu usage. The time in this list is specified in minutes, and it additionally allows turning past usage off entirely by setting the time to a negative value. A zero means no decay of usage. If a decay time of -1 is set, all past usage is removed right away and only the usage of the currently running jobs is used during the sharetree computation.

The settings of the halflife_decay_list will override the halftime settings.

Some examples (qconf -msconf):

1) Using halflife_decay_list to get the same result as with halftime

   halftime                   24
   halflife_decay_list        cpu=1440:mem=1440:io=1440

2) Disabling past usage

   halftime                   168
   halflife_decay_list        cpu=-1:mem=-1:io=-1

3) Disable past usage for mem; the others have a halftime of 1 day

   halftime                   24
   halflife_decay_list        mem=-1

Important: More than one parameter in the halflife_decay_list would crash the system in versions before 6.0u7. We fixed issue 1826 in 6.0u7.
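The decay arithmetic behind these settings can be written down explicitly (a sketch of my reading of the man page, not the actual sharetree code):

```python
# Sketch of the half-life decay used for sharetree usage.
# decayed = usage * 0.5 ** (elapsed / halflife); negative half-lives
# drop past usage immediately, zero disables decay entirely.

def decayed_usage(usage: float, minutes_elapsed: float,
                  halflife_minutes: float) -> float:
    if halflife_minutes < 0:      # e.g. -1: forget past usage right away
        return 0.0
    if halflife_minutes == 0:     # 0: no decay, usage only grows
        return usage
    return usage * 0.5 ** (minutes_elapsed / halflife_minutes)

# cpu=1440 (one day, as in example 1): after 24h half the usage remains
print(decayed_usage(100.0, 1440, 1440))   # 50.0
print(decayed_usage(100.0, 1440, -1))     # 0.0
print(decayed_usage(100.0, 1440, 0))      # 100.0
```

This also shows why halftime (hours) and halflife_decay_list (minutes) produce the same curve in example 1 above: 24 hours and 1440 minutes are the same half-life.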


N1GE 6 - Scheduler Hacks: Separate master host for pe jobs

When distributing pe jobs over a range of hosts, the pe provides a set of allocation rules. These rules allow the admin to specify that a host should be filled up first before another is used, that each host is used before any host runs a second task, or that the job uses a specified number of slots on each host it uses. This solves most of the use cases around pe jobs.
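The three allocation rules can be illustrated with a small toy model (plain Python, not scheduler code; it assumes the request fits into the available free slots):

```python
# Toy illustration of the pe allocation rules (not scheduler code):
# "$fill_up", "$round_robin", and a fixed number of slots per host.

def allocate(hosts, slots_needed, rule):
    """hosts: dict host -> free slots; returns dict host -> assigned slots.
    Assumes slots_needed fits into the total free slots."""
    alloc = {h: 0 for h in hosts}
    if rule == "$fill_up":
        # exhaust one host before touching the next
        for h, free in hosts.items():
            take = min(free, slots_needed)
            alloc[h], slots_needed = take, slots_needed - take
    elif rule == "$round_robin":
        # one slot per host per pass until the request is satisfied
        while slots_needed:
            for h, free in hosts.items():
                if alloc[h] < free and slots_needed:
                    alloc[h] += 1
                    slots_needed -= 1
    else:
        # a number as the rule: that many slots on each host used
        n = int(rule)
        for h in hosts:
            if slots_needed >= n:
                alloc[h] = n
                slots_needed -= n
    return alloc

print(allocate({"a": 4, "b": 4}, 5, "$fill_up"))      # {'a': 4, 'b': 1}
print(allocate({"a": 4, "b": 4}, 5, "$round_robin"))  # {'a': 3, 'b': 2}
```

The host names and slot counts are made up; only the allocation patterns mirror what the rules do.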

In this comment I would like to sketch out a scenario which cannot be addressed with the existing allocation rules: the exclusive use of the master host by the master task, while all other hosts use the fill-up allocation rule. This can come in handy if the master task of a job requires a lot of memory while the slave tasks do the computation, and only one machine with a lot of memory is available. The big machine can and should run multiple master tasks of this job kind.

There are two solutions to the problem. One could separate the memory-intensive computation out into an extra job and work with job dependencies, or one configures N1GE to handle the above use case as specified without any job modifications.

I have the following setup:

% qstat -f

 queuename                qtype used/tot. load_avg arch          states
 ----------------------------------------------------------------------------
 all.q@big                BIP   0/4       0.02     sol-sparc64
 ----------------------------------------------------------------------------
 small.q@small1           BIP   0/1       0.00     lx24-amd64
 ----------------------------------------------------------------------------
 small.q@small2           BIP   0/1       0.02     sol-sparc64

And a configured pe in all queue instances:

% qconf -sp make

 pe_name               make
 slots                 999
 user_lists            NONE
 xuser_lists           NONE
 start_proc_args       NONE
 stop_proc_args        NONE
 allocation_rule       $fill_up
 control_slaves        TRUE
 job_is_first_task     FALSE
 urgency_slots         min

We now go ahead and change the load_threshold in the all.q@big queue instance to be a load value that is not used in the other queue instances, such as:

% qconf -sq all.q

 qname                 all.q
 hostlist              big
 seq_no                0
 load_thresholds       NONE,[big=load_avg=4]

The used load threshold has to be a real load value and cannot be a fixed or consumable value.

The next step to make our environment work is to change the scheduler configuration to the following:

% qconf -ssconf

 algorithm                         default
 schedule_interval                 0:2:0
 maxujobs                          0
 queue_sort_method                 load
 job_load_adjustments              load_avg=4.000000
 load_adjustment_decay_time        0:0:1

By changing the configuration of the scheduler to use job_load_adjustments like this, it will add an artificial load to each host that runs a task. With this configuration we can start one task on the big machine in each scheduling run. Since the load_adjustment_decay_time is only 1 second, the scheduler has forgotten about the artificial load by the next scheduling run and can start a new task on the big host. This way, we achieve what we have been looking for.
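The effect of the artificial load within one scheduling run can be shown with a toy simulation (plain Python, not scheduler code; numbers taken from the configuration above):

```python
# Toy simulation of the job_load_adjustments trick (not scheduler code):
# each started task adds an artificial load that is visible for the rest
# of the scheduling run, so the load threshold stops a second placement.

def tasks_started_per_run(real_load, threshold, adjustment, free_slots):
    """How many tasks land on one host within a single scheduling run."""
    started = 0
    load = real_load
    while started < free_slots and load < threshold:
        started += 1          # scheduler places one task ...
        load += adjustment    # ... and immediately sees the artificial load
    return started

# load_avg 0.02, load_threshold 4, job_load_adjustments load_avg=4.0:
print(tasks_started_per_run(real_load=0.02, threshold=4,
                            adjustment=4.0, free_slots=4))  # 1
```

Because load_adjustment_decay_time is 1 second, the adjustment is gone by the next run and the count starts over, which is exactly the one-task-per-run behavior described above.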

One important note: The big machine is only allowed to have one queue instance, or all queue instances of the big machine have to share the same load threshold. If that is not the case, it will not work.


N1GE 6 - Monitoring the qmaster

With update 7 of the N1GE 6 software we added a new switch to monitor the qmaster. The qmaster monitoring allows gathering statistics on each thread, showing what it has been busy with and how much time it spent on it. There are two switches to control the statistics output:

% qconf -mconf

 qmaster_params               Monitor_Time=0:0:20 LOG_Monitor_Message=1

MONITOR_TIME

Specifies the time interval at which the monitoring information should be printed. The monitoring is disabled by default and can be enabled by specifying an interval. The monitoring is per thread and is written to the messages file or displayed by the "qping -f" command line tool. Example: MONITOR_TIME=0:0:10 generates and prints the monitoring information roughly every 10 seconds. The specified time is a guideline, not a fixed interval; the actual interval is printed and can be anything between 9 and 20 seconds in this example.

LOG_MONITOR_MESSAGE

The monitoring information is logged into the messages files by default. In addition, it is provided to qping and can be requested by it. The messages files can become quite big if the monitoring is enabled all the time; therefore this switch allows disabling the logging into the messages files, so that the monitoring data is only available via "qping -f".

A description of the output format can be found here.

Example output in the qmaster messages file ($SGE_ROOT/<CELL>/spool/qmaster/messages):

 04/25/2006 19:06:17|qmaster|scrabe|P|EDT: runs: 1.20r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.00 | events: 0.05/s added: 0.05/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 19.98s
 04/25/2006 19:06:17|qmaster|scrabe|P|MT(2): runs: 0.25r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.05,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.05/s) out: 0.05m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 20.10s 
 04/25/2006 19:06:18|qmaster|scrabe|P|MT(1): runs: 0.19r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.05,g:0.00,m:0.05,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.05m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 21.15s
 04/25/2006 19:06:27|qmaster|scrabe|P|TET: runs: 0.67r/s (pending: 9.00 executed: 0.67/s) out: 0.00m/s APT: 0.0205s/m idle: 98.63% wait: 0.00% time: 21.00s
 04/25/2006 19:06:37|qmaster|scrabe|P|EDT: runs: 1.60r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.00 | events: 1.10/s added: 1.10/s skipt: 0.00/s) out: 0.05m/s APT: 0.0002s/m idle: 99.97% wait: 0.00% time: 20.00s
 04/25/2006 19:06:39|qmaster|scrabe|P|MT(1): runs: 0.37r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.14,g:0.00,m:0.05,d:0.00,c:0.00,t:0.05,p:0.00)/s event-acks: 0.05/s) out: 0.32m/s APT: 0.0024s/m idle: 99.91% wait: 0.00% time: 21.55s
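A monitoring line can be picked apart with a small script. The following Python sketch uses a regular expression based on the sample output above; the format may change between updates, so treat the pattern as an assumption:

```python
# Sketch: extracting the headline numbers from one qmaster monitoring
# line. The regex is derived from the sample output shown above and is
# not guaranteed to match other releases.
import re

line = ("04/25/2006 19:06:17|qmaster|scrabe|P|EDT: runs: 1.20r/s "
        "(clients: 1.00 ...) out: 0.00m/s APT: 0.0001s/m "
        "idle: 99.99% wait: 0.00% time: 19.98s")

m = re.search(r"\|(\w+(?:\(\d+\))?): runs: ([\d.]+)r/s.*"
              r"idle: ([\d.]+)% wait: ([\d.]+)% time: ([\d.]+)s", line)
thread, runs, idle, wait, interval = m.groups()
print(thread, runs, idle)  # EDT 1.20 99.99
```

The same pattern also matches the numbered threads such as MT(1) and MT(2), which is why the thread group allows an optional parenthesized index.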

If we use the following settings:

% qconf -mconf

 qmaster_params               Monitor_Time=0:0:20 LOG_Monitor_Message=0

We will need to use qping to gain access to the monitoring messages. This should be the preferred way, because we get the statistics from the communication layer together with the statistics from the qmaster. Here is an example:

 04/25/2006 19:09:53:
 SIRM version:             0.1
 SIRM message id:          3
 start time:               04/25/2006 08:45:06 (1145947506)
 run time [s]:             37487
 messages in read buffer:  0
 messages in write buffer: 0
 nr. of connected clients: 3
 status:                   0
 info:                     TET: R (1.99) | EDT: R (0.99) | SIGT: R (37486.73) | MT(1): R (3.99) | MT(2): R (0.99) | OK

Monitor:

 04/25/2006 19:09:47 | TET: runs: 0.40r/s (pending: 9.00 executed: 0.40/s) out: 0.00m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 20.00s
 04/25/2006 19:09:37 | EDT: runs: 1.00r/s (clients: 1.00 mod: 0.00/s ack: 0.00/s blocked: 0.00 busy: 0.00 | events: 0.00/s added: 0.00/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 20.00s
 04/25/2006 08:45:07 | SIGT: no monitoring data available
 04/25/2006 19:09:36 | MT(1): runs: 0.15r/s (execd (l:0.04,j:0.04,c:0.04,p:0.04,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 26.86s
 04/25/2006 19:09:39 | MT(2): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% wait: 0.00% time: 21.04s


N1GE 6 - Scheduler Hacks: Exclusive master host for the master task

There was some discussion on the open source mailing lists, and a lot of interest in how one can single out the master task to a special host and have all the slave tasks on the compute nodes. There can be multiple reasons to do this; the one I heard most often was that the master task needs a lot of memory and a special host exists just for that purpose.

During the discussion, we came across 3 workarounds for the problem. I will start with the easiest setup and end with the most complicated. Since they are all workarounds, none of them is perfect. Nevertheless, they do achieve the goal more or less.

1) Using the host sorting mechanism:

Description:

Grid Engine allows sorting hosts/queues by sequence number. Assuming that we have only one cluster queue and the parallel environment is configured to use fill-up, we can assign the compute queue instances a smaller sequence number than the master machines. The job would request the pe to run in and the master machine as the masterq. This way, all slaves run on the compute nodes, which are filled up first, and the master task is singled out to the master machine due to its special request.

If the environment has more than one master host, wild cards in the masterq request can be used to select one of the master hosts.
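A toy model of this placement logic (plain Python with hypothetical queue names; not Grid Engine code) looks like this:

```python
# Toy model of workaround 1: queue instances are sorted by seq_no,
# slave tasks fill them up in that order, and the master task is
# pinned to the requested masterq. Queue names are made up.

def place_job(queues, n_slaves, masterq):
    """queues: list of dicts with name, seq_no, free slots."""
    placement = {"master": masterq}
    slaves = []
    for q in sorted(queues, key=lambda q: q["seq_no"]):
        while q["free"] > 0 and len(slaves) < n_slaves:
            slaves.append(q["name"])
            q["free"] -= 1
    placement["slaves"] = slaves
    return placement

queues = [{"name": "all.q@master1", "seq_no": 1, "free": 4},
          {"name": "small.q@small1", "seq_no": 0, "free": 1},
          {"name": "small.q@small2", "seq_no": 0, "free": 1}]
print(place_job(queues, n_slaves=3, masterq="all.q@master1"))
```

With three slave tasks the two small hosts are used first, and only the overflow lands on the master machine, which also illustrates the weakness noted below: once the compute nodes are full, slaves spill onto the master host.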

Advantages:

Makes best use of all resources, and is easy to set up, understand, and debug. This setup also has the least performance impact.

Problems:

As soon as there are not enough compute nodes available, the scheduler will assign more than one task to the master machine.

Configuration:

Change the queue sort order in the scheduler config:

% qconf -msconf

 queue_sort_method                seqno

The queue on the small hosts gets:

% qconf -mq <queue>

 seq_no                           0

The queue for the master hosts gets:

% qconf -mq <queue>

 seq_no                           1

A job submission would look like:

% qsub -pe <PE> 6 -masterq "*@master*" ...

2) Making excessive use of pe objects and cluster queues:

Description:

Each slot on a master host needs its own cluster queue and its own pe. The compute nodes are combined under one cluster queue with all pe objects that are used on the master hosts. Each master cluster queue has exactly one slot. The job submission now requests the master queue and the pe it should run in via wild cards.

Advantages:

Achieves the goal.

Problems:

Many configuration objects. Slows down the scheduler quite a bit.

Configuration:

I will leave the configuration for this one open. Should not be complicated...

3) Using load adjustments:

Description:

The scheduler uses load adjustments to avoid overloading a host. The system can be configured in such a way that the scheduler starts no more than one task on one host even though more slots are available. We will use this configuration to achieve the desired goal.

Advantages:

Achieves exactly what we are looking for without any additional configuration objects.

Problems:

Slows down scheduling. Only one job requesting the master host will be started in one scheduling run. Supporting backup master hosts is not easy.

The master machine is only allowed to have one queue instance, or all queue instances of the master machine have to share the same load threshold. If that is not the case, it will not work.

Configuration:

I have the following setup:

% qstat -f

 queuename                qtype used/tot. load_avg arch          states
 ----------------------------------------------------------------------------
 all.q@big                BIP   0/4       0.02     sol-sparc64
 ----------------------------------------------------------------------------
 small.q@small1           BIP   0/1       0.00     lx24-amd64
 ----------------------------------------------------------------------------
 small.q@small2           BIP   0/1       0.02     sol-sparc64

And a configured pe in all queue instances:

% qconf  -sp make

 pe_name             make
 slots               999
 user_lists          NONE
 xuser_lists         NONE
 start_proc_args     NONE
 stop_proc_args      NONE
 allocation_rule     $fill_up
 control_slaves      TRUE
 job_is_first_task   FALSE
 urgency_slots       min

We now go ahead and change the load_threshold in the all.q@big queue instance to be a load value that is not used in the other queue instances, such as:

% qconf -sq all.q

 qname                 all.q
 hostlist              big
 seq_no                0
 load_thresholds       NONE,[big=load_avg=4]

The used load threshold has to be a real load value and cannot be a fixed or consumable value.

The next step to make our environment work is to change the scheduler configuration to the following:

% qconf -ssconf

 algorithm                         default
 schedule_interval                 0:2:0
 maxujobs                          0
 queue_sort_method                 load
 job_load_adjustments              load_avg=4.100000
 load_adjustment_decay_time        0:0:1

By changing the configuration of the scheduler to use job_load_adjustments like this, it will add an artificial load to each host that runs a task. With this configuration we can start one task on the master machine in each scheduling run. Since the load_adjustment_decay_time is only 1 second, the scheduler has forgotten about the artificial load by the next scheduling run and can start a new task on the master host. This way, we achieve what we have been looking for.

Extended Configuration:

If the usage of multiple master hosts is required, one needs to create one pe object per master host. The compute hosts are part of all pe objects. The same rule as above still applies: each master host is only allowed to have one queue instance. The configuration of the all.q queue would look as follows:

% qconf -sq all.q

 qname                 all.q
 hostlist              big
 seq_no                0
 load_thresholds       NONE,[big=load_avg=4],[big1=load_avg=4],[big2=load_avg=4]
 pe_list               big_pe big1_pe big2_pe,[big=big_pe],[big1=big1_pe],[big2=big2_pe]

The job submission would look like:

% qsub -pe "big*" 5 -masterq "all.q@big*" ...


N1GE 6 - Profiling

The Grid Engine software provides a profiling facility to determine where the qmaster and the scheduler spend their time. This was introduced long before the N1GE 6 software. With the development of N1GE 6 it was greatly improved, and its improvement continued over the different updates of the N1GE 6 software. It was used very extensively to analyse bottlenecks and find misconfigurations in existing installations. Until now, the source code was the only documentation for the output format, which might change with every new update and release. Lately a document was added to the source repository to give a brief overview of the output format and the different switches. The document is not complete, though it is a good start.

Profiling document


It was fun, it was interesting. Still, I move on.

It is time to say goodbye. This will be the last entry in this blog. I will be leaving Sun in a week to start a new adventure. I enjoyed working with the Sun Grid team and got to know lots of passionate and knowledgeable people. I hope the contacts will not entirely go away, even though I am switching cities, signed up with the competition, and will most likely be as busy as I am now.

From what I have seen so far, my new home town, Aachen, will be nearly as nice as Regensburg. It's a bit bigger and right at the border to the Netherlands and Belgium.

So, for everyone, who wants to stay in contact, my email address is: sgrell @ gmx.de.

Good bye,

Stephan