StephansBlog


N1GE 6 - Monitoring the qmaster

With update 7 of the N1GE 6 software we added a new switch to monitor the qmaster. The qmaster monitoring provides statistics for each thread, showing what it has been busy with and how much time it spent on it. There are two switches that control the statistics output:

% qconf -mconf

 qmaster_params               MONITOR_TIME=0:0:20 LOG_MONITOR_MESSAGE=1

MONITOR_TIME

Specifies the time interval at which the monitoring information is printed. The monitoring is disabled by default and can be enabled by specifying an interval. The monitoring is done per thread and is written to the messages file or displayed by the "qping -f" command line tool. Example: MONITOR_TIME=0:0:10 generates and prints the monitoring information roughly every 10 seconds. The specified time is a guideline, not a fixed interval. The interval actually used is printed and can be anything between 9 and 20 seconds in this example.

LOG_MONITOR_MESSAGE

By default the monitoring information is logged into the messages files. In addition it is provided to qping and can be requested by it. The messages files can become quite big if the monitoring is enabled all the time; this switch therefore allows disabling the logging into the messages files, so that the monitoring data is only available via "qping -f".

A description of the output format can be found here.

Example output in the qmaster messages file ($SGE_ROOT/<CELL>/spool/qmaster/messages):

 04/25/2006 19:06:17|qmaster|scrabe|P|EDT: runs: 1.20r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.00 | events: 0.05/s added: 0.05/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 19.98s
 04/25/2006 19:06:17|qmaster|scrabe|P|MT(2): runs: 0.25r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.05,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.05/s) out: 0.05m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 20.10s 
 04/25/2006 19:06:18|qmaster|scrabe|P|MT(1): runs: 0.19r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.05,g:0.00,m:0.05,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.05m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 21.15s
 04/25/2006 19:06:27|qmaster|scrabe|P|TET: runs: 0.67r/s (pending: 9.00 executed: 0.67/s) out: 0.00m/s APT: 0.0205s/m idle: 98.63% wait: 0.00% time: 21.00s
 04/25/2006 19:06:37|qmaster|scrabe|P|EDT: runs: 1.60r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.00 | events: 1.10/s added: 1.10/s skipt: 0.00/s) out: 0.05m/s APT: 0.0002s/m idle: 99.97% wait: 0.00% time: 20.00s
 04/25/2006 19:06:39|qmaster|scrabe|P|MT(1): runs: 0.37r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.14,g:0.00,m:0.05,d:0.00,c:0.00,t:0.05,p:0.00)/s event-acks: 0.05/s) out: 0.32m/s APT: 0.0024s/m idle: 99.91% wait: 0.00% time: 21.55s

If we use the following settings:

% qconf -mconf

 qmaster_params               MONITOR_TIME=0:0:20 LOG_MONITOR_MESSAGE=0

We will need to use qping to gain access to the monitoring messages. This should be the preferred way, because we get the statistics from the communication layer together with the statistics from the qmaster. Here is an example:
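The output below can be requested with a qping call along these lines (the qmaster host and port here are placeholders; use the qmaster host of your cell and the sge_qmaster port of your installation):

% qping -i 20 -f <qmaster_host> <qmaster_port> qmaster 1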

 04/25/2006 19:09:53:
 SIRM version:             0.1
 SIRM message id:          3
 start time:               04/25/2006 08:45:06 (1145947506)
 run time [s]:             37487
 messages in read buffer:  0
 messages in write buffer: 0
 nr. of connected clients: 3
 status:                   0
 info:                     TET: R (1.99) | EDT: R (0.99) | SIGT: R (37486.73) | MT(1): R (3.99) | MT(2): R (0.99) | OK

Monitor:

 04/25/2006 19:09:47 | TET: runs: 0.40r/s (pending: 9.00 executed: 0.40/s) out: 0.00m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 20.00s
 04/25/2006 19:09:37 | EDT: runs: 1.00r/s (clients: 1.00 mod: 0.00/s ack: 0.00/s blocked: 0.00 busy: 0.00 | events: 0.00/s added: 0.00/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 20.00s
 04/25/2006 08:45:07 | SIGT: no monitoring data available
 04/25/2006 19:09:36 | MT(1): runs: 0.15r/s (execd (l:0.04,j:0.04,c:0.04,p:0.04,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 26.86s
 04/25/2006 19:09:39 | MT(2): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% wait: 0.00% time: 21.04s


N1GE 6 - Scheduler Hacks: Exclusive master host for the master task

There was some discussion on the open source mailing lists and a lot of interest in how one can single out the master task to a special host and have all the slave tasks on the compute nodes. There can be multiple reasons to do this; the one I heard most often was that the master task needs a lot of memory and a special host exists just for that purpose.

During the discussion we came across three workarounds for the problem. I will start with the easiest setup and end with the most complicated. Since they are all workarounds, none of them is perfect. Nevertheless, they do achieve the goal more or less.

1) Using the host sorting mechanism:

Description:

Grid Engine allows sorting hosts and queues by sequence number. Assuming that we have only one cluster queue for the compute nodes and the parallel environment is configured to use $fill_up, we can assign the compute queue instances a smaller sequence number than the master machines. The job would request the pe to run in and the master machine as the masterq. This way all slaves run on the compute nodes, which are filled up first, and the master task is singled out to the master machine due to its special request.

If the environment has more than one master host, wild cards in the masterq request can be used to select one of the master hosts.

Advantages:

Makes the best use of all resources and is easy to set up, understand, and debug. This setup also has the least performance impact.

Problems:

As soon as there are not enough compute nodes available, the scheduler will assign more than one task to the master machine.

Configuration:

Change the queue sort order in the scheduler config:

% qconf -msconf

 queue_sort_method                seqno

The queue on the small hosts gets:

% qconf -mq <queue>

 seq_no                           0

The queue for the master hosts gets:

% qconf -mq <queue>

 seq_no                           1

A job submit would look like:

% qsub -pe <PE> 6 -masterq "*@master*" ...

2) Making excessive use of pe objects and cluster queues:

Description:

Each slot on a master host needs its own cluster queue and its own pe. The compute nodes are combined under one cluster queue that contains all pe objects used on the master hosts. Each master cluster queue has exactly one slot. The job submission now requests both the master queue and the pe it should run in via wild cards.

Advantages:

Achieves the goal.

Problems:

Many configuration objects. Slows down the scheduler quite a bit.

Configuration:

I will leave the detailed configuration for this one open. It should not be complicated...
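For illustration only, a rough sketch of how such a setup could look for a master host "big" with two usable slots (all queue and pe names here are made up, and the listings are shortened to the relevant attributes):

% qconf -sq master1.q

 qname                 master1.q
 hostlist              big
 slots                 1
 pe_list               master1_pe

% qconf -sq compute.q

 qname                 compute.q
 hostlist              @computehosts
 pe_list               master1_pe master2_pe

The second master slot gets master2.q with master2_pe in the same way. A job submission would then use wild cards for both the pe and the master queue:

% qsub -pe "master*_pe" 6 -masterq "master*.q" ...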

3) Using load adjustments:

Description:

The scheduler uses load adjustments to avoid overloading a host. The system can be configured in such a way that the scheduler starts no more than one task on a host in one scheduling run, even though more slots are available. We will use this to achieve the desired goal.

Advantages:

Achieves exactly what we are looking for without any additional configuration objects.

Problems:

Slows down scheduling. Only one job requesting the master host will be started in one scheduling run. Supporting backup master hosts is not easy.

The master machine is only allowed to have one queue instance, or all queue instances of the master machine have to share the same load threshold. If that is not the case, it will not work.

Configuration:

I have the following setup:

% qstat -f

 queuename               qtype used/tot. load_avg arch         states
 ----------------------------------------------------------------------------
 all.q@big                  BIP  0/4       0.02   sol-sparc64
 ----------------------------------------------------------------------------
 small.q@small1             BIP  0/1       0.00   lx24-amd64
 ----------------------------------------------------------------------------
 small.q@small2             BIP  0/1       0.02   sol-sparc64

And a configured pe in all queue instances:

% qconf  -sp make

 pe_name             make
 slots               999
 user_lists          NONE
 xuser_lists         NONE
 start_proc_args     NONE
 stop_proc_args      NONE
 allocation_rule     $fill_up
 control_slaves      TRUE
 job_is_first_task   FALSE
 urgency_slots       min

We now go ahead and change the load_thresholds in the all.q@big queue instance to a load value that is not used in the other queue instances, such as:

% qconf -sq all.q

 qname                 all.q
 hostlist              big
 seq_no                0
 load_thresholds       NONE,[big=load_avg=4]

The load threshold used has to be a real load value; it cannot be a fixed or consumable value.

The next step to make our environment work is to change the scheduler configuration to the following:

% qconf -ssconf

 algorithm                         default
 schedule_interval                 0:2:0
 maxujobs                          0
 queue_sort_method                 load
 job_load_adjustments              load_avg=4.100000
 load_adjustment_decay_time        0:0:1

By changing the scheduler configuration to use job_load_adjustments like this, the scheduler adds an artificial load to each host that will run a task. The artificial load of 4.1 pushes the master host over its load_avg=4 threshold, so only one task can be started on the master machine in each scheduling run. Since the load_adjustment_decay_time is only 1 second, the scheduler has forgotten about the artificial load by the next scheduling run and can start a new task on the master host. This way we achieve what we have been looking for.
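With this setup a job submission could look like the following sketch (the slot count is just an example; the pe is the make pe shown above and the master queue instance on the big host is requested explicitly):

% qsub -pe make 5 -masterq "all.q@big" ...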

Extended Configuration:

If the usage of multiple master hosts is required, one needs to create one pe object per master host. The compute hosts are part of all pe objects. The same rule as above still applies: each master host is only allowed to have one queue instance. The configuration of the all.q queue would look as follows:

% qconf -sq all.q

 qname                 all.q
 hostlist              big
 seq_no                0
 load_thresholds       NONE,[big=load_avg=4],[big1=load_avg=4],[big2=load_avg=4]
 pe_list               big_pe big1_pe big2_pe,[big=big_pe],[big1=big1_pe],[big2=big2_pe]

The job submit would look like:

% qsub -pe "big*" 5 -masterq "all.q@big*" ...




N1GE 6 - Profiling

The Grid Engine software provides a profiling facility to determine where the qmaster and the scheduler spend their time. It was introduced long before the N1GE 6 software. With the development of N1GE 6 it was greatly improved, and the improvements continued over the different updates we had for the N1GE 6 software. It was used very extensively to analyse bottlenecks and find misconfigurations in existing installations. Until now, the source code was the only documentation for the output format, which might change with every new update and release. Lately a document was added to the source repository to give a brief overview of the output format and the different switches. The document is not complete, but it is a good start.

Profiling document
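As a quick pointer (the details and further switches are described in the document above): to my knowledge the scheduler profiling output in N1GE 6 can be switched on via the params attribute of the scheduler configuration, for example:

% qconf -msconf

 params                            PROFILE=1

The profiling summary lines then show up in the scheduler messages file.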




It was fun, it was interesting. Still, I move on.

It is time to say goodbye. This will be the last entry in this blog. I will be leaving Sun in a week to start a new adventure. I did enjoy working with the Sun Grid team. I got to know lots of passionate and knowledgeable people. I hope the contacts will not entirely go away, even though I am switching cities, have signed up with the competition, and will most likely be as busy as I am now.

From what I have seen so far, my new home town, Aachen, will be nearly as nice as Regensburg. It's a bit bigger and right on the border with the Netherlands and Belgium.

So, for everyone who wants to stay in contact, my email address is: sgrell @ gmx.de.

Good bye,

Stephan