N1GE 6 - Monitoring the qmaster
With update 7 of the N1GE 6 software we added a new switch for monitoring the qmaster. The qmaster monitoring provides statistics for each thread, showing what the threads have been busy with and how much time they spent on it. Two switches control the statistics output:
% qconf -mconf
qmaster_params MONITOR_TIME=0:0:20 LOG_MONITOR_MESSAGE=1
MONITOR_TIME specifies the time interval at which the monitoring information is printed. Monitoring is disabled by default and is enabled by specifying an interval. The monitoring is per thread and is written to the messages file or displayed by the "qping -f" command line tool. Example: MONITOR_TIME=0:0:10 generates the monitoring information roughly every 10 seconds and prints it. The specified time is a guideline, not a fixed interval; the interval actually used is printed and can be anywhere between 9 and 20 seconds in this example.
LOG_MONITOR_MESSAGE controls where the data goes. By default the monitoring information is logged to the messages files and, in addition, is made available to qping on request. The messages files can become quite big if monitoring is enabled all the time; this switch therefore lets you disable logging to the messages files, so that the monitoring data is available only via "qping -f".
A description of the output format can be found here.
Example output in the qmaster messages file ($SGE_ROOT/<CELL>/spooling/qmaster/messages):
04/25/2006 19:06:17|qmaster|scrabe|P|EDT: runs: 1.20r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.00 | events: 0.05/s added: 0.05/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 19.98s
04/25/2006 19:06:17|qmaster|scrabe|P|MT(2): runs: 0.25r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.05,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.05/s) out: 0.05m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 20.10s
04/25/2006 19:06:18|qmaster|scrabe|P|MT(1): runs: 0.19r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.05,g:0.00,m:0.05,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.05m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 21.15s
04/25/2006 19:06:27|qmaster|scrabe|P|TET: runs: 0.67r/s (pending: 9.00 executed: 0.67/s) out: 0.00m/s APT: 0.0205s/m idle: 98.63% wait: 0.00% time: 21.00s
04/25/2006 19:06:37|qmaster|scrabe|P|EDT: runs: 1.60r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.00 | events: 1.10/s added: 1.10/s skipt: 0.00/s) out: 0.05m/s APT: 0.0002s/m idle: 99.97% wait: 0.00% time: 20.00s
04/25/2006 19:06:39|qmaster|scrabe|P|MT(1): runs: 0.37r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.14,g:0.00,m:0.05,d:0.00,c:0.00,t:0.05,p:0.00)/s event-acks: 0.05/s) out: 0.32m/s APT: 0.0024s/m idle: 99.91% wait: 0.00% time: 21.55s
If we use the following settings:
% qconf -mconf
qmaster_params MONITOR_TIME=0:0:20 LOG_MONITOR_MESSAGE=0
We will need to use qping to gain access to the monitoring messages. This should be the preferred way, because we get the statistics from the communication layer along with the statistics from the qmaster. Here is an example:
SIRM version:             0.1
SIRM message id:          3
start time:               04/25/2006 08:45:06 (1145947506)
run time [s]:             37487
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 3
status:                   0
info:                     TET: R (1.99) | EDT: R (0.99) | SIGT: R (37486.73) | MT(1): R (3.99) | MT(2): R (0.99) | OK
04/25/2006 19:09:47 | TET: runs: 0.40r/s (pending: 9.00 executed: 0.40/s) out: 0.00m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 20.00s
04/25/2006 19:09:37 | EDT: runs: 1.00r/s (clients: 1.00 mod: 0.00/s ack: 0.00/s blocked: 0.00 busy: 0.00 | events: 0.00/s added: 0.00/s skipt: 0.00/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 20.00s
04/25/2006 08:45:07 | SIGT: no monitoring data available
04/25/2006 19:09:36 | MT(1): runs: 0.15r/s (execd (l:0.04,j:0.04,c:0.04,p:0.04,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0002s/m idle: 100.00% wait: 0.00% time: 26.86s
04/25/2006 19:09:39 | MT(2): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% wait: 0.00% time: 21.04s
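Output like the above can be requested with an invocation along the following lines; the host name scrabe and the port 6444 are placeholders for your qmaster host and its sge_qmaster port:

```shell
# -f requests the full monitoring output, -i sets the polling interval
% qping -i 20 -f scrabe 6444 qmaster 1
```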
N1GE 6 - Scheduler Hacks: Exclusive master host for the master task
There was some discussion on the open source mailing lists, and a lot of interest in how one can single out the master task onto a special host and have all the slave tasks on the compute nodes. There can be multiple reasons to do this; the one I heard most often was that the master task needs a lot of memory and a special host exists just for that purpose.
During the discussion we came across three workarounds for the problem. I will start with the easiest setup and end with the most complicated one. Since they are all workarounds, none of them is perfect. Nevertheless, they achieve the goal more or less.
1) Using the host sorting mechanism:
Grid Engine can sort hosts / queues by sequence number. Assuming that we have only one cluster queue and the parallel environment is configured to use $fill_up, we can assign the compute queue instances a smaller sequence number than the master machines. The job requests the PE to run in and the master machine as the masterq. This way, all slaves run on the compute nodes, which are filled up first, and the master task is singled out to the master machine due to its special request.
If the environment has more than one master host, wild cards in the masterq request can be used to select one of the master hosts.
Pro: Makes the best use of all resources, and is easy to set up, understand, and debug. This setup also has the least performance impact.
Con: As soon as there are not enough compute nodes available, the scheduler will assign more than one task to the master machine.
Change the queue sort order in the scheduler configuration:
% qconf -msconf
The queue on the small hosts gets:
% qconf -mq <queue>
The queue for the master hosts gets:
% qconf -mq <queue>
A job submit would look like:
% qsub -pe <PE> 6 -masterq "*@master*" ...
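To make this concrete, here is a sketch of the settings involved; the queue names compute.q and master.q and the seq_no values 0 and 10 are illustrative, not taken from a real setup:

```shell
% qconf -msconf
   # sort queue instances by sequence number instead of by load:
   queue_sort_method  seqno

% qconf -mq compute.q
   # the compute nodes get the smaller sequence number,
   # so $fill_up fills them first:
   seq_no             0

% qconf -mq master.q
   # the master hosts come last and are only picked early
   # when requested explicitly via -masterq:
   seq_no             10
```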
2) Making excessive use of pe objects and cluster queues:
Each slot on a master host needs its own cluster queue and its own PE. The compute nodes are combined under one cluster queue that holds all PE objects used on the master hosts. Each master cluster queue has exactly one slot. The job submit then requests both the master queue and the PE it should run in via wild cards.
Pro: Achieves the goal.
Con: Many configuration objects. Slows down the scheduler quite a bit.
I will leave the configuration for this one open; it should not be complicated...
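That said, a rough sketch for a single master host with two slots might look like the following; every name here (master1_q1, pe_m1s1, and so on) is made up for illustration:

```shell
# one single-slot cluster queue plus one PE per master-host slot:
% qconf -ap pe_m1s1      # PE referenced by the first master queue
% qconf -ap pe_m1s2      # PE referenced by the second master queue
% qconf -aq master1_q1   # slots 1, pe_list pe_m1s1
% qconf -aq master1_q2   # slots 1, pe_list pe_m1s2
% qconf -aq compute.q    # all compute nodes, pe_list pe_m1s1 pe_m1s2

# submit with wild cards for both the PE and the master queue:
% qsub -pe "pe_m1s*" 6 -masterq "master1_q*" ...
```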
3) Using load adjustments:
The scheduler uses load adjustments to avoid overloading a host. The system can be configured in such a way that the scheduler starts no more than one task on a host, even though more slots are available. We will use this configuration to achieve the desired goal.
Pro: Achieves exactly what we are looking for without any additional configuration objects.
Con: Slows down scheduling. Only one job requesting the master host will be started per scheduling run. Supporting backup master hosts is not easy.
The master machine is allowed to have only one queue instance, or all queue instances on the master machine have to share the same load threshold. If that is not the case, it will not work.
I have the following setup:
% qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@big                      BIP   0/4       0.02     sol-sparc64
----------------------------------------------------------------------------
small.q@small1                 BIP   0/1       0.00     lx24-amd64
----------------------------------------------------------------------------
small.q@small2                 BIP   0/1       0.02     sol-sparc64
And a PE configured in all queue instances:
% qconf -sp make
pe_name            make
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
We now go ahead and change load_thresholds in the all.q@big queue instance to a load value that is not used in the other queue instances, such as:
% qconf -sq all.q
qname              all.q
hostlist           big
seq_no             0
load_thresholds    NONE,[big=load_avg=4]
The load threshold used has to be a real load value; it cannot be a fixed or consumable value.
The next step in making our environment work is to change the scheduler configuration to the following:
% qconf -ssconf
algorithm                   default
schedule_interval           0:2:0
maxujobs                    0
queue_sort_method           load
job_load_adjustments        load_avg=4.100000
load_adjustment_decay_time  0:0:1
By changing the scheduler configuration to use job_load_adjustments like this, the scheduler adds an artificial load to each host that runs a task. With this configuration we can start one task on the master machine in each scheduling run. Since the load_adjustment_decay_time is only 1 second, the scheduler has forgotten the artificial load by the next scheduling run and can start a new task on the master host. This way, we achieve what we have been looking for.
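As a back-of-the-envelope check of why a single task suffices to close the queue instance for the rest of a scheduling run (this is only a toy model of the threshold comparison, with the real load 0.02 taken from the qstat output above):

```shell
# real load 0.02 plus one artificial adjustment of 4.1 already exceeds
# the load_avg threshold of 4, so the queue instance reports "closed"
awk 'BEGIN { real = 0.02; adj = 4.1; thresh = 4.0;
             state = (real + adj > thresh) ? "closed" : "open";
             print state }'
```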
If multiple master hosts are required, one needs to create one PE object per master host. The compute hosts are part of all PE objects. The same rule as above still applies: each master host is allowed to have only one queue instance. The configuration of the all.q queue would look as follows:
% qconf -sq all.q
qname              all.q
hostlist           big big1 big2
seq_no             0
load_thresholds    NONE,[big=load_avg=4],[big1=load_avg=4],[big2=load_avg=4]
pe_list            big_pe big1_pe big2_pe,[big=big_pe],[big1=big1_pe],[big2=big2_pe]
The job submit would look like:
% qsub -pe "big*" 5 -masterq "all.q@big*" ...
N1GE 6 - Profiling
The Grid Engine software provides a profiling facility to determine where the qmaster and the scheduler spend their time. It was introduced long before the N1GE 6 software. With the development of N1GE 6 it was greatly improved, and the improvements continued over the different updates of the N1GE 6 software. It has been used very extensively to analyse bottlenecks and find misconfigurations in existing installations. Until now, the source code was the only documentation of the output format, which might change with every new update and release. Lately a document was added to the source repository to give a brief overview of the output format and the different switches. The document is not complete, though it is a good start.
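As a starting point, the scheduler part of the profiling can, if I remember correctly, be switched on through the params attribute of the scheduler configuration; check the sched_conf(5) man page and the new document for the details:

```shell
% qconf -msconf
   # emit profiling output for each scheduling run into the messages file:
   params   PROFILE=1
```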
It was fun, it was interesting. Still, I move on.
It is time to say goodbye. This will be the last entry in this blog. I will be leaving Sun in a week to start a new adventure. I enjoyed working with the Sun Grid team and got to know lots of passionate and knowledgeable people. I hope the contacts will not entirely go away, even though I am switching cities, have signed up with the competition, and will most likely be as busy as I am now.
So, for everyone, who wants to stay in contact, my email address is: sgrell @ gmx.de.