Olesen-FLEXlm-Integration

Grid Engine 6 FlexLM License Integration (Olesen Method)

  • This document was last updated: Tue May 4 17:20:50 BST 2010
  • Computational chemistry software from Schrödinger was extensively used during the creation and testing of this document and the methods it describes. It is difficult to write about FLEXlm license integration without having licensed applications on hand to test with. The authors of this document would like to acknowledge and thank Schrödinger for their generous donation of software and licenses.

Background

Integrating licensed applications into Grid Engine systems is pretty easy if:

  • The application is unrestricted due to site-wide or enterprise licensing agreements
  • The application is node-locked to particular cluster nodes
  • The available license pool is larger than the number of job slots managed by Grid Engine
  • The license server and application are exclusive to the Grid Engine system (no external usage at all)


If you need to integrate a licensed application into a Grid Engine (aka N1GE) managed system and any of the above conditions applies to your situation, consider yourself lucky: you will not need the methods described in this document. Node-locked applications can be handled via simple configuration of Grid Engine requestable resources, and dedicated cluster license servers (where the cluster nodes are the only possible consumers of license tokens) can be easily handled via user requestable, consumable resources. Search the Grid Engine or N1GE documentation for "requestable resources" and "consumable resources" for specific implementation details.


This document is for people who need to handle multiple license servers scattered throughout an enterprise, or instances where the Grid Engine cluster is not the sole consumer of license tokens. It describes a new method of making the Grid Engine scheduler and resource allocation policy mechanisms aware of the ever-changing license availability data.


Readers are encouraged to follow and read the links located in the Primary References section. There is a significant amount of intro and background material contained in those links that is not duplicated here.

History

Mark Olesen, posting on the Grid Engine users list, made some very interesting comments about FlexLM license integration when Grid Engine load sensors are used for resource reporting. In particular, he proposed a method by which the entire load sensor process could be bypassed altogether in favor of methods that directly adjust the value of user requestable consumable resources within the Grid Engine complex.

Mark also mentioned that he was willing to share his code but that he did not have the time to handle user questions or any sort of support related duties. Chris Dagdigian volunteered to provide support via the http://gridengine.info site and the users@gridengine.sunsource.net mailing list.


Primary References

Additional Reference Material

The Proposed Methods

Problems With The Load Sensor Approach

An obvious problem with the load sensor approach is the delay associated with the load reports, as mentioned in the online documentation (see http://gridengine.sunsource.net/project/gridengine/howto/resource_management.html):

  Unfortunately, due to the loadsensor's delay, it can't be 100% excluded
  that batch jobs are dispatched and started although the license has been
  aquired by an interactive job.

The problem is actually much more serious than suggested by this warning! A race condition between a GridEngine job and an interactive job is less problematic than what actually occurs.

In the following examples, we'll examine how the licenses are managed with different approaches. For the sake of clarity, two new pseudo-variables, internal_count and available, have been introduced to reflect the current internal GridEngine state. The other variables - complex_values and load_values - are retrieved via qconf -se global.

The Pure Load Sensor Approach

Here the complex_values are left as NONE. The license availability is managed exclusively over the load sensor. This combination has the interesting side-effect that the internal bookkeeping is not used.

Consider the following:

Start:
    all licenses are available 

  load_values      license=4
  complex_values   NONE
  (internal_count) NONE
  (available)      license=4

Then:
   launch X jobs, each with -l license=4

Since all nodes provide resource license=4 and there is no internal bookkeeping to track the consumption of the resource, all jobs attempt to start at the same scheduling interval. Only one job wins the race and others fail with licensing problems.
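
The race described above can be illustrated with a small simulation. This is only a sketch of the reasoning, not Grid Engine's scheduler code; the function name dispatch_without_bookkeeping and the choice of three jobs are assumptions made for illustration.

```python
def dispatch_without_bookkeeping(load_value, requests):
    """Return the indices of jobs dispatched in one scheduling interval.

    With complex_values set to NONE there is no internal count, so every
    job is compared against the same, unchanged load sensor value.
    """
    return [i for i, need in enumerate(requests) if need <= load_value]


# Three jobs, each submitted with -l license=4, while the load sensor
# still reports license=4.
started = dispatch_without_bookkeeping(load_value=4, requests=[4, 4, 4])
licenses_requested = 4 * len(started)
print(f"dispatched jobs: {started}, licenses requested: {licenses_requested}")
```

All three jobs are dispatched against only four real licenses, so two of them can be expected to fail at license checkout.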

A Combined Internal and Load Sensor Approach

Here the complex_values are set to the number of licenses available. The GridEngine decides on availability based on complex_values minus (internal_count) or the load_values. The lowest value dictates the availability, as mentioned in the online documentation (see http://gridengine.sunsource.net/project/gridengine/howto/loadsensor.html):

  The lesser of the Consumable Resources or the load sensor
  value will be used to prevent license oversubscription.

Consider the following:

Start:
    all licenses are available 

  load_values      license=4
  complex_values   license=4
  (internal_count) NONE
  (available)      license=4

Then:
   launch two jobs, each with -l license=4:

Since complex_values exist, the internal bookkeeping is used to track license availability and only one job is dispatched:

  load_values      license=4
  complex_values   license=4
  (internal_count) license=4
  (available)      license=0

After some delay, the load sensor will catch up to the current status.

  load_values      license=0
  complex_values   license=4
  (internal_count) license=4
  (available)      license=0

When the first job finishes, the internal count drops back to zero, but the stale load value still reports no free licenses.

  load_values      license=0
  complex_values   license=4
  (internal_count) license=0
  (available)      license=0

After some delay, the load sensor will catch up to the current status and the second job can start.

  load_values      license=4
  complex_values   license=4
  (internal_count) license=0
  (available)      license=4

Despite some delays associated with the load sensor, only a single job runs at a time and this approach seems to be behaving as expected. However, the bookkeeping becomes less robust when non-GridEngine usage is tracked too!

Start:
    all licenses are available 

  load_values      license=4
  complex_values   license=4
  (internal_count) NONE
  (available)      license=4

Then:
start a non-GridEngine job that occupies 2 licenses

After a delay, the load sensor reports that only two licenses are available.

  load_values      license=2
  complex_values   license=4
  (internal_count) NONE
  (available)      license=2

Then:
launch two jobs via the GridEngine, each with -l license=2:

Since there are only 2 licenses available, and internal bookkeeping tracks the resource consumption, only one job is started at the first scheduling interval. The internal count is incremented accordingly:

  load_values      license=2
  complex_values   license=4
  (internal_count) license=2
  (available)      license=2

At the next scheduling interval, there are still 2 licenses available (the lower limit of the internal bookkeeping and the external load report) and the second job will be started. This job will fail with licensing problems.
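
The failure can be traced with a short simulation. This is a sketch of the rule quoted above (the lesser of the consumable remainder and the load value), not Grid Engine's actual scheduler; the names available and actually_free are illustrative assumptions, and the load sensor is assumed not to have updated between the two scheduling intervals.

```python
def available(complex_value, internal_count, load_value):
    """Lesser of the consumable remainder and the load sensor value."""
    return min(complex_value - internal_count, load_value)


complex_value = 4    # complex_values license=4
internal_count = 0   # nothing booked by GridEngine yet
load_value = 2       # load sensor: 4 total minus 2 used externally
actually_free = 2    # what the license server can really hand out

log = []
for job, need in enumerate([2, 2], start=1):   # two jobs, -l license=2
    if need <= available(complex_value, internal_count, load_value):
        internal_count += need                 # scheduler books the request
        if need <= actually_free:
            actually_free -= need
            log.append((job, "runs"))
        else:
            log.append((job, "fails: no license"))
    else:
        log.append((job, "waits"))

print(log)
```

The external usage is subtracted only from the load value, never from the consumable, so the second job still sees min(4 - 2, 2) = 2 and is dispatched.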

It is obvious from the above examples that these approaches cannot work correctly with mixed license usage.


A Proposed Solution

The only obvious solution to the problem is to change the load sensor so that it does not report any values at all, but instead adjusts the complex_values directly.

Start:
    all licenses are available 

  load_values      NONE
  complex_values   license=4
  (internal_count) NONE
  (available)      license=4

Then:
start a non-GridEngine job that occupies 2 licenses:

After a delay, the load sensor adjusts the number of licenses available for the GridEngine.

  load_values      NONE
  complex_values   license=2
  (internal_count) NONE
  (available)      license=2

Apart from the delay inherent in the load sensor approach, there is no internal race condition, and we've thus eliminated the significant failings of the previous approaches. Small problems still exist, but at least the worst problems have been addressed.
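
The same mixed-usage scenario can be replayed with the adjusted consumable. This is a minimal sketch, assuming the adjusted value is simply the total minus the external usage, as in the example above; it is not taken from the qlicserver source, and the function names are hypothetical.

```python
def adjusted_complex_value(total, external_in_use):
    """Licenses left for GridEngine once external usage is subtracted."""
    return max(total - external_in_use, 0)


def schedule(complex_value, requests):
    """Dispatch jobs using only the internal bookkeeping (no load value)."""
    internal_count = 0
    states = []
    for need in requests:
        if need <= complex_value - internal_count:
            internal_count += need
            states.append("runs")
        else:
            states.append("waits")
    return states


# 4 licenses in total, 2 held by a non-GridEngine job.
complex_value = adjusted_complex_value(total=4, external_in_use=2)
states = schedule(complex_value, [2, 2])   # two jobs, -l license=2
print(complex_value, states)
```

Because there is now a single number that both the external usage and the internal count act on, the second job waits instead of failing.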

Usable Code

Understand The License Terms

Mark Olesen has released his code under a Creative Commons license:

QUICK SUMMARY

  You are free:
    * to copy, distribute, display, and utilize the work
    * to make derivative works

  Under the following conditions:
    * You must attribute the work and leave copyrights intact.
    * You may not use this work for commercial purposes (i.e., sell it) --
      unless you get the licensor's permission.

    * If you alter, transform, or build upon this work, you may distribute
      the resulting work only under a license identical to this one.

For the full license terms, visit http://creativecommons.org/licenses/by-nc-sa/2.5/

Download

Please note that this code is provided as a courtesy to other users with absolutely no guarantees! Usage questions should be posted to the users@gridengine.sunsource.net mailing list - please do not email the author directly.

flex-grid/
flex-grid/COPYING
flex-grid/README.txt
flex-grid/scripts/filter-accounting
flex-grid/scripts/qlic
flex-grid/site/qlicserver
flex-grid/site/qloadsensor
flex-grid/site/qlicserver.limit.EXAMPLE/foam
flex-grid/site/qlicserver.limit.EXAMPLE/gtpower
flex-grid/site/qlicserver.limit.EXAMPLE/starcd
flex-grid/site/qlicserver.config.EXAMPLE
...

qlic

The utility program 'qlic' lists the contents of the qlicserver license data cache file with some pretty printing. Example output:

host = dealog01
age  = 0:00:47
feature       total limit extern intern wait free
-------       ----- ----- ------ ------ ---- ----
abaqus           12     .      5      5   10    2
foam             40     .      *     16    .   24
gtise             5     .      2      *    .    3
gtpower           8     4      .      -    .    4
hexa              2     .      .      *    .    2
nastran           3     .      .      3    .    .
...

Non-managed licenses are marked with '*'.

usage:
    qlic [OPTION]
    qlic [OPTION] resource=limit .. resource=limit

with options:
  -a       combined with '-u' = include active jobs
  -c FILE  alternative file to parse
  -C FILE  alternative location for the license limit
  -d       dump license cache in raw  format
  -D       dump license cache in perl format
  -f       display free licenses only
  -l       list license limit
  -q       display free licenses via qhost query
  -u       license usage via 'lacct'
  -U       license usage per user via 'lacct -u'
  -w       show who/where ('[]' indicates waiting jobs)
  -h       this help

* extract / display information for the GridEngine license cache
  /opt/grid/default/site/cache/qlicserver.xml

* adjust / display information for the license limits
  /opt/grid/default/site/qlicserver.limits

copyright (c) 2003-09 <Mark.Olesen@emconTechnologies.com>

Licensed and distributed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 License.
http://creativecommons.org/licenses/by-nc-sa/3.0


Installing and configuring 'qlicserver'

Step by Step

Assumptions

This step-by-step guide uses FlexLM licensed computational chemistry software produced by Schrödinger as its example. It is difficult to write about FlexLM license integration without actually having licensed applications on hand to aid in testing and troubleshooting. The authors of this document would like to acknowledge and thank Schrödinger for providing software and a range of floating license tokens.

Schrödinger "Glide" software is interesting in how it is licensed. Each instance of a Glide application running on a cluster node will consume 4x "IMPACT_GLIDE" license tokens and 1x "IMPACT_MAIN" token. This means that in order to integrate Glide into a Grid Engine cluster it is necessary to make Grid Engine constantly aware of the number of available "IMPACT_GLIDE" and "IMPACT_MAIN" resources.

License servers can be queried with vendor supplied tools. In this case, the output of the Schrödinger vendor license server looks like this:

$ /opt/schrodinger/licadmin STAT
lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
Flexible License Manager status on Tue 11/15/2005 19:33

License server status: 27000@gw
    License file(s) on gw: /opt/schrodinger/license:

        gw: license server UP (MASTER) v9.5

Vendor daemon status (on gw):

    SCHROD: UP v9.5

Feature usage info:

Users of IMPACT_GLIDE:  (Total of 40 licenses issued;  Total of 0 licenses in use)

Users of MAESTRO_MAIN:  (Total of 1000 licenses issued;  Total of 0 licenses in use)

Users of MMLIBS:  (Total of 1000 licenses issued;  Total of 0 licenses in use)

Users of IMPACT_MAIN:  (Total of 20 licenses issued;  Total of 0 licenses in use)

In this case we are only interested in the status of the IMPACT_GLIDE and IMPACT_MAIN tokens.
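
The "Users of ..." lines shown above are regular enough to be parsed mechanically. The following is a minimal sketch, assuming the output keeps exactly the format shown here; it is not how qlicserver itself parses lmstat output, and the names FEATURE_RE, parse_feature_usage and glide_jobs are illustrative.

```python
import re

# Matches lines such as:
# Users of IMPACT_GLIDE:  (Total of 40 licenses issued;  Total of 0 licenses in use)
FEATURE_RE = re.compile(
    r"Users of (?P<feature>\S+):\s+"
    r"\(Total of (?P<issued>\d+) licenses? issued;\s+"
    r"Total of (?P<in_use>\d+) licenses? in use\)"
)


def parse_feature_usage(text):
    """Map each feature name to its issued, in-use and free token counts."""
    usage = {}
    for m in FEATURE_RE.finditer(text):
        issued, in_use = int(m["issued"]), int(m["in_use"])
        usage[m["feature"]] = {
            "issued": issued,
            "in_use": in_use,
            "free": issued - in_use,
        }
    return usage


SAMPLE = """\
Users of IMPACT_GLIDE:  (Total of 40 licenses issued;  Total of 0 licenses in use)

Users of IMPACT_MAIN:  (Total of 20 licenses issued;  Total of 0 licenses in use)
"""

usage = parse_feature_usage(SAMPLE)
# Each Glide instance needs 4 IMPACT_GLIDE tokens and 1 IMPACT_MAIN token.
glide_jobs = min(usage["IMPACT_GLIDE"]["free"] // 4,
                 usage["IMPACT_MAIN"]["free"] // 1)
print(usage, glide_jobs)
```

With the counts shown above, IMPACT_GLIDE is the limiting token: 40 free tokens allow at most 10 concurrent Glide instances.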

Obtain the software

$ wget http://github.com/olesenm/flex-grid/tarball/master
...
2009-10-13 09:45:50 (101 KB/s) - 'olesenm-flex-grid-3e55c61.tar.gz' saved [27772/27772]

$ zcat olesenm-flex-grid-3e55c61.tar.gz | tar xvf -

Since we grabbed a tarball from github, the repository owner is prefixed and the abbreviated commit hash is appended to the name. We'll simply rename the directory to get rid of those:

$ mv olesenm-flex-grid-3e55c61 flex-grid

In this example, the SGE admin username is 'sgeadmin'. The qlicserver application will be installed into /home/sgeadmin/qlicserver.

Unless the "output" parameter is given, running qlicserver will not generate any output. At least for testing and troubleshooting, the output should be sent to a file or STDOUT via the "output" parameter. For example, to send it to STDOUT:

$ cd flex-grid/site
$ ./qlicserver -n output=-


Ready for testing

At this point, the following steps have been completed:

  1. Verified that the command "lmutil lmstat -a -c $LM_LICENSE_FILE" works from the command line
  2. Downloaded qlicserver and configured it to monitor the appropriate license tokens. The command "./qlicserver -i" can be helpful here.
  3. Used the command "./qlicserver -c" to generate the text necessary to register the new consumable resource attributes within Grid Engine
  4. Used the Grid Engine command "qconf -mc" and pasted in the text from step 3 to create the new consumable resource attributes
  5. Verified via the Grid Engine command "qconf -sc" that the newly created complex attributes exist
  6. Used the command "./qlicserver -C" to generate the SGE command that initializes the values of the new consumable resource attributes
  7. Ran the Grid Engine command displayed by "./qlicserver -C" to set initial values for the newly created consumable resource attributes
  8. Verified via the SGE command "qstat -F <attribute>" that there are now values associated with the license tracking attributes

If the above steps have been successfully completed, testing can begin!

Initial testing -- first query FlexLM to see the "real" status output

We want to see the current status data directly from FlexLM to learn what values qlicserver should be picking up.

$ lmutil lmstat -a -c $LM_LICENSE_FILE
lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
Flexible License Manager status on Wed 11/16/2005 15:39

License server status: 27000@gw
    License file(s) on gw: /opt/schrodinger/license:

        gw: license server UP (MASTER) v9.5

Vendor daemon status (on gw):

    SCHROD: UP v9.5

Feature usage info:

Users of IMPACT_GLIDE:  (Total of 40 licenses issued;  Total of 0 licenses in use)

Users of MAESTRO_MAIN:  (Total of 1000 licenses issued;  Total of 0 licenses in use)

Users of MMLIBS:  (Total of 1000 licenses issued;  Total of 0 licenses in use)

Users of IMPACT_MAIN:  (Total of 20 licenses issued;  Total of 0 licenses in use)

Ok, no change here. The important information is that there are 40 IMPACT_GLIDE tokens available and 20 IMPACT_MAIN tokens available.

When we next run qlicserver, we expect that it should pick these values up and automatically adjust the Grid Engine resource attribute values correctly.

Run qlicserver for the first time using the "-n" switch

Running "./qlicserver -n" will prevent any Grid Engine configuration commands from being executed. Consider this a way to perform a "dry run" test. To get a report, the "output=-" parameter is used to write to STDOUT.

Here we go!

$ ./qlicserver -n output=-
<?xml version="1.0"?>
<?qlicserver date="2009-10-13T09:52:56"?>
<qlicserver releaseDate="2008-10-12">
<!-- adjustment:
     qconf -mattr exechost complex_values i_glide=40,i_main=20 global
-->
 <query>
  <cluster name="dcore" root="/opt/sge-6s2u1" cell="default"/>
  <host>dcore-amd.sonsorol.net</host>
  <user>sgeadmin</user>
  <time epoch="1132173792">2005-11-16T15:43:12-0500</time>
 </query>
 <parameters>
  <env name="SGE_ROOT">/opt/sge-6s2u1</env>
  <env name="LM_LICENSE_FILE">/opt/schrodinger/license</env>
  <param name="output">-</param>
 </parameters>
 <resources>
  <resource name="i_glide" served="IMPACT_GLIDE" total="40" free="40"/>
  <resource name="i_main" served="IMPACT_MAIN" total="20" free="20"/>
 </resources>
</qlicserver>

Success!

There are 3 important things to note about the output shown above:

  1. The qlicserver script has successfully queried FlexLM and obtained accurate license counts (i_glide=40, i_main=20)
  2. The qlicserver script has queried Grid Engine and realized that the current SGE values (i_main=0, i_glide=0) do not match what FlexLM is reporting
  3. The qlicserver script has generated the SGE command "qconf -mattr exechost complex_values i_glide=40,i_main=20 global", which will instantly revise the SGE i_main and i_glide parameters to reflect the data collected from the FlexLM license server. Because we launched the script with the "-n" switch, this action was not actually performed.
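
The relationship between the report's resource elements and the adjustment command can be sketched as follows. This is an illustration only, not the qlicserver implementation: build_adjustment is a hypothetical helper, the report is trimmed to the elements used here, and the sketch reads the total attribute, which equals free in this sample. Which attribute qlicserver itself uses is not stated in this document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the report shown above (only the elements used here).
REPORT = """<?xml version="1.0"?>
<qlicserver releaseDate="2008-10-12">
 <resources>
  <resource name="i_glide" served="IMPACT_GLIDE" total="40" free="40"/>
  <resource name="i_main" served="IMPACT_MAIN" total="20" free="20"/>
 </resources>
</qlicserver>
"""


def build_adjustment(report_xml):
    """Build a qconf command line from the <resource> elements."""
    root = ET.fromstring(report_xml)
    # Assumption: "total" is the value to publish (same as "free" here).
    pairs = [f'{r.get("name")}={r.get("total")}' for r in root.iter("resource")]
    return "qconf -mattr exechost complex_values " + ",".join(pairs) + " global"


command = build_adjustment(REPORT)
print(command)
```

The resulting string matches the adjustment shown in the XML comment of the dry-run report.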

The only things left to do now are:

  • Run "qlicserver" persistently in daemon mode
  • Start submitting licensed jobs

Running the qlicserver daemon

At this point, the following steps have been completed:

  1. qlicserver has been configured
  2. Grid Engine license tracking resource attributes have been created, initialized and validated
  3. qlicserver has been tested in "dry run" mode via the "-n" switch

If the above steps worked, the qlicserver code can now be run as a persistent daemon with the "-d" switch. The default polling interval is 30 seconds and can be overridden with the "delay" parameter. The "timeout" parameter is not used for polling; it sets the timeout for FlexLM server communication attempts.

In the following example, a 40 second polling interval is used:

 
$ ./qlicserver -d delay=40

$ ps ax | grep qlicserver
14546 ?        Ss     0:00 /usr/bin/perl -w ./qlicserver -d delay=40

Running qlicserver this way will not create any output. To create a report, especially useful for initial testing and troubleshooting, use the "output" parameter. When testing is complete, qlicserver can be stopped and restarted with the final parameters.

Caveat: the options must appear before the command-line parameters (e.g., 'qlicserver delay=40 -d' is wrong). The correct form is:

$ ./qlicserver -d delay=40

Stopping the qlicserver daemon

The following command can be used to stop qlicserver when it is operating in daemon mode:

$ ./qlicserver -k


Assistance & Help

Please direct questions to the users@gridengine.sunsource.net mailing list or if necessary directly to Chris Dagdigian (<dag@sonsorol.org>).

Please feel free to improve upon this Wiki page as well!
