Grid Engine 6 FlexLM License Integration (Olesen Method)
- This document was last updated: Tue May 4 17:20:50 BST 2010
- Computational chemistry software from Schrödinger was extensively used during the creation and testing of this document and the methods it describes. It is difficult to write about FLEXlm license integration without having licensed applications on hand to test with. The authors of this document would like to acknowledge and thank Schrödinger for their generous donation of software and licenses.
Integrating licensed applications into Grid Engine systems is fairly easy if any of the following is true:
- The application is unrestricted due to site-wide or enterprise licensing agreements
- The application is node-locked to particular cluster nodes
- The available license pool is larger than the number of job slots managed by Grid Engine
- The license server and application are exclusive to the Grid Engine system (no external usage at all)
If you need to integrate a licensed application into a Grid Engine (aka N1GE) managed system and any of the above conditions applies to your situation, consider yourself lucky: you will not need the methods described in this document. Node-locked applications can be handled via simple configuration of Grid Engine requestable resources, and dedicated cluster license servers (where the cluster nodes are the only possible consumers of license tokens) can be easily handled via user requestable, consumable resources. Search the Grid Engine or N1GE documentation for the terms "requestable resources" and "consumable resources" for specific implementation details.
This document is for people who need to handle multiple license servers scattered throughout an enterprise, or instances where the Grid Engine cluster is not the sole consumer of license tokens. It describes a method of making the Grid Engine scheduler and resource allocation policy mechanisms aware of the ever-changing license availability data.
Readers are encouraged to follow and read the links located in the Primary References section. There is a significant amount of intro and background material contained in those links that is not duplicated here.
Mark Olesen, posting on the Grid Engine users list, made some very interesting comments about FlexLM license integration when Grid Engine load sensors are used for resource reporting. In particular, he proposed a method by which the entire load sensor process could be bypassed altogether in favor of methods that directly adjust the value of user requestable consumable resources within the Grid Engine complex.
Mark also mentioned that he was willing to share his code but that he did not have the time to handle user questions or any sort of support related duties. Chris Dagdigian volunteered to provide support via the http://gridengine.info site and the firstname.lastname@example.org mailing list.
Primary References
- Configuration notes for the qlicserver
- A presentation from the 2007 workshop describing the problem and solution
- The original document from Mark Olesen describing his solution to the FlexLM integration problems
- 'Cluster Tricks: Grid Engine License Juggling' (non technical introduction)
- 'Simple integration of FLEXlm-licensed applications into Grid Engine managed clusters'
Additional Reference Material
- 'N1GE 6 Administration Guide: Chapter 3 - Configuring Complex Resource Attributes'
- Grid Engine HOWTO on 'Consumables'
- Grid Engine HOWTO on 'Setting up a load sensor in Grid Engine'
- X-Formation License Statistics - A low-cost solution for FLEXlm/FLEXNet license monitoring that generates real-time HTML pages and graphs (including rrdtool graphs) and can email system administrators when various license conditions occur.
The Proposed Methods
Problems With The Load Sensor Approach
An obvious problem with the load sensor approach is the delay associated with the load reports, as mentioned in the online documentation (see http://gridengine.sunsource.net/project/gridengine/howto/resource_management.html):
Unfortunately, due to the loadsensor's delay, it can't be 100% excluded that batch jobs are dispatched and started although the license has been aquired by an interactive job.
The problem is actually much more serious than suggested by this warning! A race condition between a GridEngine job and an interactive job is less problematic than what actually occurs.
In the following examples, we'll examine how the licenses are managed with different approaches. For the sake of clarity, two pseudo-variables, internal_count and available, have been introduced to reflect the current internal GridEngine state. The other variables, complex_values and load_values, are retrieved via qconf -se global.
The Pure Load Sensor Approach
Here the complex_values are left as NONE. The license availability is managed exclusively over the load sensor. This combination has the interesting side-effect that the internal bookkeeping is not used.
Consider the following:
Start: all licenses are available
load_values       license=4
complex_values    NONE
(internal_count)  NONE
(available)       license=4
Then: launch X jobs, each with -l license=4
Since all nodes provide resource license=4 and there is no internal bookkeeping to track the consumption of the resource, all jobs attempt to start at the same scheduling interval. Only one job wins the race and the others fail with licensing problems.
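The race described above can be sketched in a few lines of Python. This is a toy model, not GridEngine code: the function name dispatch_without_bookkeeping is made up for illustration, and the scheduler is reduced to "compare each request against the last load sensor value".

```python
def dispatch_without_bookkeeping(load_value, requests):
    """Return the requests dispatched in one scheduling interval.

    With complex_values left as NONE there is no internal count, so every
    job is compared against the same load sensor value (toy model).
    """
    return [request for request in requests if request <= load_value]


# Three jobs, each asking for 4 licenses, while the load sensor reports 4 free.
dispatched = dispatch_without_bookkeeping(load_value=4, requests=[4, 4, 4])
licenses_needed = sum(dispatched)
print(f"{len(dispatched)} jobs dispatched, {licenses_needed} licenses needed, 4 exist")
```

All three jobs are dispatched because nothing is debited between comparisons, which is the situation in which only one job can actually obtain its licenses.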
A Combined Internal and Load Sensor Approach
Here the complex_values are set to the number of licenses available. GridEngine decides on availability from two numbers: complex_values minus (internal_count), and load_values. The lower of the two dictates the availability, as mentioned in the online documentation (see http://gridengine.sunsource.net/project/gridengine/howto/loadsensor.html):
The lesser of the Consumable Resources or the load sensor value will be used to prevent license oversubscription.
Consider the following:
Start: all licenses are available
load_values       license=4
complex_values    license=4
(internal_count)  NONE
(available)       license=4
Then: launch two jobs, each with -l license=4:
Since complex_values exist, the internal bookkeeping is used to track license availability and only one job is dispatched:
load_values       license=4
complex_values    license=4
(internal_count)  license=4
(available)       license=0
After some delay, the load sensor will catch up to the current status.
load_values       license=0
complex_values    license=4
(internal_count)  license=4
(available)       license=0
When the first job finishes, the internal count is decremented.
load_values       license=0
complex_values    license=4
(internal_count)  license=0
(available)       license=0
After some delay, the load sensor will catch up to the current status and the second job can start.
load_values       license=4
complex_values    license=4
(internal_count)  license=4
(available)       license=4
Despite some delays associated with the load sensor, only a single job is started and this approach seems to be behaving as expected. However, the bookkeeping becomes less robust when non-GridEngine usage is tracked too!
Start: all licenses are available
load_values       license=4
complex_values    license=4
(internal_count)  NONE
(available)       license=4
Then: start a non-GridEngine job that occupies 2 licenses
After a delay, the load sensor reports that only two licenses are available.
load_values       license=2
complex_values    license=4
(internal_count)  NONE
(available)       license=2
Then: launch two jobs via the GridEngine, each with -l license=2:
Since there are only 2 licenses available, and internal bookkeeping tracks the resource consumption, only one job is started at the first scheduling interval. The internal count is incremented accordingly:
load_values       license=2
complex_values    license=4
(internal_count)  license=2
(available)       license=2
At the next scheduling interval, there are still 2 licenses available (the lower limit of the internal bookkeeping and the external load report) and the second job will be started. This job will fail with licensing problems.
It is obvious from the above examples that these approaches cannot work correctly with a mixed license usage.
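The mixed-usage failure can be reproduced with a small Python model. It is a sketch under stated assumptions: the function names are invented for illustration, one job is considered per scheduling interval, and the load sensor value stays at 2 for both intervals because it has not yet been refreshed.

```python
def available(complex_value, internal_count, load_value):
    """Lesser of (complex_values - internal_count) and load_values."""
    return min(complex_value - internal_count, load_value)


def simulate_stale_load(complex_value, load_value, requests):
    """Consider one job per scheduling interval with an unrefreshed load value."""
    internal_count = 0
    started = []
    for request in requests:
        if request <= available(complex_value, internal_count, load_value):
            internal_count += request  # internal bookkeeping debits the job
            started.append(request)
    return started


total_licenses = 4
external_in_use = 2                             # the non-GridEngine job
load_value = total_licenses - external_in_use   # what the load sensor reports
started = simulate_stale_load(complex_value=4, load_value=load_value, requests=[2, 2])
truly_free = total_licenses - external_in_use
print("started:", started, "licenses truly free:", truly_free)
```

Both jobs are started even though only 2 licenses are really free: after the first job, complex_values minus the internal count is 4 - 2 = 2 and the load value is still 2, so the second request is granted as well.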
A Proposed Solution
The only obvious solution to the problem is to change the load sensor so that it does not report any values at all, but instead adjusts the complex_values directly.
Start: all licenses are available
load_values       NONE
complex_values    license=4
(internal_count)  NONE
(available)       license=4
Then: start a non-GridEngine job that occupies 2 licenses:
After a delay, the load sensor adjusts the number of licenses available for the GridEngine.
load_values       NONE
complex_values    license=2
(internal_count)  NONE
(available)       license=2
Apart from the delay inherent in the load sensor approach, there is no internal race condition, and we've thus eliminated the significant failings of the previous approaches. Small problems still exist, but at least the worst ones have been addressed.
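A minimal Python sketch of the same scenario under the proposed approach follows. The names adjusted_complex_value and dispatch_with_adjusted_limit are hypothetical, and the rule "complex_values = total minus external usage" is inferred from the example above rather than taken from the qlicserver source.

```python
def adjusted_complex_value(total, external_in_use):
    """Licenses GridEngine may hand out once external usage is subtracted."""
    return max(total - external_in_use, 0)


def dispatch_with_adjusted_limit(complex_value, requests):
    """Internal bookkeeping is the only gate; no load value is consulted."""
    internal_count = 0
    started = []
    for request in requests:
        if request <= complex_value - internal_count:
            internal_count += request
            started.append(request)
    return started


complex_value = adjusted_complex_value(total=4, external_in_use=2)
started_jobs = dispatch_with_adjusted_limit(complex_value, requests=[2, 2])
print("complex_values license=%d, started: %s" % (complex_value, started_jobs))
```

With the limit itself lowered to 2, the first job consumes the whole allowance and the second job stays queued instead of failing at the license server.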
Understand The License Terms
Mark Olesen has released his code under a Creative Commons license:
QUICK SUMMARY
You are free:
* to copy, distribute, display, and utilize the work
* to make derivative works
Under the following conditions:
* You must attribute the work and leave copyrights intact.
* You may not use this work for commercial purposes (i.e., sell it) -- unless you get the licensor's permission.
* If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For the full license terms, visit http://creativecommons.org/licenses/by-nc-sa/2.5/
Please note that this code is provided as a courtesy to other users with absolutely no guarantees! Usage questions should be posted to the email@example.com mailing list - please do not email the author directly.
- The newest qlicserver version can be found at http://olesenm.github.com/flex-grid/ along with the revision history. See the "Obtain the software" section. The tar archive unpacks into the following file structure; the most important file is 'qlicserver':
flex-grid/
flex-grid/COPYING
flex-grid/README.txt
flex-grid/scripts/filter-accounting
flex-grid/scripts/qlic
flex-grid/site/qlicserver
flex-grid/site/qloadsensor
flex-grid/site/qlicserver.limit.EXAMPLE/foam
flex-grid/site/qlicserver.limit.EXAMPLE/gtpower
flex-grid/site/qlicserver.limit.EXAMPLE/starcd
flex-grid/site/qlicserver.config.EXAMPLE
...
The utility program 'qlic' lists the contents of the qlicserver license data cache file with some pretty printing. Example output:
host = dealog01
age  = 0:00:47
feature  total  limit  extern  intern  wait  free
-------  -----  -----  ------  ------  ----  ----
abaqus      12      .       5       5    10     2
foam        40      .       *      16     .    24
gtise        5      .       2       *     .     3
gtpower      8      4       .       -     .     4
hexa         2      .       .       *     .     2
nastran      3      .       .       3     .     .
...
Non-managed licenses are marked with '*'.
usage: qlic [OPTION]
       qlic [OPTION] resource=limit .. resource=limit
with options:
  -a       combined with '-u' = include active jobs
  -c FILE  alternative file to parse
  -C FILE  alternative location for the license limit
  -d       dump license cache in raw format
  -D       dump license cache in perl format
  -f       display free licenses only
  -l       list license limit
  -q       display free licenses via qhost query
  -u       license usage via 'lacct'
  -U       license usage per user via 'lacct -u'
  -w       show who/where ('' indicates waiting jobs)
  -h       this help
* extract / display information for the GridEngine license cache
      /opt/grid/default/site/cache/qlicserver.xml
* adjust / display information for the license limits
      /opt/grid/default/site/qlicserver.limits
copyright (c) 2003-09 <Mark.Olesen@emconTechnologies.com>
Licensed and distributed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 License.
http://creativecommons.org/licenses/by-nc-sa/3.0
Installing and configuring 'qlicserver'
Step by Step
This step-by-step guide uses FlexLM-licensed computational chemistry software produced by Schrödinger as its example. It is difficult to write about FlexLM license integration without actually having licensed applications on hand to aid in testing and troubleshooting. The authors of this document would like to acknowledge and thank Schrödinger for providing software and a range of floating license tokens.
Schrödinger "Glide" software is interesting in how it is licensed. Each instance of a Glide application running on a cluster node will consume 4x "IMPACT_GLIDE" license tokens and 1x "IMPACT_MAIN" token. This means that in order to integrate Glide into a Grid Engine cluster it is necessary to make Grid Engine constantly aware of the number of available "IMPACT_GLIDE" and "IMPACT_MAIN" resources.
License servers can be queried with vendor-supplied tools. In this case, the output of the Schrödinger vendor license server looks like this:
$ /opt/schrodinger/licadmin STAT
lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
Flexible License Manager status on Tue 11/15/2005 19:33
License server status: 27000@gw
    License file(s) on gw: /opt/schrodinger/license:
        gw: license server UP (MASTER) v9.5
Vendor daemon status (on gw):
    SCHROD: UP v9.5
Feature usage info:
Users of IMPACT_GLIDE: (Total of 40 licenses issued; Total of 0 licenses in use)
Users of MAESTRO_MAIN: (Total of 1000 licenses issued; Total of 0 licenses in use)
Users of MMLIBS: (Total of 1000 licenses issued; Total of 0 licenses in use)
Users of IMPACT_MAIN: (Total of 20 licenses issued; Total of 0 licenses in use)
In this case we are only interested in the status of the IMPACT_GLIDE and IMPACT_MAIN tokens.
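To illustrate what "being interested in" two tokens means in practice, here is a small Python sketch that pulls the issued and in-use counts out of lmstat-style text. It is not how qlicserver (a Perl script) does it; the regular expression and the function name parse_feature_usage are assumptions based only on the sample output shown above.

```python
import re

# Matches lines such as:
#   Users of IMPACT_GLIDE: (Total of 40 licenses issued; Total of 0 licenses in use)
FEATURE_LINE = re.compile(
    r"Users of (?P<feature>[^:\s]+):\s*"
    r"\(Total of (?P<issued>\d+) licenses? issued;\s*"
    r"Total of (?P<in_use>\d+) licenses? in use\)"
)


def parse_feature_usage(lmstat_text):
    """Map each feature name to its issued, in-use and free token counts."""
    usage = {}
    for match in FEATURE_LINE.finditer(lmstat_text):
        issued = int(match.group("issued"))
        in_use = int(match.group("in_use"))
        usage[match.group("feature")] = {
            "issued": issued,
            "in_use": in_use,
            "free": issued - in_use,
        }
    return usage


SAMPLE = """\
Users of IMPACT_GLIDE: (Total of 40 licenses issued; Total of 0 licenses in use)
Users of MAESTRO_MAIN: (Total of 1000 licenses issued; Total of 0 licenses in use)
Users of MMLIBS: (Total of 1000 licenses issued; Total of 0 licenses in use)
Users of IMPACT_MAIN: (Total of 20 licenses issued; Total of 0 licenses in use)
"""

usage = parse_feature_usage(SAMPLE)
wanted = {name: usage[name]["free"] for name in ("IMPACT_GLIDE", "IMPACT_MAIN")}
print(wanted)
```

Only the two features of interest are kept at the end; the other features are parsed but ignored.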
Obtain the software
$ wget http://github.com/olesenm/flex-grid/tarball/master
...
2009-10-13 09:45:50 (101 KB/s) - 'olesenm-flex-grid-3e55c61.tar.gz' saved [27772/27772]
$ zcat olesenm-flex-grid-3e55c61.tar.gz | tar xvf -
Since we grabbed a tarball from github, the repository owner is prefixed to the directory name and the commit hash is appended to the end. We'll simply rename the directory to get rid of those:
$ mv olesenm-flex-grid-3e55c61 flex-grid
In this example, the SGE admin username is 'sgeadmin'. The qlicserver application will be installed into /home/sgeadmin/qlicserver.
When run without an "output" parameter, qlicserver will not generate any output. At least for testing and troubleshooting, the output should be sent to a file or to STDOUT via the "output" parameter. For example, to send it to STDOUT:
$ cd flex-grid/site
$ ./qlicserver -n output=-
Ready for testing
At this point, the following steps have been completed:
1. Verified that the command "lmutil lmstat -a -c $LM_LICENSE_FILE" works from the command line
2. Downloaded qlicserver and configured it to monitor the appropriate license tokens. The command "./qlicserver -i" can be helpful here.
3. Used the command "./qlicserver -c" to generate the text necessary to register the new consumable resource attributes within Grid Engine
4. Used the Grid Engine command "qconf -mc" and pasted in the text from step #3 to create the new consumable resource attributes
5. Verified via the Grid Engine command "qconf -sc" that the newly created complex attributes exist
6. Used the command "./qlicserver -C" to generate the SGE command that initializes the values of the new consumable resource attributes
7. Ran the Grid Engine command displayed by "./qlicserver -C" to set initial values for the newly created consumable resource attributes
8. Verified via the SGE command "qstat -F <attribute>" that there are now values associated with the license tracking attributes
If the above steps have been successfully completed, testing can begin!
Initial testing -- first query FlexLM to see the "real" status output
We want to see the current status data directly from FlexLM to learn what values qlicserver should be picking up.
$ lmutil lmstat -a -c $LM_LICENSE_FILE
lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.
Flexible License Manager status on Wed 11/16/2005 15:39
License server status: 27000@gw
    License file(s) on gw: /opt/schrodinger/license:
        gw: license server UP (MASTER) v9.5
Vendor daemon status (on gw):
    SCHROD: UP v9.5
Feature usage info:
Users of IMPACT_GLIDE: (Total of 40 licenses issued; Total of 0 licenses in use)
Users of MAESTRO_MAIN: (Total of 1000 licenses issued; Total of 0 licenses in use)
Users of MMLIBS: (Total of 1000 licenses issued; Total of 0 licenses in use)
Users of IMPACT_MAIN: (Total of 20 licenses issued; Total of 0 licenses in use)
Ok, no change here. The important information is that there are 40 IMPACT_GLIDE tokens available and 20 IMPACT_MAIN tokens available.
When we next run qlicserver, we expect that it should pick these values up and automatically adjust the Grid Engine resource attribute values correctly.
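As a quick sanity check on these numbers, the per-job token cost of Glide (4 IMPACT_GLIDE plus 1 IMPACT_MAIN per instance, as described earlier) determines how many Glide instances the current pool can support. The Python below is only arithmetic on the figures already given; the variable names are illustrative.

```python
# Tokens consumed by one running Glide instance (from the description above).
GLIDE_COST = {"IMPACT_GLIDE": 4, "IMPACT_MAIN": 1}

# Free tokens reported by lmstat in the sample output.
free_tokens = {"IMPACT_GLIDE": 40, "IMPACT_MAIN": 20}

# The scarcest token decides how many instances can run at once.
max_glide_jobs = min(free_tokens[name] // cost for name, cost in GLIDE_COST.items())
print("Glide instances that fit in the free pool:", max_glide_jobs)
```

Here IMPACT_GLIDE is the limiting token: 40 // 4 = 10 instances, while IMPACT_MAIN alone would allow 20.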
Run qlicserver for the first time using the "-n" switch
Running "./qlicserver -n" will prevent any Grid Engine configuration commands from being executed. Consider this a way to perform a "dry run" test. To get a report, the "output=-" parameter is used to write to STDOUT.
Here we go!
$ ./qlicserver -n output=-
<?xml version="1.0"?>
<?qlicserver date="2009-10-13T09:52:56"?>
<qlicserver releaseDate="2008-10-12">
  <!-- adjustment: qconf -mattr exechost complex_values i_glide=40,i_main=20 global -->
  <query>
    <cluster name="dcore" root="/opt/sge-6s2u1" cell="default"/>
    <host>dcore-amd.sonsorol.net</host>
    <user>sgeadmin</user>
    <time epoch="1132173792">2005-11-16T15:43:12-0500</time>
  </query>
  <parameters>
    <env name="SGE_ROOT">/opt/sge-6s2u1</env>
    <env name="LM_LICENSE_FILE">/opt/schrodinger/license</env>
    <param name="output">-</param>
  </parameters>
  <resources>
    <resource name="i_glide" served="IMPACT_GLIDE" total="40" free="40"/>
    <resource name="i_main" served="IMPACT_MAIN" total="20" free="20"/>
  </resources>
</qlicserver>
There are 3 important things to note about the output shown above:
- The qlicserver script has successfully queried FlexLM and obtained accurate license counts (i_glide=40, i_main=20)
- The qlicserver script has queried Grid Engine and realized that the current SGE values (i_main=0, i_glide=0) do not match what FlexLM is reporting
- The qlicserver script has generated the SGE command "qconf -mattr exechost complex_values i_glide=40,i_main=20 global" which will instantly revise the SGE i_main and i_glide parameters to reflect the data collected from the FlexLM license server. Because we launched the script with the "-n" switch, this action was not actually performed.
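The three observations above can be mimicked with a short Python sketch that reads a trimmed copy of the report, compares it with the current SGE values, and formats the adjustment command. This is an illustration only: the function names are made up, the choice of the "free" attribute is an assumption (in this idle sample it equals "total"), and the real qlicserver logic may derive the value differently.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the report shown above (only the <resources> part is kept).
REPORT = """<?xml version="1.0"?>
<qlicserver releaseDate="2008-10-12">
  <resources>
    <resource name="i_glide" served="IMPACT_GLIDE" total="40" free="40"/>
    <resource name="i_main" served="IMPACT_MAIN" total="20" free="20"/>
  </resources>
</qlicserver>"""


def reported_values(report_xml, attribute="free"):
    """Map each resource name to the chosen numeric attribute (assumed: free)."""
    root = ET.fromstring(report_xml)
    return {r.get("name"): int(r.get(attribute)) for r in root.iter("resource")}


def adjustment_command(reported, current):
    """Format a qconf call for the resources whose values differ, else None."""
    changed = {name: value for name, value in reported.items()
               if current.get(name) != value}
    if not changed:
        return None
    values = ",".join(f"{name}={value}" for name, value in changed.items())
    return f"qconf -mattr exechost complex_values {values} global"


current_sge = {"i_glide": 0, "i_main": 0}  # values before the first adjustment
command = adjustment_command(reported_values(REPORT), current_sge)
print(command)
```

The sketch only builds the command string and never executes it, which loosely mirrors the dry-run behaviour of the "-n" switch.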
The only things left to do now are:
- Run "qlicserver" persistently in daemon mode
- Start submitting licensed jobs
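For the job submission step, a licensed job has to request the consumable resources it will use. The Python sketch below formats such a request for a single Glide instance (4 i_glide and 1 i_main, following the token costs described earlier). The script name run_glide.sh is hypothetical, and the sketch only prints the command rather than running it.

```python
def qsub_command(script, resources):
    """Build a qsub argument list requesting the given consumable resources."""
    request = ",".join(f"{name}={count}" for name, count in resources.items())
    return ["qsub", "-l", request, script]


# One Glide instance consumes 4 IMPACT_GLIDE tokens and 1 IMPACT_MAIN token,
# tracked in Grid Engine as the i_glide and i_main attributes.
glide_job = qsub_command("run_glide.sh", {"i_glide": 4, "i_main": 1})
print(" ".join(glide_job))
```

Requesting the resources with "-l" is what lets the Grid Engine bookkeeping debit the license counts when the job is dispatched.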
Running the qlicserver daemon
At this point, the following steps have been completed:
- qlicserver has been configured
- Grid Engine license tracking resource attributes have been created, initialized and validated
- qlicserver has been tested in "dry run" mode via the "-n" switch
If the above steps worked, the qlicserver code can now be run as a persistent daemon with the "-d" switch. The default polling interval is 30 seconds and can be overridden with the "delay" parameter. The "timeout" parameter is not used for polling; it sets the timeout value for FlexLM server communication attempts.
In the following example, a 40 second polling interval is used:
$ ./qlicserver -d delay=40
$ ps ax | grep qlicserver
14546 ? Ss 0:00 /usr/bin/perl -w ./qlicserver -d delay=40
Running qlicserver this way will not create any output. To create a report, especially useful for initial testing and troubleshooting, use the "output" parameter. When testing is complete, qlicserver can be stopped and restarted with the final parameters.
Caveat: the options must appear before the command-line parameters (e.g., 'qlicserver delay=40 -d' is wrong). The correct order is:
$ ./qlicserver -d delay=40
Stopping the qlicserver daemon
The following command can be used to stop qlicserver when it is operating in daemon mode:
$ ./qlicserver -k
Assistance & Help
Please direct questions to the firstname.lastname@example.org mailing list or if necessary directly to Chris Dagdigian (<email@example.com>).
Please feel free to improve upon this Wiki page as well!