SGE-Hedeby-And-Amazon-EC2

From GridWiki

About this document

The purpose of this page is to document a recent "proof of concept" project conducted in order to show how Sun Service Domain Management ("SDM") tools (aka "Project Hedeby", http://hedeby.sunsource.net) can be used in cooperation with the Hedeby Grid Engine Service Adapter (http://hedeby-ge-adapter.sunsource.net/) and the new Sun Grid Engine 6.2 (with embedded JVM and JMX API features) to automatically harness additional resources from within the Amazon Elastic Compute Cloud ("EC2").

Project Goals:

  1. Set up a single dedicated internet server acting as both a Hedeby and Grid Engine 6.2 "master" node
  2. Construct an Amazon EC2 AMI machine image capable of automatically configuring itself as a Hedeby "managed node" and automatically registering with the "spare_pool" service on the Hedeby master
  3. Show how Hedeby can pull nodes from the spare_pool and reprovision them without human intervention into Grid Engine execution hosts according to configured Service Level Objectives (SLO's)

The end result is a demonstration of how Hedeby Service Domain Management tools and Grid Engine 6.2 can be configured to automatically harness on-demand server resources within the Amazon cloud computing infrastructure.

Scope Constraints

  • Amazon Elastic Compute Cloud (EC2) - This document assumes familiarity with Amazon Web Services, especially with EC2 and how custom server images are created, managed and controlled. If you want to become familiar with EC2, point a web browser at http://aws.amazon.com. Amazon has published an excellent step-by-step "Getting Started" guide, which is available by following the links from Developer Connection -> Resource Center -> Elastic Compute Cloud and finally clicking on the yellow "Getting Started" button. This direct link may help: http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=84
  • Amazon Machine Image ("AMI") - The custom CentOS Linux AMI used for this project cannot be made public as it contains hard-coded references to our dedicated master server as well as security credentials associated with our personal Amazon AWS account. Rather than providing a public AMI image, this document should allow individuals to construct their own customized Hedeby-aware AMI image for use within EC2.


Results

Video Screencasts

Demonstration - 6 EC2 Servers automatically joining Hedeby resource provider

This is a short video demonstrating how a single launch command causes 6 EC2-based servers to be provisioned, booted and automatically joined with the Hedeby resource provider:

short-demo-cap.png


Behind The Scenes

This screen recording shows the process of automating the transformation of a newly launched EC2 AMI into a "Hedeby Managed Host":

auto-provisioning-details-cap.png

HowTo: Setup the Grid Engine 6.2 Master

The dedicated, internet-accessible 'master' for this project is a host known as cloudseeder.bioteam.net.

There is nothing particularly special about the Grid Engine 6.2 installation on this host except for the fact that it was explicitly installed with the Java JMX features enabled. This feature is not enabled by default and must be selected at install time. The procedure is well documented at http://wikis.sun.com/display/GridEngine/Installing+a+JMX-Enabled+System. The only other item of note is that the Certificate Authority created in /var/sgeCA needs to be replicated into the Amazon AMI machine image, since the remote cloud nodes will not be NFS-mounting any filesystems from the dedicated master.

Key configuration parameters used:

  • Admin user 'sgeadmin'
  • SGE root is /opt/sge/
  • Cluster name is "p6444"
  • Classic spooling selected
  • SGE JMX JGDI port: 6443
  • SGE Qmaster port: 6444
  • SGE Execd port: 6445

Example sgeCA layout:

[root@cloudseeder common]# ls -l /var/sgeCA/port6444/default/
total 8
-rw-r--r--  1 root     root        0 Jul 10 12:19 lock
drwx------  2 sgeadmin sgeadmin 4096 Jul 10 12:19 private
drwxr-xr-x  4 sgeadmin sgeadmin 4096 Jul 10 12:54 userkeys
[root@cloudseeder common]# ls -l /var/sgeCA/port6444/default/userkeys/
total 8
drwxr-xr-x  2 root     sgeadmin 4096 Jul 10 12:52 root
drwxr-xr-x  2 sgeadmin sgeadmin 4096 Jul 10 12:54 sgeadmin
[root@cloudseeder common]# ls -l /var/sgeCA/port6444/default/userkeys/sgeadmin/
total 20
-rwxr-xr-x  1 sgeadmin sgeadmin 1432 Jul 10 12:54 cert.pem
-rwxr-xr-x  1 sgeadmin sgeadmin  887 Jul 10 12:54 key.pem
-rwxr-xr-x  1 sgeadmin root     2792 Jul 10 12:54 keystore
-rwxr-xr-x  1 sgeadmin sgeadmin 1024 Jul 10 12:54 rand.seed
-rwxr-xr-x  1 sgeadmin sgeadmin  777 Jul 10 12:54 req.pem
[root@cloudseeder common]#

After successful installation, SGE shows a single active system:

[root@cloudseeder ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@cloudseeder.bioteam.net  BIP   0/0/2          0.00     lx24-x86      
[root@cloudseeder ~]#


HowTo: Setup the Hedeby SDM Master

Initial Install

After creating a user account named 'hedeby' and setting the environment variable SDM_SYSTEM=hedeby1, the following command was run as the root user to install the Hedeby master host software:

sdmadm -s hedeby1 -p system install_master_host       \
-ca_admin_mail chris\@bioteam.net -ca_org bioteam.net  \
-ca_org_unit bioteam-labs -ca_country US -au hedeby    \
-sge_root /opt/sge/ -ca_location caLOC -cs_port 6446   \
-ca_state MA

This screen recording shows the installation of the Hedeby Master host software:

master-host-install-cap.png

Starting the Java VMs

After initial install, the following command was used to start the JVMs:

# sdmadm suj

Configuring static TCP ports for Hedeby communication

In order to support a tighter network security and firewall access profile within Amazon EC2, Hedeby was configured to use specific source TCP ports. This was done via the following command:

# sdmadm mgc

The goal is to hard-code the use of specific TCP ports so that firewall rules can be tightened:

  • port 6446 - Hedeby CS JVM
  • port 6447 - Hedeby Executor JVM
  • port 6448 - Hedeby Resource Provider JVM

Using the editor, our Hedeby system was configured with the following information; the only parameters changed from the defaults were the port values:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<common:global name="hedeby1"
               xmlns:reporter="http://hedeby.sunsource.net/hedeby-reporter"
               xmlns:common="http://hedeby.sunsource.net/hedeby-common"
               xmlns:executor="http://hedeby.sunsource.net/hedeby-executor"
               xmlns:resource_provider="http://hedeby.sunsource.net/hedeby-resource-provider"
               xmlns:ge_adapter="http://hedeby.sunsource.net/hedeby-gridengine-adapter"
               xmlns:security="http://hedeby.sunsource.net/hedeby-security">
    <common:jvm port="6446"
                user="hedeby"
                name="cs_vm">
        <common:jvmArg>-Xmx128M</common:jvmArg>
    </common:jvm>
    <common:jvm port="6447"
                user="root"
                name="executor_vm">
        <common:component xsi:type="common:MultiComponent"
                          autostart="true"
                          classname="com.sun.grid.grm.executor.impl.ExecutorImpl"
                          name="executor"
                          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <common:hosts>
                <common:include>.*</common:include>
            </common:hosts>
        </common:component>
        <common:component xsi:type="common:Singleton"
                          host="cloudseeder.bioteam.net"
                          autostart="true"
                          classname="com.sun.grid.grm.security.ca.impl.GrmCAServiceDelegate"
                          name="ca"
                          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
        <common:jvmArg>-Xmx32M</common:jvmArg>
    </common:jvm>
    <common:jvm port="6448"
                user="hedeby"
                name="rp_vm">
        <common:component xsi:type="common:Singleton"
                          host="cloudseeder.bioteam.net"
                          autostart="true"
                          classname="com.sun.grid.grm.resource.impl.ResourceProviderImpl"
                          name="resource_provider"
                          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
        <common:component xsi:type="common:Singleton"
                          host="cloudseeder.bioteam.net"
                          autostart="true"
                          classname="com.sun.grid.grm.reporting.impl.ReporterImpl"
                          name="reporter"
                          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
        <common:component xsi:type="common:Singleton"
                          host="cloudseeder.bioteam.net"
                          autostart="true"
                          classname="com.sun.grid.grm.sparepool.SparePoolServiceImpl"
                          name="spare_pool"
                          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
        <common:component xsi:type="common:Singleton"
                          host="cloudseeder.bioteam.net"
                          classname="com.sun.grid.grm.service.impl.ge.GEServiceDelegate"
                          name="ge"
                          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
        <common:jvmArg>-Xmx128M</common:jvmArg>
    </common:jvm>
</common:global>
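
With the JVM ports pinned to fixed values, the inbound firewall and EC2 security group rules can be restricted to just the SGE and Hedeby ports. The commands below are a minimal sketch assuming the legacy Amazon EC2 API tools (ec2-api-tools) and the "default" security group; adjust the group name and source range for your own environment:

## Sketch only: allow the SGE (6443-6445) and Hedeby (6446-6448) TCP ports
## into the EC2 "default" security group. The source range shown is the
## dedicated master (cloudseeder.bioteam.net); widen it if nodes must also
## reach each other on these ports.
for port in 6443 6444 6445 6446 6447 6448; do
    ec2-authorize default -P tcp -p $port -s 66.92.69.21/32
done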

Allowing root to run Hedeby admin commands

As installed, the root user can only start the Hedeby JVMs; administrative privileges are restricted to the hedeby user account. It is more convenient if root is also allowed to run Hedeby commands, so we followed the directions published online at http://blogs.sun.com/rhierlmeier/entry/howto_make_root_to_an

The process boiled down to issuing the following commands as user 'hedeby':

$ sdmadm add_admin_user -au root
$ sdmadm -ppw add_admin_user_cert -e root@cloudseeder.bioteam.net -au root
$ sdmadm -ppw update_keystore -n root -t use

Installing the Grid Engine Hedeby Service Adapter

The following command was run on the Hedeby master to activate the "GE Service":

sdmadm add_ge_service -s ge -h cloudseeder.bioteam.net -j rp_vm -start

It is vitally important that the GE Adapter be configured correctly with the proper information to match the Grid Engine 6.2 installation. The configuration used for this project was:

<ge_adapter:connection keystore="/var/sgeCA/port6444/default/userkeys/sgeadmin/keystore"
                           password=""
                           username="root"
                           jmxPort="6443"
                           execdPort="6445"
                           masterPort="6444"
                           cell="default"
                           root="/opt/sge"
                           clusterName="p6444"/>

  <ge_adapter:sloUpdateInterval unit="minutes"
                                  value="5"/>

  <ge_adapter:execd adminUsername="root"
                      defaultDomain=""
                      ignoreFQDN="false"
                      rcScript="true"
                      adminHost="true"
                      submitHost="true"
                      cleanupDefault="true"/>

Originally we had configured the execd adminUsername to be "sgeadmin" to match our actual Grid Engine installation. That resulted in permission errors that prevented the automatic provisioning. These errors are discussed in the Hedeby mailing list thread http://hedeby.sunsource.net/servlets/BrowseList?list=users&by=thread&from=47203 -- until clarification is received, it is strongly recommended that the 'root' user be used within the GE Service Adapter configuration.


HowTo: Testing the Hedeby Master

Before beginning work with Amazon EC2, make sure the SGE and Hedeby master system is functioning properly. Any unusual errors or JVM states should be investigated and resolved before proceeding.

Grid Engine

[root@cloudseeder ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@cloudseeder.bioteam.net  BIP   0/0/2          0.00     lx24-x86      
[root@cloudseeder ~]# 
[root@cloudseeder ~]# qrsh /bin/hostname
cloudseeder.bioteam.net

[root@cloudseeder ~]#

Hedeby JVMs

[root@cloudseeder ~]# sdmadm show_jvm
name        host                    state      used_mem  max_mem   message                                 
-------------------------------------------------------------------------------
cs_vm       cloudseeder.bioteam.net STARTED           7M      123M                                         
executor_vm cloudseeder.bioteam.net STARTED           4M       30M                                         
rp_vm       cloudseeder.bioteam.net STARTED          10M      123M                                         
[root@cloudseeder ~]#

Hedeby Services

[root@cloudseeder ~]# sdmadm show_service
host                    service    cstate  sstate 
--------------------------------------------------
cloudseeder.bioteam.net ge         STARTED RUNNING
                        spare_pool STARTED RUNNING
[root@cloudseeder ~]#

Hedeby Components

[root@cloudseeder ~]# sdmadm show_component
host                    jvm         component         type             state       
-----------------------------------------------------------------------------------
cloudseeder.bioteam.net executor_vm ca                Other            STARTED     
                                    executor          Executor         STARTED     
                        rp_vm       ge                Service          STARTED     
                                    reporter          Other            STARTED     
                                    resource_provider ResourceProvider STARTED     
                                    spare_pool        Service          STARTED     
[root@cloudseeder ~]# 

Hedeby Resources

[root@cloudseeder ~]# sdmadm show_resource
service id                      state    type flags usage annotation            
--------------------------------------------------------------------------------
ge      cloudseeder.bioteam.net ASSIGNED host S     50    Got execd update event
[root@cloudseeder ~]#


This short video shows how to manually add a remote host resource to the system:

managed-host-install-manual-cap.png

HowTo: Preparing the Amazon Machine Image (AMI)

Initial AMI Prep

Rather than building a server image from scratch for use within EC2, we started with a publicly available CentOS 5 server image created by RightScale (www.rightscale.com). The folks at RightScale not only make public server images available, they also fully publish and document their build and update scripts -- an incredibly useful resource for people learning how to follow in their footsteps. The RightScale CentOS 5 images (available in both 32-bit and 64-bit versions) can be launched by anyone and come with all of the latest Amazon EC2 utilities and management software pre-installed. They are an excellent springboard for organizations and individuals starting out with Linux-based EC2 systems.

The specific public AMI used for building our own private Hedeby/SGE system had AMI ID ami-08f41161 - this was used as a base for customizing, bundling and registering our own private AMI image containing the automatically deploying SDM and SGE software.

The RightScale public CentOS AMI does not ship with the Net::Amazon::EC2 perl module, a required dependency of some of our deployment scripts, so it was installed manually via the standard Perl CPAN utilities.
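
A minimal sketch of the install, assuming the instance has outbound internet access and a configured CPAN client:

## Install the Net::Amazon::EC2 module (and its dependencies) from CPAN
perl -MCPAN -e 'install Net::Amazon::EC2'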

User accounts

User accounts for "sgeadmin" and "hedeby" were manually created with the same UID and GID values found on cloudseeder.bioteam.net:

 sgeadmin:x:502:502::/home/sgeadmin:/bin/bash
 hedeby:x:3002:3002::/home/hedeby:/bin/bash
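
A minimal sketch of the account creation commands, using the UID/GID values shown above:

## Create groups and accounts matching the UID/GID values on the master
groupadd -g 502  sgeadmin
groupadd -g 3002 hedeby
useradd  -u 502  -g sgeadmin -m sgeadmin
useradd  -u 3002 -g hedeby   -m hedeby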

Replicate required files

The following files and directories were replicated from our Hedeby/SGE master via rsync-over-SSH to our initial Amazon EC2 development AMI:

 /var/sgeCA/ 	(SGE Certificate Authority)
 /opt/sdm/  	(Hedeby SDM software unpacked from tar.gz archive)
 /opt/sge/  	($SGE_ROOT)
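
A sketch of the rsync-over-SSH commands, run from the master and assuming root SSH access to the development instance ("ec2-public-hostname" is a placeholder):

## Run on cloudseeder.bioteam.net; "ec2-public-hostname" is a placeholder
rsync -avz -e ssh /var/sgeCA/ root@ec2-public-hostname:/var/sgeCA/
rsync -avz -e ssh /opt/sdm/   root@ec2-public-hostname:/opt/sdm/
rsync -avz -e ssh /opt/sge/   root@ec2-public-hostname:/opt/sge/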

In addition, individual certificate and keystore files were copied from the Hedeby master and "staged" into /opt/security-sdm/ so they could be referenced during the "managed node" SDM installation process.

Files copied to /opt/security-sdm on the AMI development image:

 /var/spool/sdm/hedeby1/security/ca/ca_top/cacert.pem
 /var/spool/sdm/hedeby1/security/ca/ca_top/hedeby.keystore
 /var/spool/sdm/hedeby1/security/ca/ca_top/root.keystore

HowTo: Deploying Hedeby "managed host" software on an EC2-based system

EC2 nodes instantiated and booted within the Amazon Web Services framework come online in a state that is generally unsuitable for error-free installation of both Grid Engine and the Hedeby SDM code. In the following sections the individual host configuration problems are broken down and example "fix it" scripts are provided.



Note: All scripts and code in this page are quick and dirty "prototype quality" and have certainly not been optimized or thoroughly vetted. Use at your own risk.



EC2 Fix 1: Hostname and hosts file

EC2 nodes are launched with the following characteristics:

  • Hostname is set to the private/unroutable name
  • The /etc/hosts file is essentially empty
  • DNS server and search path work only for internal EC2 hostnames

The "fix" in this case is pretty simple. A Perl script uses the Net::Amazon::EC2 module to query the EC2 environment in order to discern the instance's publicly assigned DNS hostname. Once the public hostname is determined from the EC2 instance reservation data, a DNS query to an external resolver provides the publicly accessible IP address for this system (all EC2 systems are behind a network gateway, but 1:1 NAT routing is performed, allowing each EC2 host to have a unique public IP address).

Knowing the public DNS hostname and IP address for our newly-provisioned EC2 node, we can simply alter the system hostname and manually recreate a more complete /etc/hosts file.

A script that performs this procedure is provided below:

#!/usr/bin/perl

use Net::Amazon::EC2;

##--------------------------------------------------------------
##  This must be set to something valid ...

my $MY_AWS_ACCESS_KEY_ID     = "";
my $MY_AWS_SECRET_ACCESS_KEY = "";
my $EXTERNAL_DNS_SERVER = "";
##--------------------------------------------------------------


##
## This is a rough script for making EC2 server changes necessary
## to allow a full automatic installation of SGE and Hedeby SDM
## to occur. Use at your own risk. 
##
## This is the summary of changes that are made:
##
## (1) Replace existing /etc/hosts with one that explicitly
##     lists the public and private hostnames of the EC2 instance
##
## (2) Change the hostname of the EC2 instance from the private name
##     to the public "...amazonaws.com" name
##
##     Chris Dagdigian / BioTeam / chris@bioteam.net
##  

my $hosttype = shift;


print "(########) 1 \n";

# Learn our own internal hostname (which is the AWS internal name)
my $hostname            = `/bin/hostname` ;
chomp $hostname;
$sh_hostname = $hostname;
$hostname .= ".ec2.internal";

my $reservation_hosts = {};
my $host_reservations = {};
#my $aws_access_key_id     = $ENV{'AWS_ACCESS_KEY_ID'};
#my $aws_secret_access_key = $ENV{'AWS_SECRET_ACCESS_KEY'};

my $ec2 = Net::Amazon::EC2->new(
                AWSAccessKeyId => $MY_AWS_ACCESS_KEY_ID,
                SecretAccessKey => $MY_AWS_SECRET_ACCESS_KEY
				);

print "(########) 2 \n";

my $instances = $ec2->describe_instances;

if (ref($instances) eq "Net::Amazon::EC2::Errors") {
    print "\nAWS ERROR!\n";
    print "request_id: $instances->{request_id}\n";
    foreach $error (@{$instances->{errors}}) {
	print "code: $error->{code}\n";
	print "message: $error->{message}\n";
    }
}


foreach my $reservation (@$instances) {
    foreach my $instance ($reservation->instances_set) {

	$instance_hostname = $instance->private_dns_name;
	$public_hostname   = $instance->dns_name;
	($sh_public_hostname,$junk)  = split(/\./,$public_hostname);

        ## Internal hostnames have different subdomains 
        ## (.ec2.internal vs. compute-1.internal etc) 
        ## ... so lets delete anything after the first "." char in the hostnames ...
	$instance_hostname =~ s/\.(.*)//g;
	$hostname =~ s/\.(.*)//g;

        #print "  Instance name:  $instance_hostname\n";
        #print "  DNS name     :  $hostname\n";
        #print "  Short public :  $sh_public_hostname \n";
	print "(########) Comparing:  $instance_hostname vs. $hostname \n";
	if ($hostname eq $instance_hostname) { 

                # If we get here then we are working with our own instance
	    print "My internal hostname: $hostname\n";
	    print "My instance hostname: $instance_hostname\n";
	    print "My public hostname  : $public_hostname\n";

            $dnsAnswer =  `host $public_hostname $EXTERNAL_DNS_SERVER`;
	    chomp($dnsAnswer);
	    ($junk,$dnsreply) = split(/has\saddress\s+/,$dnsAnswer);
	    print "My public IP        : $dnsreply\n";

	    $privateIPAnswer = `/usr/bin/host $sh_hostname`;
	    chomp($privateIPAnswer);
	    ($junk,$privateIP) = split(/has address\s+/,$privateIPAnswer);


	    print "---- System Changes to be made ... ---\n";
	    print "[1] Adding the following to /etc/hosts on this node:\n";
            print "\n\n\t## Hacked up entry to allow SDM managed host install\n";
	    print "\t$privateIP\t\t$public_hostname $hostname $sh_hostname\n\n";
	    print "\t66.92.69.21\t\t cloudseeder.bioteam.net\n";

	    open(FH,">/etc/hosts") or die "can't open write to /etc/hosts";
	    print FH "## Modified by utility script \n";
	    print FH "127.0.0.1 localhost localhost.localdomain\n"; 
            ## Proper naming ...
	    print FH "## Proper naming\n";
            print FH "$dnsreply\t\t$public_hostname $sh_public_hostname\n";
	    print FH "$privateIP\t\t$hostname $sh_hostname\n";
	    ## Tweaked naming
            print FH "#\n\n## Hacked up entry to allow SDM managed host install\n";
	    print FH "#$privateIP\t\t$public_hostname $hostname $sh_hostname\n\n";
	    print FH "66.92.69.21\t\t cloudseeder.bioteam.net\n";
	    close(FH);

	    print "\n[2] Will change hostname of this node to $public_hostname\n\n";
	    `echo $sh_hostname > /etc/sysconfig/original_hostname`;
	    `/bin/hostname $public_hostname`;

	    print "\n[3] Write our public IP address to /etc/sysconfig/EC2-public-IPaddr (for NIC fix later)\n\n";
	    `echo $dnsreply > /etc/sysconfig/EC2-public-IPaddr`;

	        
	}

    }
}


EC2 Fix 2: Network interfaces

When an EC2 system is provisioned, only a single network interface exists (eth0) -- this interface is configured with a private, un-routable IP address that works only within the Amazon Web Services Infrastructure.

For access from the outside world, Amazon also assigns a public IP address to the EC2 host but the host is not made aware of this address at all. A 1:1 mapping of public to private IP is performed by some sort of NAT gateway operating within the AWS infrastructure allowing communication from the outside world to reach the EC2 system on the private AWS network.

The fact that the EC2 node is unaware of its "public" IP address presents a fatal barrier to the operation of the Hedeby/SDM code. If the Java code binds to the "private" IP address, Hedeby SDM traffic can flow in only one direction (outward to the Hedeby master).

Changing the hostname of the EC2 system to the public hostname and adding an /etc/hosts entry with the new data is not sufficient. The Hedeby software will fail to install as the Java binding code will refuse to "bind to a non-local IP address ..."

One final step is needed after changing the EC2 system hostname and /etc/hosts file -- the creation of a new aliased network interface configured with the public IP address. The creation of a new "eth0:1" device configured with the public IP address will allow the Java JVMs to bind correctly without errors.

A script that performs this action automatically (using the IP address information gathered by the previous script) is provided below; it creates a new network interface named "eth0:1" and configures it with the public IP address assigned to the host by the AWS infrastructure:

#!/usr/bin/perl

## Get the EC2 public IP address recorded by the earlier hosts/DNS fix script
$PUBLIC_EC2_IP = `/bin/cat /etc/sysconfig/EC2-public-IPaddr`;
chomp($PUBLIC_EC2_IP);

## Global
$HEDEBY_SYSTEM = "hedeby1";
$HEDEBY_SDMADM = "/opt/sdm/bin/sdmadm";

## Take down NIC eth0:1 if it exists ..
system("/sbin/ifdown eth0:1");

## Open filehandle to overwrite the config for NIC device eth0:1

open($fh, "> /etc/sysconfig/network-scripts/ifcfg-eth0:1") || die "Can't open ifcfg-eth0:1,perl says: $!\n";

print $fh <<EOF;
ONBOOT=yes
DEVICE=eth0:1
BOOTPROTO=static
NETMASK=255.255.255.0
IPADDR=$PUBLIC_EC2_IP
EOF

close($fh);

print "-------------------------------------------------\n";
print "EC2 Public IP NIC Fix\n";
print "Creating alias NIC eth0:1 with Amazon assigned public IP address\n";
print "-------------------------------------------------\n";
system("cat /etc/sysconfig/network-scripts/ifcfg-eth0:1");

print "Bringing up new interface eth0:1 ...\n";
system("/sbin/ifup eth0:1");

print "Done.\n";


Installing Hedeby SDM "Managed Node"

Assuming all of the replicated files and data have been moved into the AMI image and the "fixes" discussed above have been implemented, it is very easy to install the Hedeby SDM software and create a "managed node".

This is a three step process:

Install the software:

/opt/sdm/bin/sdmadm -s hedeby1 -p system -keystore /opt/security-sdm/hedeby.keystore  \
-cacert /opt/security-sdm/cacert.pem  install_managed_host  -au hedeby -d /opt/sdm -l \
/var/spool/sdm/hedeby1 -cs_url cloudseeder.bioteam.net:6446

Configure Hedeby so that user 'root' can issue Hedeby administrative commands:

cp -f /opt/security-sdm/root.keystore /var/spool/sdm/hedeby1/security/users

Start up the Java JVMs:

/opt/sdm/bin/sdmadm -s hedeby1 suj

Automatically Registering with the Hedeby Master

Adding our new "managed host" to the set of resources available to our Hedeby resource provider is a simple matter of issuing the following command:

sdmadm add_resource

By default this command is "interactive" in that it requires human interaction via a text editor to plug in some required configuration parameters. A variant of the "add_resource" command can read in all of the required parameters from a file:

sdmadm ar -f ./path-to-template-file

A script that will automatically register the new host with the Hedeby master (no human interaction required) is provided below:

#!/usr/bin/perl

## Get the EC2 hostname (already reset to the public name by the earlier fix script)
$hostname = `/bin/hostname`;
chomp($hostname);

## Globals
$HEDEBY_SYSTEM = "hedeby1";
$HEDEBY_SDMADM = "/opt/sdm/bin/sdmadm";

## Template driven resource addition
use File::Temp qw/ tempfile tempdir /;
($fh, $filename) = tempfile();

print $fh <<EOF;
# Default values for host resource
#
resourceHostname = $hostname
static = false
# hardwareCpuArchitecture = <String, optional>
# hardwareCpuCount = <Integer, optional>
# hardwareCpuFrequency = <String, optional>
# operatingSystemName = <String, optional>
# operatingSystemPatchlevel = <String, optional>
# operatingSystemRelease = <String, optional>
# operatingSystemVendor = <String, optional>
# resourceIPAddress = <String, optional>
EOF

print "\nWill add this resource to Hedeby:\n";
print "  Resource Provider determines the service (spare_pool vs. GE service)";
print "-------------------------------------------------\n";
system("cat $filename");

## Here is where we add the resource ...
$HEDEBY_AR = "$HEDEBY_SDMADM -s $HEDEBY_SYSTEM ar -f $filename ";
system("$HEDEBY_AR");

## Debug
print "\n\n Listing Hedeby resources ...\n";
print " Command: sdmadm show_resource\n\n";
system("sdmadm sr");

Putting it all together (automating everything at first boot)

There are only two steps to automating all of the EC2 host prep and SDM installation/registration work. The end result is an AMI server image that will automatically "fix" problems, install the SDM managed node software and register as an available resource with our Hedeby master within minutes of being provisioned and launched.

The first step is to wrap all of our individual "fix" and "install" scripts into a single "do-everything" wrapper:

#!/bin/sh

export JAVA_HOME=/usr/java/default
export SDM_SYSTEM=hedeby1
export PATH=${PATH}:/opt/sdm/bin

echo "-----------------------------------------"
echo "Fixing hostname and /etc/hosts files ..."
/opt/scripts/00_pre-fix-Hosts-and-DNS.pl
echo "-----------------------------------------"

echo "-----------------------------------------"
echo "Creating aliased NIC device eth0:1 ..."
/opt/scripts/01_pre-add-IP-alias.pl
echo "-----------------------------------------"

echo ""
sleep 1

echo "Installing SDM software (managed host)"
echo "/opt/sdm/bin/sdmadm -s hedeby1 -p system  \
-keystore /opt/security-sdm/hedeby.keystore     \
-cacert /opt/security-sdm/cacert.pem  install_managed_host \
-au hedeby -d /opt/sdm -l /var/spool/sdm/hedeby1 -cs_url cloudseeder.bioteam.net:6446"

sleep 2

/opt/sdm/bin/sdmadm -s hedeby1 -p system     \
-keystore /opt/security-sdm/hedeby.keystore   \
-cacert /opt/security-sdm/cacert.pem  install_managed_host \
-au hedeby -d /opt/sdm -l /var/spool/sdm/hedeby1 -cs_url cloudseeder.bioteam.net:6446

## Add root keystore to install
echo "Copying root keystore so root can act as SDM admin ..."
echo "  cp -f /opt/security-sdm/root.keystore /var/spool/sdm/hedeby1/security/users"
sleep 1
cp -f /opt/security-sdm/root.keystore /var/spool/sdm/hedeby1/security/users

## Start up JVMs
echo "Trying to start up SDM JVMs ..."
echo "  /opt/sdm/bin/sdmadm -s hedeby1 suj"
sleep 1
/opt/sdm/bin/sdmadm -s hedeby1 suj


## Add this host to SDM
## We can't explicitly add to spare_pool, let Hedeby decide
## what service/component gets this new host ...

echo "-----------------------------------------"
echo "Registering this host with our SDM Resource Provider"
echo "-----------------------------------------"
/opt/scripts/add-ec2Host-to-resourceProvider.pl

The second step is the mechanism that launches the "do everything" script just after the server has been provisioned and launched. The method provided below is shell code appended to the bottom of the Linux /etc/rc.local file. Based on the presence or absence of a simple lockfile, the auto-installation process is either run or skipped:

## Sun SDM managed node auto deployment  appended into /etc/rc.local

if [ -e /var/db/.HedebySDMConfigDone ]; then
    echo "####################"
    echo "/var/db/.HedebySDMConfigDone exists."
    echo "HedebySDM Auto-deployment  skipped."
    echo "####################"
else
    echo "####################"
    echo "/var/db/.HedebySDMConfigDone does not exist."
    echo "HedebySDM Auto-deployment starting."

   
    echo "Hostname before:" >> /root/node-deploy-log.txt
    /bin/hostname > /root/original-hostname.txt
    /bin/hostname >> /root/node-deploy-log.txt

    /opt/scripts/deploy-managed-host.sh >> /root/node-deploy-log.txt 2> /root/node-error-deploy.txt

    echo "Hostname after:" >> /root/node-deploy-log.txt
    /bin/hostname >> /root/node-deploy-log.txt

    touch /var/db/.HedebySDMConfigDone
    echo "    /var/db/.HedebySDMConfigDone created"
    echo "HedebySDM auto-deployment complete."
    echo "####################"
fi

Next Steps

Further Optimizations

As seen in the video screencast (http://www.screencast.com/t/29zWx7rF) showing how 6 EC2 nodes can be trivially and automatically added to the resource provider, there was still one step a human had to initiate -- deciding how many EC2 servers to launch and issuing the command that launched them. Everything after the EC2 instance launch happens automatically.

It would be nice to automate this completely. One idea involves a 'watchdog' process that monitors the size of the SGE pending job queue. Based on internal logic about the workflow (and budget!), the watchdog can launch EC2 resources as soon as the pending list grows beyond a certain size. The same process can also be used to remove resources from Hedeby and terminate single EC2 machine instances, or even entire reservations, when the cloud resources are no longer needed.
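
A minimal sketch of such a watchdog is shown below, assuming the legacy EC2 API tools are available on the master; the AMI ID, keypair name and threshold values are placeholders, and a real implementation would also need budget logic and scale-down handling:

#!/bin/sh
## Watchdog sketch: launch extra EC2 nodes when the SGE pending list grows.
## The AMI ID, keypair and thresholds below are illustrative placeholders.

PENDING_LIMIT=20
LAUNCH_COUNT=2
AMI_ID=ami-xxxxxxxx
KEYPAIR=my-keypair

## Count pending ("qw") jobs in Grid Engine
PENDING=`qstat -s p | grep -c " qw "`

if [ "$PENDING" -gt "$PENDING_LIMIT" ]; then
    echo "$PENDING pending jobs (limit $PENDING_LIMIT) -- launching $LAUNCH_COUNT EC2 instances"
    ec2-run-instances $AMI_ID -n $LAUNCH_COUNT -k $KEYPAIR
fi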

What other optimizations are possible?

Michal has mentioned: I think this can be handled by creating a special Hedeby service adapter for EC2. It would contain an infinite number of "virtual host resources" that are shut down by default. When there is a need from the GE service, the resource provider will always find that the EC2 service adapter contains a resource matching the request, so it will take (remove) that resource from the EC2 service (which starts the EC2 machine instance) and assign it to the GE service. It can also work the other way around: the EC2 service adapter will ask for unused EC2 resources, and as soon as the GE service marks such a resource's usage below a certain level, the resource will be returned to the EC2 service, which will shut down the EC2 machine instance. The whole thing is pretty similar to our green IT idea.


Cleanly Unregistering Terminated/Stopped AMI Server Instances

One issue that has not been resolved (yet) is that it is somewhat difficult to automatically remove a resource from both Hedeby and Grid Engine if that host is no longer online and offering functional Hedeby Executor JVM access. This means that Amazon EC2 server systems that are abruptly terminated or stopped may linger in the Hedeby resource pool as unreachable resources in unfixable ERROR states. A simple solution would be a script that runs at system shutdown (or whenever the runlevel changes) to automatically unregister from the Hedeby master before the JVMs are stopped and the machine goes offline.
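
A rough sketch of such a shutdown hook is shown below. It assumes that sdmadm offers a remove_resource command that accepts the resource name via -r (mirroring add_resource); this has not been tested, so verify the exact option syntax against your SDM release before relying on it:

#!/bin/sh
## Shutdown hook sketch: unregister this EC2 node from the Hedeby master
## before the machine goes offline. The remove_resource option syntax is
## an assumption -- check "sdmadm remove_resource -help" on your release.

export SDM_SYSTEM=hedeby1
export PATH=${PATH}:/opt/sdm/bin

HOST=`/bin/hostname`

echo "Removing resource $HOST from the Hedeby resource provider ..."
sdmadm -s hedeby1 remove_resource -r $HOST

## The local SDM JVMs are then stopped by the normal shutdown sequence.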


Harshal says: Take the upper limit for the budget from the user (customer) in dollars per unit of time. That way you know you will not end up creating and starting instances without any cap on spending. Once you know the cap, say $100/minute, then depending on the queue we can start powering on AMIs one by one and check whether each new instance runs at full throttle; if so, launch another instance, and so on until we hit $100/minute (we already know Amazon's billing rates, so that should not be an issue).
