UpgradeToSGE62


Functional Specification Document


Upgrade Sun Grid Engine to version 6.2


Document Version     Comments                         Author                   Date       
0.1                  Initial document                 --Petr 08:21, 17 April 2008 (EDT)
0.2                  Functional Specification         --Petr 05:51, 21 April 2008 (EDT)
0.3                  Reviewed Specification           --Lubos   22 April 2008 (EDT)



Introduction

This project describes the process of upgrading Grid Engine 6.0fcs or later to Sun Grid Engine (SGE) 6.2.

Project Overview

Project Aim

The upgrade process is complex. Many customers decline to upgrade to the most recent version because the upgrade carries risks that may result in service unavailability.

The new upgrade process will offer several options, including an option to first test the upgrade on another cluster installation. This gives the customer time to solve any upgrade-related issues while the production cluster is still running. The upgrade to the most recent version will be possible from versions:

  • 6.0FCS or later

Project Benefit

Users will now have more options for performing the upgrade.

The new option Clone Cluster Configuration will install a new cluster in a new SGE_ROOT and clone the configuration of the old cluster. This way the new cluster can be tested before it goes into use.

Project Duration

This upgrade project should be finished by 30 May 2008.

Project Dependencies

Available         Supplier       Product/Project/Interface  Dependency

Upgrade process

Requirements

  • Support upgrade for 6.0fcs or later
  • also support TACC's distribution
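
The supported-version requirement can be expressed as a simple check. The sketch below is illustrative only. It assumes release strings of the form `6.0`, `6.0fcs`, `6.0u4` or `6.1u3`; the `u` update suffix, the pattern, and the function names are assumptions rather than part of this specification. It does not cover how TACC's distribution reports its version.

```python
import re

# Hypothetical release-string pattern: major.minor, optionally followed by
# "fcs" or an update suffix such as "u4". This format is an assumption.
_RELEASE = re.compile(r"^(\d+)\.(\d+)(?:fcs|u(\d+))?$", re.IGNORECASE)


def parse_release(text):
    """Return (major, minor, update) for a release string, or raise ValueError."""
    match = _RELEASE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {text!r}")
    major, minor, update = match.groups()
    return int(major), int(minor), int(update or 0)


def is_upgradable(text):
    """True if the release is 6.0fcs or later and older than 6.2."""
    release = parse_release(text)
    return (6, 0, 0) <= release < (6, 2, 0)
```

A check like this could run before any configuration is saved, so that an unsupported source version is rejected early.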

Constraints

  • Dynamic and static load values will be lost (static load values will be recreated after the new cluster starts)
  • Jobs and advanced reservations cannot be replicated to the upgraded cluster
  • Jobs may be running or pending when the configuration is saved. If the admin installs the new SGE version in the same SGE_ROOT directory, the old cluster must be drained of all jobs before it is shut down and replaced with the SGE 6.2 distribution.

Upgrade types

There are two main upgrade types.

  • Clone cluster configuration
  • Upgrade cluster

Clone Cluster Configuration

Cloning the configuration saves the current configuration of the running cluster to a set of configuration files. It then copies the configuration files to an upgrade location and upgrades them if needed.
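
The copy step described above can be sketched as follows. This is a minimal illustration, not the actual upgrade script: the function name `stage_backup` and the directory layout are hypothetical, and the sketch only covers copying the saved files to the upgrade location, not the format upgrade itself.

```python
import shutil
from pathlib import Path


def stage_backup(backup_dir, upgrade_dir):
    """Copy saved configuration files to the upgrade location.

    Returns the sorted relative paths of the copied files. The backup
    directory is left untouched so the upgrade can be repeated from it.
    """
    source = Path(backup_dir)
    target = Path(upgrade_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"backup directory not found: {source}")
    if target.exists():
        # Refuse to mix files from an earlier attempt with a new copy.
        raise FileExistsError(f"upgrade location already exists: {target}")
    shutil.copytree(source, target)
    return sorted(
        str(path.relative_to(target))
        for path in target.rglob("*")
        if path.is_file()
    )
```

Working on a copy keeps the files saved from the production cluster unchanged, which fits the goal of testing the upgrade while production keeps running.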

Additional Requirements

  • Upgrade to second cluster while the production cluster is running.
  • Ability to test the upgrade process

Additional Constraints

  • new SGE_ROOT, SGE_QMASTER_PORT, ... needed (new installation)
  • BDB server needs to be on a different host

Upgrade Cluster

Additional Requirements

  • No additional resources are required.
  • No special release patch needed

Additional Constraints

  • No jobs can be running
  • The cluster must be shut down
  • Special script for configuration backup bundled with upgrade DVD, or downloaded from website

Overall Block Diagram

Clone Cluster Configuration Block Diagram

  • There will be two clusters. One is the current Production Cluster and the other is a new Test Cluster
  • The Production Cluster is on the left (SGE v6.x)
  • The Test Cluster is on the right (SGE v6.2)
  • Both clusters coexist at the same time
  • The upgrade process by Clone Cluster Configuration will configure the Test Cluster according to the Production Cluster configuration
  • You can change the Test Cluster to the Production Cluster later (OPTIONAL), or run the upgrade on the Production Cluster once you know the upgrade will succeed.
  • ARCO must be reinstalled according to the ARCO manual. There is no ARCO upgrade script.
SGE v6.x    ---        CCC         ---->  SGE v6.2
  \            \                  /         \ 
  RF            \--- BCF-->UCF --/          RF
   \                                         \ 
   DBW                                     DBW v6.2
     \                                         \ 
     DB                                      DB v6.2
      \                                         \
      ARCO                                     ARCO v6.2


Legend

  • SGE v6.x ... Sun Grid Engine 6.0FCS+ version
  • CCC ... Clone Cluster Configuration
  • BCF ... Backup configuration files
  • UCF ... Upgraded configuration files
  • RF ... Reporting file
  • DBW ... DB Writer
  • DB ... Database
  • ARCO ... Accounting Reporting Console

Upgrade Cluster Block Diagram With a Sense of Time (time progresses from top to bottom)

  • There is a production cluster
  • The upgrade script saves the production cluster configuration
  • The production cluster must be stopped and uninstalled (except the DBwriter)
  • The new SGE v6.2 files are installed for the production cluster
  • The upgrade scripts in the new cluster will upgrade the saved configuration files
  • The upgrade scripts in the new cluster will load the upgraded configuration files
  • ARCO must be reinstalled according to the ARCO manual


SGE v6.x  ---> BCF
 \               \ 
  RF              BCF
   \                \ 
   DBW              UCF  ----------> SGE v6.2
     \                                 \ 
      \                                RF
       \                                 \   
        \                                DBW v 6.2   
         \                                 \ 
         DB    ----- ARCO upgrade --        \
           \                        \        \
          ARCO                       \        \
                                      \        \
                                       \        \
                                        \-----> DB v6.2
                                                  \
                                                  ARCO v6.2

Functional Definition

Operations

There are two main types of upgrade to SGE 6.2.

Operations for Clone Cluster Configuration Upgrade Type

  1. Install new binaries for the new 6.2 test cluster
  2. Start upgrade procedure on test cluster
    1. Create default configuration on test cluster
    2. Save configuration of the production cluster using the new backup script
    3. Ask new questions:
      1. Cluster name
      2. Enable SMF support on Solaris
      3. Enable Sun ServiceTags
      4. Enable IJS (interactive jobs support)
      5. Enable other features of the 6.2 version.
    4. Upgrade the saved configuration to 6.2 format
      1. Some default values have changed
      2. Jobs and advanced reservations will be lost
    5. Load upgraded configuration to 6.2 test cluster
  3. Upgrade procedure finished
  4. Start the test cluster
  5. Test the test cluster
  6. Stop the production cluster, or let the jobs finish (OPTIONAL)
  7. Switch the test cluster to be the new production cluster (OPTIONAL)

Operations for Upgrade Cluster Upgrade Type

  1. Back up the production cluster configuration and spool database (inst_sge -bup, etc.)
  2. Save the configuration of the production cluster using the new backup script (install the new backup script and run it)
    1. Run the save operation of the backup script, which is bundled on the DVD or can be downloaded.
  3. Stop and remove the cluster
  4. Install new binaries (test coexistence of old and new packages)
  5. Start upgrade procedure
    1. Create default configuration on the cluster
    2. Ask questions:
      1. Cluster name
      2. Enable SMF support on Solaris
      3. Enable Sun ServiceTags
      4. Enable IJS (interactive jobs support)
      5. Enable other features of the 6.2 version.
    3. Upgrade the saved configuration to 6.2 format
      1. Some default values have changed
      2. Jobs and advanced reservations will be lost
    4. Load the upgraded configuration into the cluster
  6. Upgrade procedure finished
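
The step "Upgrade the saved configuration to 6.2 format" has to cope with changed default values. One possible policy is to keep every value the administrator set and to fill in defaults only for fields that are missing from the saved configuration. The sketch below illustrates that policy under the assumption that a configuration can be treated as a flat mapping of field names to values; the function name and the sample field names are hypothetical.

```python
def upgrade_settings(saved, new_defaults):
    """Merge a saved configuration with the defaults of the new version.

    Values present in `saved` always win; fields that exist only in
    `new_defaults` are added. Returns the merged settings and the sorted
    names of the fields filled from defaults, so that they can be
    reported to the administrator.
    """
    merged = dict(new_defaults)
    merged.update(saved)
    added = sorted(set(new_defaults) - set(saved))
    return merged, added


# Example with made-up field names and values.
saved = {"field_a": "custom", "field_b": "10"}
new_defaults = {"field_b": "20", "field_c": "NONE"}
merged, added = upgrade_settings(saved, new_defaults)
```

Returning the list of added fields makes it easy to show the administrator which settings now carry new default values.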

Reliability, Availability, Serviceability (RAS)

  1. Back up the old configuration and spool database (inst_sge -bup, etc.)
  2. Save configuration (install new backup script and run it), host_aliases, qtask, accounting, etc. (-bup stuff from common dir)
  3. (single SGE_ROOT) Stop and remove cluster (uninstall old RC-scripts, SMF?, JobIds start with 1, last number or last number + offset, same for ARs)
  4. Install new binaries (test coexistence of old and new packages)
  5. Start upgrade procedure (ask questions like cluster name, SMF, ServiceTags, IJS, etc.)
  6. Upgrade the configuration to 6.2 format
    1. Some default values have changed (use the original user-defined value if it exists, provide defaults for missing fields, and document the new recommended values in the upgrade procedure)
    2. Jobs and advanced reservations will be lost
    3. Reporting data (make sure the old reporting file is completely processed by the old cluster; DBwriter needs to be upgraded before we start new execds; we need good documentation!)
      1. New database for the new cluster
      2. Drain old reporting file, upgrade database and switch to new cluster
    4. Need to remove load_values from execd configuration (import fails otherwise)
    5. New qmaster <=> old execd -> execd dump
    6. IJS - add new global defaults, remove per-host values (document that if IJS is chosen, the local configuration will be lost!)
  7. Create default configuration (recreate bootstrap first)
  8. Load configuration to 6.2+ cluster (start needed services)
  9. Remove local spool directories for 2 clusters
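
Step 6.4 above requires removing load_values from each saved execd configuration before import. A minimal sketch follows. It assumes the configuration is saved as plain text with one `name value` attribute per line and a trailing backslash marking continuation lines; this file layout and the function name `strip_attribute` are assumptions, not taken from this specification.

```python
def strip_attribute(text, name="load_values"):
    """Return `text` without the attribute `name` and its continuation lines."""
    kept = []
    dropping = False   # True while inside the attribute being removed
    continued = False  # True if the previous line ended with a backslash
    for line in text.splitlines():
        if not continued:
            # A new attribute starts here; decide whether to drop it.
            fields = line.split(None, 1)
            dropping = bool(fields) and fields[0] == name
        continued = line.rstrip().endswith("\\")
        if not dropping:
            kept.append(line)
    return "\n".join(kept) + ("\n" if text.endswith("\n") else "")


# Made-up sample content for illustration only.
sample = (
    "hostname node1\n"
    "load_values arch=x,num_proc=2, \\\n"
    "            mem_total=1024M\n"
    "processors 2\n"
)
cleaned = strip_attribute(sample)
```

Tracking continuation lines separately ensures that a wrapped load_values entry is removed as a whole and that wrapped lines of other attributes are kept.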

User Experience

The ability to test the upgrade process before switching production will improve the user experience.

Manufacturing

Quality Assurance

Security & Privacy

Migration Path

Documentation

Installation

Packaging

Issues/Risks and Proposed Mitigation

| *Category* | *Risk* | *Impact (L/M/H)* | *Probability (L/M/H)* | *Mitigation Plan* | *Owner* |

Component Descriptions

Component SGE v6.0.x

Overview

Functionality

Interfaces

Other Requirements

Component SGE v6.1.x

Overview

Functionality

Interfaces

Other Requirements

Component SGE v6.2

Overview

Functionality

Interfaces

Other Requirements

Component DB Writer

Overview

Functionality

Interfaces

Other Requirements

Component Reporting

Overview

Functionality

Interfaces

Other Requirements

Appendix

Appendix #1 Name