Install and configure Grid Engine in a heterogeneous environment on Linux and Windows with MPICH2

From GridWiki

Author: Jacek Strzelczyk <jacek.strzelczyk@gmail.com>

Basic software

  • Linux machines: Fedora Core 3
  • Windows machines: Windows 2000 SP4
  • Microsoft Services For Unix 3.5
  • Grid Engine 6.1u4
  • MPICH2 1.0.7

Developer's software

  • gcc 3.4.4
  • MS Visual C++ 2005 SP1
  • Dev-Cpp 4.9.9.2

Pre-install requirements

Sun Grid Engine (SGE), previously known as CODINE (COmputing in DIstributed Networked Environments) or GRD (Global Resource Director), is an open source batch-queuing system developed and supported by Sun Microsystems. Sun also sells a commercial product based on SGE, known as N1 Grid Engine (N1GE).

SGE is typically used on a computer farm or high-performance computing (HPC) cluster and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses.

SGE is the foundation of the Sun Grid utility computing system, which was made available over the Internet in the United States in 2006 and later became available in many other countries. You should also read the research paper first.

NIS

NIS is a service that distributes information that has to be known throughout the network to all machines on the network. It is very helpful for maintaining a coherent user structure on all the nodes in the grid. The full NIS HOWTO can be found here. For the purposes of this installation one user account is needed. Name it 'sgeadmin' and add it to the NIS database. Set its $HOME to “/usr/SGE”. The file /etc/hosts can also be added to NIS.


NFS

Having a common filesystem from which to install and run SGE is a simple and flexible solution. It can be achieved in many ways; I'll focus on NFS. The full NFS HOWTO can be found here. The easiest way is to install the NFS server on the machine intended to be the SGE master host. The rest of the hosts will be NFS clients.

NFS on Linux

NFS server

To prepare and share the SGE directory, run:

$ mkdir /usr/SGE
$ echo "/usr/SGE  M1(rw,no_root_squash,async) M2(rw,no_root_squash,async) M3(rw,no_root_squash,async)" >> /etc/exports
# M1, M2 and M3 are the names of the client hosts (they need to be in /etc/hosts).

Restart NFS.
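
With more than a few clients, typing the same options for every host gets error-prone. As a minimal sketch, the export line shown above could be generated from a list of host names. The function name exports_line is hypothetical, and M1, M2, M3 are the same placeholder host names as above.

```shell
#!/bin/sh
# Sketch: print one /etc/exports line for a directory and any number of clients.
# exports_line is a hypothetical helper name, not part of NFS or SGE.
exports_line() {
    dir=$1
    opts=$2
    shift 2
    line=$dir
    for host in "$@"; do
        line="$line $host($opts)"
    done
    printf '%s\n' "$line"
}

# Example (run as root to append to the real file):
#   exports_line /usr/SGE rw,no_root_squash,async M1 M2 M3 >> /etc/exports
```

The function only prints the line, so it can be inspected before anything is appended to /etc/exports.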

NFS client

$ mkdir /usr/SGE
$ chown sgeadmin /usr/SGE
$ mount -t nfs masternode:/usr/SGE /usr/SGE  # should also be added to /etc/fstab with the suid option
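
To make the mount permanent, a matching line has to go into /etc/fstab. The sketch below appends it only if it is not there yet, so it is safe to run more than once. The helper name add_fstab_entry is hypothetical, and the option list "rw,suid" is my reading of the note above, not a tested recommendation.

```shell
#!/bin/sh
# Sketch: add the SGE NFS mount to an fstab-style file only if it is missing.
# add_fstab_entry is a hypothetical helper; "masternode" is the server name
# used in this guide.
add_fstab_entry() {
    fstab=$1
    entry='masternode:/usr/SGE /usr/SGE nfs rw,suid 0 0'
    grep -qF 'masternode:/usr/SGE' "$fstab" 2>/dev/null \
        || printf '%s\n' "$entry" >> "$fstab"
}

# Example (as root):
#   add_fstab_entry /etc/fstab
```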

NFS on Windows

To mount the network drive in Windows, log in as Administrator and type:

>net use X: \\masternode\usr\SGE

To mount it automatically at each system boot, use AutoExNT (http://support.microsoft.com/kb/243486):

a) Using a text editor (such as Notepad), create a batch file named Autoexnt.bat and put the commands you want to run at startup in it. In this case that is:

@net use X: \\masternode\usr\SGE

b) Copy the Autoexnt.bat file you just created, in addition to the Autoexnt.exe, Servmess.dll, and Instexnt.exe files located in the Resource Kit CD-ROM (or here: http://www.dynawell.com/reskit/microsoft/win2000/autoexnt.zip) to the C:\WINNT\System32 folder on your computer.

c) At the command prompt, type instexnt install, and then press ENTER.

You should then receive the following message:

CreateService AutoExNT SUCCESS with InterActive Flag turned OFF 

This creates the AutoExNT service in Windows, which automatically mounts /usr/SGE as the X: drive at boot time. To make sure this happens only after the network connections are up, add a dependency in the Windows registry. Open the registry editor regedt32 (not regedit!). Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AutoExNT and add a multi-string (REG_MULTI_SZ) value named “DependOnService” with the value “LanmanWorkstation”.

SGE installation

SGE Linux master host

First, create a /usr/SGE/.rhosts file listing all hosts in your SGE installation and chmod it to 600. Then check that rsh works on the Linux machines by running, as sgeadmin:

$ rsh otherlinuxhost date
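
Checking every host by hand is tedious, so the same test can be looped over the host list. This is only a sketch: check_rsh is a hypothetical helper, and it assumes the file has one host name per line in the first column, as in a .rhosts file.

```shell
#!/bin/sh
# Sketch: run a remote "date" on every host named in a file and report
# which hosts answer. check_rsh is a hypothetical helper name.
# RSH defaults to rsh; it can be overridden without touching the loop.
check_rsh() {
    hostfile=$1
    failed=0
    while read -r host rest; do
        # Skip empty lines and comments.
        case $host in ''|'#'*) continue ;; esac
        # stdin is redirected so the remote command cannot eat the host list.
        if ${RSH:-rsh} "$host" date </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done < "$hostfile"
    return $failed
}

# Example (as sgeadmin):
#   check_rsh /usr/SGE/.rhosts
```

The function returns a non-zero status if any host fails, which makes it usable in other scripts.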

Then, add three lines to /etc/services:

sge_execd	6444/tcp
sge_qmaster	6445/tcp
sge_commd      6446/tcp
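
Since every host needs the same three entries, a quick check helps to find machines where they were forgotten. A minimal sketch, with missing_services as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the SGE service names that are not defined in a
# services-style file. missing_services is a hypothetical helper name.
missing_services() {
    file=$1
    for name in sge_execd sge_qmaster sge_commd; do
        # A service counts as defined if it is the first field of some line.
        awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$file" \
            || echo "$name"
    done
}

# Example:
#   missing_services /etc/services
```

No output means all three entries are present.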

Download SGE:

Common Files: http://gridengine.sunsource.net/download/SGE61/sge-6.1-common.tar.gz

Linux files: http://gridengine.sunsource.net/download/SGE61/sge-6.1-bin-lx24-x86.tar.gz

Unpack them:

# su - sgeadmin
$ mv sge-6.1-common.tar.gz /usr/SGE/
$ mv sge-6.1-bin-lx24-x86.tar.gz /usr/SGE/
$ cd /usr/SGE/
$ for f in sge-6.1-*.tar.gz; do tar -xzvf "$f"; done   # tar takes one archive per call

Before starting the installation procedure, the file util/arch needs to be edited. Change line 248 to:

3*|4*|5*

and then:

$ su -
# cd /usr/SGE
# ./install_qmaster

Full installation procedure is described in the SGE Docs: http://docs.sun.com/app/docs/doc/817-6118/emrar?q=N1GE&a=view.

SGE Linux exec hosts

The procedure is described in the SGE Docs: http://docs.sun.com/app/docs/doc/817-6118/emrar?q=N1GE&a=view.

SGE Windows exec hosts

In short, step by step:

  • Create user 'sgeadmin' locally.
  • Download Services For Unix (SFU) from here.
  • Turn off DEP by adding “/noexecute=alwaysoff” to C:\boot.ini
  • Run the SFU installer and add the Interix SDK and the Interix GNU SDK to the default installation.
  • After the installation is complete, check that the User Name Mapping service is running.
  • Go to Start -> Programs -> Windows Services for Unix -> Configuration -> User Name Mapping, choose NIS and Show User Maps. Then map the Unix user sgeadmin to the Windows user of the same name.
  • Make the X: drive available as /usr/SGE in Interix:
%ls -l /dev/fs    # should show also X
%ln -s /dev/fs/X /usr/SGE
  • Serve telnet and rsh from Interix: log in to Windows as Administrator, permanently disable the Windows telnet and rsh daemons, uncomment the rsh and telnet lines in /etc/inetd.conf in Interix, and restart inetd:
%ps -ef | grep inetd
%kill -1 <PIDofINETD>
  • Check the Windows firewall and open ports 23 (telnet) and 514 (shell). Use nmap to verify that both ports are reachable.
  • Add all grid machines to /etc/hosts in Interix.
  • Add the line “64.235.106.194 ftp.interopsystems.com” to C:\WINNT\system32\drivers\etc\hosts so that ftp can reach this site (otherwise name resolution fails).
  • Install bash:
%pkg_update -L bash
  • Create a $HOME/.rhosts file in Interix listing all hosts in your SGE installation.
  • Download the Windows-specific SGE files (sge61u4_addarchs_targz.zip) from Sun's web page (click Download on the left and then choose the right file). They are available after registration.
  • Unpack them and copy them to the proper directories.
  • Run the installation:
%/usr/SGE/install_execd

MPICH2

MPICH2 on Linux

  • Download mpich2-1.0.7.tar.gz archive
  • Unpack, configure and install:
$ tar -xzf mpich2-1.0.7.tar.gz
$ cd mpich2-1.0.7
$ mkdir /usr/SGE/mpich2
$ ./configure --prefix=/usr/SGE/mpich2 --with-pm=smpd --with-pmi=smpd
$ make
$ make install
$ cd $HOME
$ echo "phrase=behappy" > .smpd
  • Add /usr/SGE/mpich2/bin to $PATH.
  • Check whether the smpd daemon is running; if not, start it with smpd -s.
  • To compile MPI programs:
$ gcc mpi-test.c -ompi-test -I/usr/SGE/mpich2/include -L/usr/SGE/mpich2/lib -lmpich
  • Create credentials file:
$ printf 'sgeadmin\nsgeadmin\n' > /usr/SGE/credentials   # first line: account, second line: password
$ chmod 600 /usr/SGE/credentials
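
For the $PATH step, something like the following could go into the login profile of sgeadmin (for example ~/.bash_profile; which file is read depends on the shell, so treat that as an assumption). It is a small sketch that prepends the MPICH2 bin directory only if it is not already on the path.

```shell
#!/bin/sh
# Sketch: prepend the MPICH2 bin directory to PATH exactly once.
# MPICH2_BIN matches the --prefix used for configure in this guide.
MPICH2_BIN=/usr/SGE/mpich2/bin
case ":$PATH:" in
    *":$MPICH2_BIN:"*) ;;               # already on the path, nothing to do
    *) PATH="$MPICH2_BIN:$PATH" ;;
esac
export PATH
```

The guard keeps the path from growing every time the profile is sourced.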

MPICH2 on Windows

  • Download and install Visual C++ 2005 SP1
  • Download and install MPICH2 for Windows
  • Check that the MPICH2 Process Manager daemon is running.
  • To compile programs on Windows, install Dev-Cpp (or another programming environment).
  • Compile the source code in Dev-Cpp with the MPICH2 headers and libraries: -I"C:\Program Files\MPICH2\include" -L"C:\Program Files\MPICH2\lib" -lmpi
  • Copy the compiled program into C:\WINNT\system32 (or another directory on the Windows PATH).

Add MPICH2 as parallel environment to SGE

Use qmon from SGE to add MPICH2 manually as a parallel environment (PE). A description of the PE can be found here: http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
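
If you prefer the command line to qmon, the PE can also be loaded from a file with qconf. The sketch below only writes such a file; the two qconf calls are left as comments because they need SGE manager rights. The PE name "mpich2", the slot count and both script paths are assumptions for illustration. The real start and stop scripts, and the arguments they take, come from the howto linked above.

```shell
#!/bin/sh
# Sketch: write a parallel environment definition that "qconf -Ap" can load.
# PE name, slot count and script paths are assumptions, not tested values.
pe_file=${TMPDIR:-/tmp}/mpich2.pe
cat > "$pe_file" <<'EOF'
pe_name            mpich2
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/SGE/mpich2_smpd/startmpich2.sh $pe_hostfile
stop_proc_args     /usr/SGE/mpich2_smpd/stopmpich2.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
EOF
echo "PE definition written to $pe_file"

# As an SGE manager, register the PE and attach it to a queue:
#   qconf -Ap "$pe_file"
#   qconf -aattr queue pe_list mpich2 all.q
```

The quoted 'EOF' keeps $pe_hostfile and $round_robin literal, which is what SGE expects; they are expanded by SGE, not by the shell.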

Configure Interix

A little configuration is needed so that Interix starts only after the network drive with SGE (drive X:) has been mounted on the Windows machine. SGE can then be started automatically by an Interix startup script. To do that, add a dependency to the Windows registry:

  • Open regedt32.
  • Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Interix and add a multi-string (REG_MULTI_SZ) value named “DependOnService” with the value “AutoExNT”.
  • Copy and adjust one of the Interix startup scripts from /etc/init.d so that it starts SGE.
  • Add symbolic links to the SGE start script in /etc/rc2.d.
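
As a rough idea of what the adjusted startup script could look like, here is a minimal sketch. It assumes that install_execd created the usual sgeexecd script under /usr/SGE/default/common ("default" being the standard cell name); check the path on your installation. The function name sge_rc is hypothetical.

```shell
#!/bin/sh
# Sketch of an rc-style wrapper around the SGE execution daemon script.
# SGEEXECD is an assumed path; adjust it to your cell name.
SGEEXECD=${SGEEXECD:-/usr/SGE/default/common/sgeexecd}

sge_rc() {
    case $1 in
        start|stop)
            if [ ! -x "$SGEEXECD" ]; then
                echo "sgeexecd not found: $SGEEXECD" >&2
                return 1
            fi
            "$SGEEXECD" "$1"
            ;;
        *)
            echo "usage: sge_rc {start|stop}" >&2
            return 2
            ;;
    esac
}

# In a real /etc/init.d script the last line would be:
#   sge_rc "$1"
```

The symbolic link could then be something like /etc/rc2.d/S99sge pointing at the script; the name S99sge is only an example chosen so that SGE starts late in the boot sequence.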

After the Windows machine restarts, all network connections should then be up, the network drive X: (with SGE) should be mounted, and the Interix startup script should start the SGE exec daemon. All of this happens automatically, without anyone having to log in.


Post-install check

Ok, so now you should have:

  • Linux master host with: NIS and NFS servers, SGE master and SGE exec daemons and smpd running.
$ pgrep -l sge
2334 sge_execd
2399 sge_qmaster
2656 sge_schedd
$ pgrep -l smpd
2773 smpd
  • Linux execution hosts with: mounted /usr/SGE from master host, SGE exec and smpd daemons running.
  • Windows execution hosts with: mounted /usr/SGE as network drive X:, smpd daemon running (from Windows version of MPICH2) and Interix with SGE exec daemon.
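
The pgrep checks above can be wrapped in a small helper that prints only what is missing. This is a sketch for the Linux hosts; missing_daemons is a hypothetical name, and I have not tried pgrep under Interix.

```shell
#!/bin/sh
# Sketch: print the names of expected daemons that are not running.
# missing_daemons is a hypothetical helper built on pgrep -x (exact name match).
missing_daemons() {
    for name in "$@"; do
        pgrep -x "$name" >/dev/null 2>&1 || echo "$name"
    done
}

# On the master host:
#   missing_daemons sge_qmaster sge_schedd sge_execd smpd
# On a Linux execution host:
#   missing_daemons sge_execd smpd
```

No output means every daemon in the list was found.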

Test

To test the installation, create a simple MPI program (or use an example from mpich2/examples) and compile it on both platforms:

  • on Linux: gcc mpi-test.c -ompi-test -I/usr/SGE/mpich2/include -L/usr/SGE/mpich2/lib -lmpich
  • on Windows: use Dev-Cpp with the arguments -I"C:\Program Files\MPICH2\include" -L"C:\Program Files\MPICH2\lib" -lmpi

Then copy the resulting binaries to one of the directories on the PATH, both on Windows (e.g. C:\WINNT\system32) and on Linux (e.g. /usr/bin/). Create a job script (examples are in /usr/SGE/examples) that executes mpirun:

mpirun -n $NSLOTS -machinefile $TMPDIR/machines -pwdfile /usr/SGE/credentials mpi-test
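
Put together, a complete job script might look like the sketch below. It writes the script to a file so it can be reviewed before submitting. The PE name "mpich2" and the slot count 4 are assumptions and must match whatever parallel environment you defined; the mpirun line is the one shown above.

```shell
#!/bin/sh
# Sketch: generate an SGE job script for the MPI test program.
# The PE name "mpich2" and the slot count are assumptions.
job=${TMPDIR:-/tmp}/mpi-test.sh
cat > "$job" <<'EOF'
#!/bin/sh
#$ -N mpi-test
#$ -S /bin/sh
#$ -cwd
#$ -pe mpich2 4
mpirun -n $NSLOTS -machinefile $TMPDIR/machines -pwdfile /usr/SGE/credentials mpi-test
EOF
echo "job script written to $job"

# Submit and watch the job from a host with the SGE client commands:
#   qsub "$job"
#   qstat -f
```

Inside the job, $NSLOTS and $TMPDIR are set by SGE at run time, which is why the here-document is quoted and the variables are not expanded when the file is written.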

It should give you proper results (I hope...)!