GridEngine (SGE) is an open source batch-queuing system, supported by Sun Microsystems, which manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. Gridengine is responsible for accepting, scheduling, dispatching and managing the remote execution of large numbers of standalone, parallel or interactive user jobs.
Full documentation on using the version of gridengine installed can be found on the SunSource website. These are some basic notes to get people up and running. You should also look at and contribute to the wiki.
You can only access SGE from the head nodes, and there is currently one queue set up per cluster.
Each cluster is configured with one default queue and two Parallel Environments (make and lammpi). The lammpi environment generates a set of temporary ssh keys, starts ssh-agent and installs the keys into it; this allows ssh access within the allocated nodes without needing Kerberos credentials. The environment then uses lamboot to start a LAM multicomputer running over the allocated nodes, so the user should only need to run their MPI job using mpirun. When the job finishes the multicomputer is halted, the ssh agent is killed and the keys are destroyed.
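The start and stop sequence described above can be pictured with the following dry-run sketch. It is not the site's actual parallel environment script, which is not reproduced on this page; the function names, the key directory and the host file path are hypothetical, and each step is only printed, never executed.

```shell
#!/bin/sh
# Dry-run sketch of the lammpi start/stop sequence.
# NOT the site's real scripts: every step is printed, nothing is executed.
step() { echo "+ $*"; }

pe_start() {
    keydir=$1      # hypothetical temporary key directory
    hostfile=$2    # hypothetical LAM host file listing the allocated nodes
    step mkdir "$keydir"
    step ssh-keygen -t rsa -f "$keydir/tmpid"   # temporary key pair
    step ssh-agent -s                           # start the agent
    step ssh-add "$keydir/tmpid"                # load the key into the agent
    step lamboot -v "$hostfile"                 # boot the LAM multicomputer
}

pe_stop() {
    keydir=$1
    step lamhalt                                # halt the multicomputer
    step ssh-agent -k                           # kill the agent
    step rm "$keydir/tmpid" "$keydir/tmpid.pub" # destroy the keys
    step rmdir "$keydir"
}

pe_start /tmp/keys.example /tmp/lamhostfile
pe_stop /tmp/keys.example
```

The user's own batch script runs between the two phases and only needs to call mpirun.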
If you require additional environments, please submit an RT ticket detailing your requirements.
[bw530n01]iainr: qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
bw530n01                lx24-x86        1  0.02    2.0G  458.0M    1.0G   38.1M
bw530n02                lx24-x86        1  0.00    2.0G  299.1M    1.0G   33.7M
bw530n03                lx24-x86        1  0.00    2.0G  369.0M    1.0G   39.7M
bw530n04                lx24-x86        1  1.01    2.0G  692.0M    1.0G   53.1M
bw530n05                lx24-x86        1  1.00    2.0G  686.0M    1.0G   42.2M
bw530n06                lx24-x86        1  1.00    2.0G  684.1M    1.0G   45.0M
bw530n07                lx24-x86        1  1.00    2.0G  693.5M    1.0G   44.6M
bw530n08                lx24-x86        1  1.06    2.0G  686.8M    1.0G   43.2M
bw530n09                lx24-x86        1  1.00    2.0G  691.8M    1.0G   41.1M
bw530n10                lx24-x86        1  1.03    2.0G  679.1M    1.0G   43.8M
bw530n11                lx24-x86        1  1.06    2.0G  684.1M    1.0G   47.8M
bw530n12                lx24-x86        1  1.01    2.0G  679.2M    1.0G   48.2M
bw530n13                lx24-x86        1  1.00    2.0G  680.7M    1.0G   48.5M
bw530n14                lx24-x86        1  1.04    2.0G  677.2M    1.0G   65.7M
bw530n15                lx24-x86        1  1.01    2.0G  681.5M    1.0G   42.3M
bw530n16                lx24-x86        1  1.00    2.0G  689.1M    1.0G   43.2M
lutzow                  lx24-x86        1  1.66    2.0G  245.7M    1.0G   44.5M

If you can't at least run qhost and get something like the above then something is wrong.
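As a quick sanity check, qhost output can be summarised with a few lines of shell. The sketch below is only an illustration: the function name summarise_hosts is made up here, the sample rows are a subset of the listing above, and the threshold of 1 for a "busy" node is an assumption based on each node having one CPU.

```shell
#!/bin/sh
# Summarise qhost-style text read from standard input.
# Skips the header line, the separator line and the "global" pseudo-host.
summarise_hosts() {
    awk 'NR > 2 && $1 != "global" { n++; if ($4 + 0 >= 1) busy++ }
         END { printf "%d hosts, %d with load >= 1\n", n, busy }'
}

# Sample input copied from the listing above (three of the hosts).
summarise_hosts <<'EOF'
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
bw530n01 lx24-x86 1 0.02 2.0G 458.0M 1.0G 38.1M
bw530n04 lx24-x86 1 1.01 2.0G 692.0M 1.0G 53.1M
lutzow lx24-x86 1 1.66 2.0G 245.7M 1.0G 44.5M
EOF
```

On a head node the same function could be fed directly from the real command, as in "qhost | summarise_hosts".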
The qsub command is used to submit simple jobs (jobs that run on one node).
[bw530n01]iainr: qsub tmp
Your job 194 ("tmp") has been submitted.
[bw530n01]iainr:
Standard output is written to <jobname>.o<jobnumber> and standard error is written to <jobname>.e<jobnumber>, so the above job submission would produce two files, tmp.o194 and tmp.e194.
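When submitting from a script it can be handy to work out the output file names automatically. The following is a minimal sketch that assumes the confirmation message always has the form shown above; the helper name job_id_from_qsub is made up for this example, and the message is fed in as a fixed string rather than by calling qsub.

```shell
#!/bin/sh
# Extract the job number from a qsub confirmation line read on standard input.
# Assumes the message format shown above: Your job <number> ("<name>") ...
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sample message copied from the session above.
msg='Your job 194 ("tmp") has been submitted.'
id=$(printf '%s\n' "$msg" | job_id_from_qsub)

# Derive the standard output and standard error file names.
echo "tmp.o$id tmp.e$id"
```

In real use the message would come from the submission itself, for example "qsub tmp | job_id_from_qsub".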
In order to submit LAM MPI jobs you have to select the correct parallel environment (lammpi) using the -pe option, and specify the number of nodes you require and the batch script to run. Gridengine will set up a multicomputer on the appropriate nodes, e.g. for the batch script runme:
#!/bin/sh
mpirun -v myapp

With myapp being the LAM schema file:

h /home/iainr/master
C -s h /home/iainr/slave

This will generate the following files:
runme.e198 (prolog error file)
Empty
runme.o198 (prolog output file)
1878 /home/iainr/master running on local
1879 /home/iainr/slave running on n0 (o)
13909 /home/iainr/slave running on n1
11701 /home/iainr/slave running on n2
master: allocating block (0, 0) - (19, 19) to process 1
master: allocating block (20, 0) - (39, 19) to process 2
master: allocating block (40, 0) - (59, 19) to process 3
master: allocating block (60, 0) - (79, 19) to process 1
master: allocating block (80, 0) - (99, 19) to process 2
master: allocating block (100, 0) - (119, 19) to process 3
master: allocating block (120, 0) - (139, 19) to process 1
master: allocating block (140, 0) - (159, 19) to process 2
master: allocating block (160, 0) - (179, 19) to process 3
master: allocating block (180, 0) - (199, 19) to process 1
master: allocating block (200, 0) - (219, 19) to process 2
master: allocating block (220, 0) - (239, 19) to process 3
master: allocating block (240, 0) - (259, 19) to process 1
master: allocating block (260, 0) - (279, 19) to process 2
master: allocating block (280, 0) - (299, 19) to process 3
master: allocating block (300, 0) - (319, 19) to process 1
master: allocating block (320, 0) - (339, 19) to process 2
...
master: allocating block (360, 500) - (379, 511) to process 2
master: allocating block (380, 500) - (399, 511) to process 3
master: allocating block (400, 500) - (419, 511) to process 1
master: allocating block (420, 500) - (439, 511) to process 2
master: allocating block (440, 500) - (459, 511) to process 3
master: allocating block (460, 500) - (479, 511) to process 1
master: allocating block (480, 500) - (499, 511) to process 2
master: allocating block (500, 500) - (511, 511) to process 3
master: done.
runme.pe198 (PE error file)
lamboot: attempting to execute "/usr/bin/ssh -x -a bw530n11.inf.ed.ac.uk -n echo $SHELL"
lamboot: got remote shell /bin/bash
lamboot: attempting to execute "/usr/bin/ssh -x -a bw530n11.inf.ed.ac.uk -n hboot -t -c lam-conf.lam -d -s -I "-H 129.215.18.73 -P 60259 -n 1 -o 0 ""
lamboot: attempting to execute "/usr/bin/ssh -x -a bw530n12.inf.ed.ac.uk -n echo $SHELL"
lamboot: got remote shell /bin/bash
lamboot: attempting to execute "/usr/bin/ssh -x -a bw530n12.inf.ed.ac.uk -n hboot -t -c lam-conf.lam -d -s -I "-H 129.215.18.73 -P 60259 -n 2 -o 0 ""
runme.po198 (PE output file)
Starting sge-lam
created directory /tmp/keys.WanWoo:
Enter passphrase for /tmp/keys.WanWoo/tmpid: Identity added: /tmp/keys.WanWoo/tmpid (/tmp/keys.WanWoo/tmpid)
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 1828 lamd -H 129.215.18.73 -P 60259 -n 0 -o 0 -d
hboot: process schema = "/etc/lam/lam-conf.lam"
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: /tmp/198.1.all.q/lamhostfile
lamboot: opening hostfile /tmp/198.1.all.q/lamhostfile
lamboot: found the following hosts:
lamboot: n0 bw530n07.inf.ed.ac.uk
lamboot: n1 bw530n11.inf.ed.ac.uk
lamboot: n2 bw530n12.inf.ed.ac.uk
lamboot: resolved hosts:
lamboot: n0 bw530n07.inf.ed.ac.uk --> 129.215.18.73
lamboot: n1 bw530n11.inf.ed.ac.uk --> 129.215.18.77
lamboot: n2 bw530n12.inf.ed.ac.uk --> 129.215.18.78
lamboot: found 3 host node(s)
lamboot: origin node is 0 (bw530n07.inf.ed.ac.uk)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -s -I " -H 129.215.18.73 -P 60259 -n 0 -o 0 ""
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 13895 lamd -H 129.215.18.73 -P 60259 -n 1 -o 0 -d
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 11692 lamd -H 129.215.18.73 -P 60259 -n 2 -o 0 -d
lamboot completed successfully
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
Shutting down LAM
lamhalt: sending HALT to n1 (bw530n11.inf.ed.ac.uk)
lamhalt: sending HALT to n2 (bw530n12.inf.ed.ac.uk)
lamhalt: waiting for HALT ACKs from remote LAM daemons
lamhalt: received HALT ACK from n1 (bw530n11.inf.ed.ac.uk)
lamhalt: received HALT ACK from n2 (bw530n12.inf.ed.ac.uk)
lamhalt: sending final HALT to n0 (bw530n07.inf.ed.ac.uk)
lamhalt: local LAM daemon halted
LAM halted
1824 is deceased.
keydir is /tmp/keys.WanWoo
unlink /tmp/keys.WanWoo/tmpid
unlink /tmp/keys.WanWoo/tmpid.pub
rmdir /tmp/keys.WanWoo
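The submission command for the runme example is not shown above. A plausible form, using the -pe option described earlier, is sketched below; the node count of 3 is an assumption inferred from the "found 3 host node(s)" line in the PE output, and the helper names pe_submit_cmd and pe_job_files are made up for this illustration. The sketch prints the command rather than running it, then lists the four files a job of that name and number would leave behind.

```shell
#!/bin/sh
# Print (rather than run) the qsub command for a parallel-environment job.
# Arguments: PE name, number of nodes, batch script.
pe_submit_cmd() {
    echo "qsub -pe $1 $2 $3"
}

# List the four files described above for a given job name and job number:
# job error, job output, PE error and PE output.
pe_job_files() {
    for suffix in e o pe po; do
        echo "$1.$suffix$2"
    done
}

pe_submit_cmd lammpi 3 runme
pe_job_files runme 198
```

Checking the .pe and .po files first is usually the quickest way to tell whether a failure came from setting up the multicomputer or from the MPI job itself.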
Gridengine supports interactive jobs through the qrsh and qlogin commands. The qsh command will not work in the current configuration; however, it is redundant, as X applications can be run from qrsh or qlogin sessions.
Both qrsh and qlogin can be used to set up interactive parallel environments using -pe as above; however, you should bear in mind that it may take several minutes or longer to allocate hosts and set up the environment. In the case of qlogin you may find spurious error messages being displayed. As long as it claims to be still scheduling the job, please ignore these.
Typical qlogin session
[bw530n01]iainr: qlogin -pe lammpi 4
waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (4 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (4 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (4 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (3 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (4 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (4 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (3 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (3 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (3 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4
Your interactive job 210 has been successfully scheduled.
Your interactive job 210 has been successfully scheduled.
Establishing /opt/sge/bin/lx24-x86/sshscript session to host bw530n07.inf.ed.ac.uk ...
ssh -p 32795 bw530n07.inf.ed.ac.uk
[bw530n07]iainr: