
Beowulf FAQ

  1. Overview.

  2. Using the beowulf.
  3. Gridengine.

Overview and Administrivia.

This section contains general information about accessing the Beowulves.

How do I get an account?

Beowulf accounts are basic DICE accounts with additional capabilities. If you do not have a DICE account, please request one before, or at the same time as, asking for a Beowulf account.

Requests for beowulf accounts should initially be submitted through the support request form.

Visitors should obtain beowulf accounts through their sponsors.

What happened to my home directory?

Accounts on the beowulf have home directories local to the cluster, mainly to improve the reliability of the cluster. Hardware failure on home directory servers, or problems with the intervening network infrastructure, can lead to stale NFS mounts, which can only be cleared with a lot of intervention from support or by rebooting the nodes. Mounting directories from a server directly connected to the switch also gives a slight performance gain.

Why can't I see <path> on the nodes?

Generally, the only network filesystems we make available to the nodes are /home and /group/beowulf, both mounted from har, which is connected to the same switch as most of the nodes. /group/beowulf is also available on any DICE machine. If we provided the full DICE filespace to the nodes, and jobs running on all 64 of the lion nodes each read a 10GB file, this would generate 640GB of traffic across EDLAN and affect other users. The jobs would also take a performance hit: it would be faster to copy the files to har once and then pull them from har to the individual nodes. If your jobs do multiple reads or writes on large files, it is much faster to copy the files into /disk/scratch on the nodes themselves. Finally, if an NFS server were to go down for any reason for an extended time, any NFS mounts which were being written to would go stale. Usually it is only possible to clear stale mounts by rebooting the client machines. Having just one fileserver minimises the risk of this happening.
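
The advice above about copying large files onto node-local disk can be sketched as a small shell helper. This is only an illustration: the function name stage_in and the example paths in the comments are assumptions, and the helper does not notice if the source file changes after the first copy.

```shell
#!/bin/sh
# stage_in SRC DEST_DIR
# Copy SRC into DEST_DIR (for example a directory under /disk/scratch) unless a
# copy is already there, then print the path of the local copy.
# "stage_in" is a hypothetical helper name, not part of the cluster software.
stage_in() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    local_copy="$dest_dir/$(basename "$src")"
    # Copy only when the local copy is missing, so that later jobs on the
    # same node reuse it instead of pulling the file over NFS again.
    if [ ! -f "$local_copy" ]; then
        cp "$src" "$local_copy" || return 1
    fi
    printf '%s\n' "$local_copy"
}

# Possible use inside a job script (paths and program name are illustrative):
#   input=$(stage_in "/group/beowulf/scratch/$USER/big.dat" "/disk/scratch/$USER")
#   ./my_program "$input"
```

Each node then reads the file over the network once and does all later reads from its own disk.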

What if I want to use data shared by a group?

There's space in /group/beowulf/scratch which anyone can use; please use it sensibly.

Why isn't there more diskspace on the server?

The server is intended for staging data on and off the cluster. It is not intended for long-term storage, and the data is not archived.

What about quotas?

Your beowulf home directory has a quota separate from your DICE home directory quota.

How can I make my jobs run faster?

Try to identify and avoid bottlenecks:

Using the beowulf

Each cluster has a head node. ssh on to the head node and use the gridengine qsub command to submit your jobs. You can access your standard DICE home directory via /nethome/username.

For full information on using gridengine see the official documentation and the local wiki.
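
As a rough sketch of the submit step described above, the following writes a minimal job script and submits it with qsub when that command is available. The file name hello_job.sh and the job name hello are made up for illustration. The lines starting with "#$" use Grid Engine's standard mechanism for embedding qsub options in a script; check the local documentation for the options recommended on these clusters.

```shell
#!/bin/sh
# Write a minimal job script. Lines beginning with "#$" are read by
# Grid Engine as embedded qsub options.
job=hello_job.sh            # illustrative file name
cat > "$job" <<'EOF'
#!/bin/sh
#$ -cwd
#$ -N hello
#$ -j y
echo "Running on $(uname -n)"
EOF

# Submit only where qsub exists (that is, on a head node); elsewhere just say so.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job"
else
    echo "qsub not found: run this on a cluster head node"
fi
```

Here -cwd runs the job in the directory it was submitted from, -N sets the job name, and -j y merges standard error into standard output.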

Why DICE, why not a specialist beowulf OS?

Using DICE workstations means that users can develop applications on their own desktops.

Have you got any information on GPFS?

We have some user docs.

How do I submit large jobs to multiprocessor clusters (townhill and hermes)?

Multiprocessor machines share the system memory, and it is possible for several large jobs to use up all available system memory.

If your jobs require a large memory footprint then you need to explicitly request memory resources when you submit the job to gridengine. The clusters have been set up to track a number of resources, including the amount of free RAM (mem_free) and the amount of free virtual memory (RAM+swap). A job requiring 500M of free RAM should then be submitted with the -l option specifying mem_free=500M. For example, the following requests 2G of free RAM:

 qsub -l mem_free=2G  /opt/sge/examples/jobs/simple.sh

Jobs requiring very large memory footprints (where you could only run one job in the available system memory) should be submitted using the nodelock parallel environment. This will allocate all the slots on the node to your job and prevent other jobs running on it.
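
The two submission styles described above can be sketched as a pair of shell wrappers. These are illustrative only: the function names, the QSUB override variable and the file name myjob.sh are hypothetical, and the slot count of 4 passed to -pe nodelock is an assumption, so use whatever count the local gridengine documentation specifies for your cluster.

```shell
#!/bin/sh
# QSUB is a hypothetical override so the commands can be previewed without
# a cluster (QSUB=echo); when it is unset the real qsub is called.

# submit_with_mem SIZE SCRIPT [ARGS...]
# Request SIZE of free RAM (for example 500M or 2G) for the job.
submit_with_mem() {
    mem=$1
    shift
    ${QSUB:-qsub} -l "mem_free=$mem" "$@"
}

# submit_nodelocked SLOTS SCRIPT [ARGS...]
# Submit through the nodelock parallel environment. SLOTS is site-specific
# and is an assumption in this sketch.
submit_nodelocked() {
    slots=$1
    shift
    ${QSUB:-qsub} -pe nodelock "$slots" "$@"
}

# Preview what would be passed to qsub, without submitting anything.
QSUB=echo
submit_with_mem 500M myjob.sh      # prints: -l mem_free=500M myjob.sh
submit_nodelocked 4 myjob.sh       # prints: -pe nodelock 4 myjob.sh
```

Unset QSUB (or leave out the QSUB=echo line) on a head node to submit for real.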

What will happen if I don't use -l or -pe nodelock?

If the node runs out of memory then the kernel will throw an out of memory error and a system known as the OOM killer will come into play. Basically, the kernel will start killing off processes until the system stabilises itself. In theory, short-lived, resource-greedy processes will be killed first, but it's fairly usual for the system to be left in a more or less unusable state needing a reboot.



Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh