Sun Grid Engine Notes

Queue management software for HPC

These are more for myself than anything else, but someone might find them useful, as it's difficult to find SGE documentation amid all the LSF material. For example, qsub exists in both, but takes different parameters.

Initial Configuration

With a default installation of SGE, here are the changes made for the cluster in IMAPS;

Fair-share functional policy

This may not be the best, however it is better than FIFO.

Activate the functional share policy by specifying the number of functional share tickets. The value is arbitrary in that any value greater than zero will trigger it, but it needs to be large enough that the tickets can be shared out evenly among users. For example, if you have 10 users and each has 100 tickets, it needs to be at least 1000. So, as root (or SGE admin), edit the config file;

[root@head ~]# qconf -msconf

and set the following;

weight_tickets_functional 1000000

Now we need to assign users some tickets. To do this, edit a different config;

[root@head ~]# qconf -mconf

and set the following;

enforce_user auto
auto_user_fshare 100

This gives each user 100 tickets.

NOTE: When I did this I played around a lot to see how it behaved. I found that if a user submits a job through SGE while the 100-ticket assignment is not in place (e.g. if you remove it to see what happens), they never get the tickets assigned after you turn it back on. So you need to add the shares for that user manually (there may be a better way).

[root@head ~]# qconf -muser username

and that will give you something like;

name user
oticket 0
fshare 100
delete_time 1358000970
default_project NONE
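If several users got caught by this, editing each one by hand gets tedious. Here is a minimal sketch of how it could be scripted, assuming that qconf -suserl lists the known users, qconf -suser prints a user object in the format shown above, and qconf -Muser loads a modified object back from a file. The function name set_fshare_all is my own invention, and I'd try it on one user first.

```shell
# Sketch: give every user SGE already knows about the same functional share.
# set_fshare_all is a hypothetical helper, not part of SGE.
set_fshare_all() {
    shares=$1
    tmp=$(mktemp) || return 1
    for user in $(qconf -suserl); do
        # Dump the user object, rewrite only the fshare line, load it back.
        qconf -suser "$user" | sed "s/^fshare[[:space:]].*/fshare $shares/" > "$tmp"
        qconf -Muser "$tmp"
    done
    rm -f "$tmp"
}

# Usage (as root or SGE admin):
#   set_fshare_all 100
```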

Setting up memory allocation

This is simple in practice, but there are a couple of issues to be aware of. I did this using the popular h_vmem method, though I may change this at some point. The reason is that it treats h_vmem as both the amount of memory reserved for the job and the hard limit on it, and those may not be the same. For example, if a job initialises for a few minutes, peaks at 4 GB, but then only uses 2 GB for the next two weeks, reserving the peak for the whole run wastes resources on an eight-core machine with 16 GB of RAM (i.e. the nodes I have on the IMAPS machine). For now it stays this way until I can give it some serious thought.

First you need to make sure that h_vmem is a consumable resource;

[root@head ~]# qconf -mc
#name    shortcut  type    relop  requestable  consumable  default  urgency
#----------------------------------------------------------------------------------------
..
h_vmem   h_vmem    MEMORY  <=     YES          YES         0        0

Now that you've done that, you need to add this resource as a complex to each node like so. (clearly you can write a script to do it to every node).

[root@head ~]# qconf -me node001
hostname node001
load_scaling NONE
complex_values h_vmem=16G
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE

As you can see, I've added h_vmem=16G to node001. This is the amount of consumable memory that can be allocated.
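For the "write a script" route, here is a rough sketch. It assumes qconf -sel lists the execution hosts and that qconf -mattr exechost complex_values sets the value without opening an editor. The function name set_h_vmem_all is hypothetical, and it applies the same value to every host, so it only suits a cluster with identical nodes.

```shell
# Sketch: set the same consumable h_vmem on every execution host.
# set_h_vmem_all is a hypothetical helper, not part of SGE.
set_h_vmem_all() {
    mem=$1
    for host in $(qconf -sel); do
        # Non-interactive equivalent of editing complex_values in qconf -me.
        qconf -mattr exechost complex_values "h_vmem=$mem" "$host"
    done
}

# Usage (as root or SGE admin):
#   set_h_vmem_all 16G
```

If the head node is also registered as an execution host it will be included too, so check the output of qconf -sel before running this.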

WARNING: Once you do this, h_vmem HAS to be set on all jobs, otherwise they will fail. To combat this for the forgetful, let's add a default value by editing the sge_request file. So locate and open the file in your editor;

[root@head ~]# which qsub
/cvos/shared/apps/sge/6.1/bin/lx26-amd64/qsub
[root@head ~]# vi /cvos/shared/apps/sge/6.1/default/common/sge_request

Add to the bottom the following;

# default memory limit
-l h_vmem=2G

And this will now give a 2G limit to every job unless otherwise stated.

Issue with IDL and h_vmem

There is an issue with this method: for some reason IDL won't start, even when you request a very large amount of memory. This is all to do with the h_stack limit. To stop this being an issue, add the following lines to the sge_request file;

# default stack size (otherwise IDL and Matlab fail to start)
-l h_stack=128m

Clearly, if the stack size is not enough for some programs, users can request a larger stack using the -l h_stack flag. I did have one user running some Perl code that needed 512 MB of stack space.
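For users who need more than the defaults, overriding them at submission time looks something like the sketch below. The wrapper name submit_with_limits and the job script name are made up for illustration; the only SGE pieces it relies on are qsub and the -l h_vmem / -l h_stack requests already discussed.

```shell
# Sketch: submit a job with explicit memory and stack limits,
# overriding the defaults from sge_request.
# submit_with_limits is a hypothetical wrapper, not part of SGE.
submit_with_limits() {
    vmem=$1
    stack=$2
    shift 2
    # Anything left in "$@" is passed straight through to qsub.
    qsub -l "h_vmem=$vmem" -l "h_stack=$stack" "$@"
}

# Usage: 4G of consumable memory and a 512m stack for a (hypothetical) job script
#   submit_with_limits 4G 512m myjob.sh
```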

Allowing reservations

qconf -msconf

Change max_reservation from 0 to a number; in this case I've chosen 32.

Change default_duration from INFINITY to a finite but very long duration.
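Note that turning this on in the scheduler does not by itself make jobs reserve anything; as far as I know a job still has to ask for a reservation with qsub's -R y option. A tiny sketch, with submit_with_reservation being a made-up wrapper name and the job script hypothetical:

```shell
# Sketch: submit a job that asks the scheduler to reserve resources for it.
# submit_with_reservation is a hypothetical wrapper, not part of SGE.
submit_with_reservation() {
    # -R y requests a reservation; all other arguments go straight to qsub.
    qsub -R y "$@"
}

# Usage:
#   submit_with_reservation -l h_vmem=8G bigjob.sh
```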

Finding that a node isn't accepting jobs

If you discover that a node isn't accepting jobs, here is what it may be;

qstat -f    # gives full status of nodes

If you see;

[root@headnode]# qstat -f
queuename         qtype  resv/used/tot.  load_avg  arch        states
----------------------------------------------------------------
queue.q@node01    BIP    0/9/32          12.78     lx26-amd64
queue.q@node02    BIP    0/0/32          0.0       lx26-amd64  d

Then we know that node02 is disabled. To bring it back on we basically re-enable a queue, which goes through and enables all nodes in that queue;

[root@headnode]# qmod -e queue.q
Queue queue "queue-node01.q" has been enabled by root@headnode
root - queue "queue-node02.q" is already enabled

You can also of course disable a queue;

[root@headnode]# qmod -d queue.q

NB: This needs to be run on your master node as root.
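On a bigger cluster it is easier to pick the disabled nodes out with a script than by eye. A minimal sketch, assuming the qstat -f layout shown above, where the state letters sit in the sixth column and the queue instance name contains an @. The function name disabled_queue_instances is hypothetical.

```shell
# Sketch: print the queue instances that qstat -f reports as disabled.
# disabled_queue_instances is a hypothetical helper, not part of SGE.
disabled_queue_instances() {
    # Queue instance lines look like:
    #   queue.q@node02 BIP 0/0/32 0.0 lx26-amd64 d
    # The states column is only present when a state is set.
    qstat -f | awk '$1 ~ /@/ && NF >= 6 && $6 ~ /d/ { print $1 }'
}

# Usage: re-enable only the instances that are actually disabled
#   for qi in $(disabled_queue_instances); do qmod -e "$qi"; done
```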

Changing nodes in a queue

OK, this is simple. Just type;

qconf -mq queue.q

This will bring up a vi-like editor. Then change the second line;

hostlist node05.cluster node06.cluster node07.cluster

...to whatever nodes you wish to have on it.

Jobs pending

So you notice that there are lots of jobs pending in the queue. To check why a job isn't running, type;

qstat -j JOBNUMBER

And it will tell you why.

If you see lots of;

queue "all.q" dropped because it is temporarily not available
queue "all.q" dropped because it is temporarily not available
queue "all.q" dropped because it is temporarily not available

Then type:

qstat -f

This will tell you the status of the nodes. If any are in the E (error) state it will also tell you why, usually that a job caused the queue to stop, e.g.

queue queue.q marked QERROR as result of job 123456's failure at host at node001

You can move them out of error by typing;

qmod -cq all.q
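When a job has pages of identical "dropped because" lines, it helps to collapse them into counts. A rough sketch, assuming the reasons come from the scheduling info that qstat -j prints for a pending job; the function name pending_reasons is my own.

```shell
# Sketch: summarise why a pending job is not being scheduled.
# pending_reasons is a hypothetical helper, not part of SGE.
pending_reasons() {
    # Keep only the "dropped because" lines, strip leading whitespace,
    # then count identical reasons, most frequent first.
    qstat -j "$1" \
        | grep 'dropped because' \
        | sed 's/^[[:space:]]*//' \
        | sort | uniq -c | sort -rn
}

# Usage:
#   pending_reasons JOBNUMBER
```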

Moving a job to another queue

If you would like to move a job from one queue to another, you can do this with qalter;

qalter -q all.q 173143

Where all.q is the queue you wish to move it to, and 173143 is the job-id number.

Getting some stats

Basically, I wanted to measure the performance of various job submissions with different thread counts. Each run produced an output and error file in the form file_threadnumber.o12345, where 12345 is the job number and threadnumber is the number of threads the program used. This rather large one-liner does the following;

Lists the *.o* files
Gets the job number
Gets the job details and searches for vmem, start and end times
Removes the line return so that everything remains on one line
Sorts the whole lot by thread number
Splits up the times into seconds, calculates the time taken and prints everything.

ls -1 *.o* | awk '{split($1,a,"_"); split(a[3],b,"."); split(b[2],c,"o"); printf "Threads "b[1]" "; system("qacct -j "c[2]" | grep \"vmem\\|start_time\\|end_time\" | tr -d \"\\n\""); print ""}' | sort -nk 2 | awk '{split($7,a,":"); split($12,b,":"); print $1" "$2" "(b[1]*60*60+b[2]*60+b[3])-(a[1]*60*60+a[2]*60+a[3])" "$14}'

Probably a bit long winded, but who doesn't like a good one-liner...
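The time arithmetic at the end of the one-liner is the least obvious part, so here it is pulled out as a sketch on its own. The function names are made up. As far as I can tell the one-liner only looks at the HH:MM:SS part of the start and end times, and this sketch does the same, so a job that runs past midnight would give a wrong (negative) figure.

```shell
# Sketch: the HH:MM:SS arithmetic from the one-liner, as two small helpers.
# hms_to_seconds and elapsed_seconds are hypothetical names.
hms_to_seconds() {
    # awk treats "08" as the number 8, which avoids octal surprises
    # from shell arithmetic on zero-padded fields.
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

elapsed_seconds() {
    # $1 = start time, $2 = end time, both HH:MM:SS on the same day
    start=$(hms_to_seconds "$1")
    end=$(hms_to_seconds "$2")
    echo $(( end - start ))
}

# Usage:
#   elapsed_seconds 09:08:07 10:08:07
```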

Martin Vickers

Researcher and former SysAdmin