Notes on my adventures with another beastly overblown job dispatching system.

Miscellaneous Ominous Quotes

"Because of the flexibility offered by the Sun Grid Engine software, administrators sometimes find themselves in need of some help getting started."

"Job scheduling with the Sun Grid Engine software is a very large topic. The Sun Grid Engine software provides a wide variety of scheduling policies with a great degree of flexibility. What is presented above only scratches the surface of what can be done with the Sun Grid Engine 6.2 software."

"The Sun Grid Engine software provides administrators with a tremendous amount of flexibility in configuring the cluster to meet their needs. Unfortunately, with so much flexibility, it sometimes can be challenging to know just what configuration options are best for any given set of needs."

Installing

Look for a file like this:

sge62u5_linux24-x64_rpm.zip

Not like this (this is ancient Itanium crap):

sge62u5_linux24-ia64_rpm.zip

Which makes a directory like this:

sge6_2u5

Containing these rpms:

sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm
sun-sge-common-6.2-5.noarch.rpm

Unfortunately, there are a couple of unmet dependencies.

    $ rpm -qpR sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm
        error: Failed dependencies:
                libXm.so.3()(64bit) is needed by sun-sge-bin-linux24-x64-6.2-5.x86_64
                sun-sge-common = 6.2 is needed by sun-sge-bin-linux24-x64-6.2-5.x86_64

The second one is easy. Install sun-sge-common first:

$ sudo rpm -ivh sun-sge-common-6.2-5.noarch.rpm

To fix the first dependency, libXm, you need to install this:

$ sudo yum install openmotif

Which installs these (among other things):

    /usr/lib64/libXm.so.4
    /usr/lib64/libXm.so.4.0.1
    /usr/lib/libXm.so.4
    /usr/lib/libXm.so.4.0.1

I went ahead and put links in for the dependency like this:

$ sudo ln -s /usr/lib64/libXm.so.4.0.1 /usr/lib64/libXm.so.3
$ sudo ln -s /usr/lib64/libXm.so.4.0.1 /lib64/libXm.so.3

Turns out that actually didn’t help with the rpm dependency. rpm resolves dependencies against its own database of what installed packages provide, not against the filesystem, so a hand-made symlink doesn’t satisfy it. But since the library really is there for runtime purposes, I just installed ignoring the problem:

$ sudo rpm -ivh --nodeps sun-sge-bin-linux24-x64-6.2-5.x86_64.rpm

And the binaries seem there and some of them even execute.
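
As a sanity check that the symlink does the job at runtime, qmon (the Motif GUI, which is the thing that actually wants libXm) should resolve it. Assuming the rpm put everything under /gridware/sge as it did here:

$ ldd /gridware/sge/bin/lx24-amd64/qmon | grep libXm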

The binaries, libraries, and man pages are all neatly not cluttering up your system in the SGE_ROOT directory. That’s nice, but not super useful. You could have everyone change their paths, but here’s a script that links everything to a better place and unlinks things if you decide you don’t want to do that:

#!/bin/bash
# Chris X Edwards
# A script to make symlinks for all the important things in the
# SGE_ROOT directory to where it's easily usable on the system.
# This elaborate script was made so that it's easy to *uninstall*
# everything in the event of something changing.

SGE_ROOT=/gridware/sge
MAN_ROOT=/usr/local/share/man
BIN_ROOT=/usr/local/bin
LIB_ROOT=/usr/local/lib

# LINK THE MAN PAGES
MEN="1 3 5 8"
for M in $MEN
do
    for S in ${SGE_ROOT}/man/man${M}/*
    do
        TDIR="${MAN_ROOT}/man${M}"
        # TO INSTALL
        ln -sf ${S} ${TDIR}/
        # TO UNINSTALL
        #echo "rm -f ${TDIR}/$(basename ${S})"
    done
done

# LINK THE BINARIES
for S in ${SGE_ROOT}/bin/lx24-amd64/*
do
    # TO INSTALL
    ln -sf ${S} ${BIN_ROOT}/
    # TO UNINSTALL
    #echo "rm -f ${BIN_ROOT}/$(basename ${S})"
done

# LINK THE LIBRARIES  (Maybe not necessary, but why not?)
for S in ${SGE_ROOT}/lib/lx24-amd64/*
do
    # TO INSTALL
    ln -sf ${S} ${LIB_ROOT}/
    # TO UNINSTALL
    #echo "rm -f ${LIB_ROOT}/$(basename ${S})"
done
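
To actually uninstall, flip the comments (comment out the ln -sf lines, uncomment the echo "rm -f ..." lines) and pipe the script's output through a shell. Something like this, where sge_symlinks.sh is whatever you saved the script as:

$ ./sge_symlinks.sh | sudo sh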

Configuring

First thing is to add a non-privileged user for SGE to fall back to while running (it starts as root):

$ sudo useradd -d $SGE_ROOT -c "Sun Grid Engine User" -u 222 -s /sbin/nologin sgeadmin

I used uid 222, which may or may not be a good idea. Next, do this on all compute nodes. This entails making a /gridware/sge directory first. I had a ton of trouble running this over ssh and eventually had to make a little one-line script in a shared directory and run it that way:

    :-> [blue.xed.ch][~]$ cat sgeuseraddscript
    #!/bin/bash
    /usr/sbin/useradd --home-dir /gridware/sge --comment "Sun Grid Engine User" --uid 222 --shell /sbin/nologin sgeadmin
    [root@blue ~]#  for X in `seq -w 1 48`; do echo $X; ssh blue$X ~xed/sgeuseraddscript ; done

I have no idea what I was doing wrong, but when I just put the command itself as the last arguments to ssh, it gave a useradd error (even with explicit directories and variables, good quoting, etc.). So I just kept a little script in a common place and kept modifying it to do each command. Now run this on each node to set up the NFS mount of the shared directory:

    echo "192.168.1.25:/gridware/sge /gridware/sge nfs async 0 0" >> /etc/fstab
    mount /gridware/sge
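
As mentioned, I just kept modifying the shared script for each of these per-node commands. A sketch of that round (nodemountscript is a hypothetical name for a file containing the three lines above):

    [root@blue ~]# for X in `seq -w 1 48`; do echo $X; ssh blue$X ~xed/nodemountscript ; done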

Looks like you should edit /etc/services to put in the standard SGE ports:

    [root@blue sge]# grep sge /etc/services
    sge_qmaster     536/tcp      # for Sun Grid Engine (SGE) qmaster daemon
    sge_execd       537/tcp      # for Sun Grid Engine (SGE) exec daemon

Interesting. Now it looks like entries for these services are there by default, on different ports. Make sure there’s no conflict between the old ports used and the new ones.

# grep sge_ /etc/services
sge_qmaster     6444/tcp                # Grid Engine Qmaster Service
sge_qmaster     6444/udp                # Grid Engine Qmaster Service
sge_execd       6445/tcp                # Grid Engine Execution Service
sge_execd       6445/udp                # Grid Engine Execution Service

I just edited it by hand and then copied it to the nodes.

[root@blue sge]#  for X in `seq -w 1 48`; do echo $X; scp /etc/services blue$X:/etc/ ; done

Here’s a better way probably:

sed -i 's~^opalis-rdv.*536/tcp.*$~sge_qmaster     536/tcp  # Sun Grid Engine~' /etc/services
sed -i 's~^nmsp.*537/tcp.*$~sge_execd       537/tcp  # Sun Grid Engine~' /etc/services

I made a copy of /gridware/sge/util/install_modules/inst_template.conf to my own dir and edited its contents. It looks a bit like:

    SGE_ROOT="/gridware/sge/"
    SGE_QMASTER_PORT="536"
    SGE_EXECD_PORT="537"
    SGE_ENABLE_SMF="false"
    SGE_CLUSTER_NAME="blue"
    SGE_JVM_LIB_PATH="Please enter absolute path of libjvm.so"
    SGE_ADDITIONAL_JVM_ARGS="-Xmx256m"
    CELL_NAME="default"
    ADMIN_USER="sgeadmin"
    QMASTER_SPOOL_DIR="/gridware/sge/default/spool/qmaster"
    EXECD_SPOOL_DIR="/gridware/sge/default/spool/execd"
    GID_RANGE="20000-21000"
    SPOOLING_METHOD="classic"
    DB_SPOOLING_SERVER="none"
    DB_SPOOLING_DIR="/gridware/sge/default/spooldb"
    PAR_EXECD_INST_COUNT="20"
    ADMIN_HOST_LIST="blue25"
    SUBMIT_HOST_LIST="blue25"
    EXEC_HOST_LIST="blue01 blue02 blue03 blue04 blue05 blue06 blue07 blue08 blue09 blue10 blue11 blue12 blue13 blue14 blue15 blue16 blue17 blue18 blue19 blue20 blue21 blue22 blue23 blue24 blue25 blue26 blue27 blue28 blue29 blue30 blue31 blue32 blue33 blue34 blue35 blue36 blue37 blue38 blue39 blue40 blue41 blue42 blue43 blue44 blue45 blue46 blue47 blue48"
    HOSTNAME_RESOLVING="true"
    SHELL_NAME="ssh"
    COPY_COMMAND="scp"
    DEFAULT_DOMAIN="none"
    ADMIN_MAIL="admin-ablab@ucsd.edu"
    ADD_TO_RC="false"
    SET_FILE_PERMS="true"
    RESCHEDULE_JOBS="wait"
    SCHEDD_CONF="1"
    WINDOWS_SUPPORT="false"

Set the SGE_ROOT:

[root@blue ~]# export SGE_ROOT=/gridware/sge/

Fix the ownership:

# chown -R  sgeadmin:sgeadmin /gridware

Now run the big install script:

./inst_sge -m -x -auto /home/xed/headnode_sge.conf

Hopefully that went well. If so, the good news should be reported in a new directory called:

/gridware/sge/default/common/install_logs

And you can check that the daemons are running with:

# ssh blue13 ps -ef | grep sge
sgeadmin 17623     1  0 21:27 ?        00:00:00 /gridware/sge//bin/lx24-amd64/sge_execd
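
To eyeball the whole cluster at once, a quick loop along the same lines (assuming pgrep is on the nodes):

    # for X in `seq -w 1 48`; do echo $X; ssh blue$X pgrep -fl sge_execd; done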

Looks like it’s smart to do this:

# /gridware/sge/bin/lx24-amd64/qconf -mconf global

And edit this line (to include bash, of course):

login_shells                 bash,sh,ksh,csh,tcsh
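
You can verify the change without re-opening the editor:

# /gridware/sge/bin/lx24-amd64/qconf -sconf global | grep login_shells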

Reconfiguring

What’s really weird about this is that this configuration file only gets read when the install script runs. Other than that, I have no idea how to modify the configuration of a functioning system. Let’s say you add some nodes or change the host name of the head node. How do you let the running system know this? So far, all I can figure out to do is to edit the config file and run the install again. It will complain that there is already a …./default directory. It seems to be safe to delete this directory if nothing is queued and then run the install script to create another one. It also might be a good idea to kill -9 any sge processes on the node you’re reinstalling on. Hard to believe there isn’t a better way, but I haven’t found it yet, and god forbid that it just use normal Linux/Unix conventions.
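
So the brute-force "reconfigure" described above looks roughly like this (a sketch using the paths and config file from earlier; only do this with an empty queue):

    # kill any lingering SGE daemons on the node being reinstalled
    pkill -9 -f sge_qmaster ; pkill -9 -f sge_execd
    # throw away the generated cell directory and regenerate it
    rm -rf /gridware/sge/default
    cd /gridware/sge && ./inst_sge -m -x -auto /home/xed/headnode_sge.conf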

Adding New Nodes

I’m not sure this works, but I’m trying this technique that I found somewhere:

    :-> [headnode.xed.ch][~]$ sudo SGE_ROOT=/gridware/sge qconf -ae
       <this opens an editor, visudo-style; replace "template" with the new
        host, e.g. s/template/c51/ where c51 is the new host>
    added host c51 to exec host list
    :-> [headnode.xed.ch][~]$ sudo SGE_ROOT=/gridware/sge qconf -ah c51
    c51 added to administrative host list
    :-> [headnode.xed.ch][~]$ sudo SGE_ROOT=/gridware/sge qconf -mq all.q
       <Here you're supposed to add the new hosts to the "hostlist" line (I added
        mine like "@allhosts, c50") and add the number of CPUs that each node has
        for SGE to the "slots" line, using the format [HOSTNAME=NUM_OF_CPUs].>

Does this work? Don’t know. Ok, here’s a better way: (Taken from http://ait.web.psi.ch/services/linux/hpc/merlin3/sge/admin/sge_hosts.html)

    # Make a config dir if it doesn't exist:
    :-> [headnode.xed.ch][~]$ sudo mkdir $SGE_ROOT/config
    # Add to trusted host list
    :-> [headnode.xed.ch][~]$ sudo -i qconf -ah c50
    c50 added to administrative host list
    # Add to list of hosts allowed to submit jobs:
    :-> [headnode.xed.ch][~]$ sudo -i qconf -as c50
    c50 added to submit host list

    :-< [headnode.xed.ch][~/testy]$ for H in `seq 50 59`; do H=c$H; echo $H; sudo cat <<XXX > $H.conf
    > hostname $H
    > load_scaling NONE
    > complex_values slots=4
    > user_lists NONE
    > xuser_lists NONE
    > projects NONE
    > xprojects NONE
    > usage_scaling NONE
    > report_variables NONE
    > XXX
    > done
    c50
    c51
    c52
    c53
    c54
    c55
    c56
    c57
    c58
    c59
    :-> [headnode.xed.ch][~/testy]$ sudo cp *conf /gridware/sge/config/


    # Add the execution host to the SGE cluster:
    :-> [headnode.xed.ch][~]$ sudo -i qconf -Ae $SGE_ROOT/config/c50.conf
    root@c25 added "c50" to exechost list
    # A rough check
    :-> [headnode.xed.ch][~]$ sudo -i qconf -sh | grep c50

    # Make sure these directories are available:
    :-> [headnode.xed.ch][~]$ for N in `seq 50 59`; do sudo -i mkdir $SGE_ROOT/default/spool/execd/c$N; done
    :-> [headnode.xed.ch][~]$ for N in `seq 50 59`; do sudo -i chown sgeadmin:sgeadmin $SGE_ROOT/default/spool/execd/c$N; done

    # Make sure that this command runs from the exec node:
    :-> [blue50][~]$ qconf -sh | grep c50
    c50
Tip
Any weird comm/hostname resolution stuff might be helped if the order is carefully set in the /etc/hosts file. Basically, make the hostname that SGE knows about come first on its line. Very annoying.

    # Prep the install configuration file:
    :-> [headnode.xed.ch][/gridware/sge/util/install_modules]$ sudo cp inst_template.conf inst_ab-lab.conf

IMPORTANT PART THAT ACTUALLY WORKED!

I don’t even know what or how much of the above stuff was necessary. I think this is still needed:

:-> [headnode.xed.ch][~]# for C in 64 65 67 68 69; do echo $C; qconf -ah c$C; qconf -as c$C; done

And then finally I was able to get it working by manually answering questions:

[headnode.xed.ch][/gridware/sge/util]$ ssh root@c51
[root@blue51 sge]# cd $SGE_ROOT; yes '' | ./inst_sge -x

Stupid but it works.
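
With more nodes, the same dance can at least be looped over ssh (a sketch, assuming root keys to the nodes are set up):

    :-> [headnode.xed.ch][~]$ for C in 64 65 67 68 69; do ssh root@c$C "cd /gridware/sge && yes '' | ./inst_sge -x"; done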

Administering

RESTARTING AFTER A POWER CYCLE

God forbid that Sun^H^H^HOracle actually make a decent init script that can be used sensibly on a Linux cluster.

Master Node

Double check that /gridware is being exported.

I have no idea how to get the service running so that it survives a reboot. Here’s what I did when it stopped working after the last unexpected power cycle (the path is spelled out because a $SGE_ROOT on the command line would be expanded by the calling shell, not by the SGE_ROOT= assignment):

$ sudo SGE_ROOT=/gridware/sge /gridware/sge/bin/lx24-amd64/sge_qmaster
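
The installer also normally generates rc-style scripts under $SGE_ROOT/default/common (sgemaster for the qmaster, sgeexecd for the nodes). If yours are there, starting via that is a bit tidier, and it is presumably what you would hook into the init system if you ever sorted that out:

$ sudo SGE_ROOT=/gridware/sge /gridware/sge/default/common/sgemaster start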

Compute Nodes

Double check that /gridware is mounted.

sudo /root/nodessh /etc/init.d/sgeexecd.c start

Or log in and start them individually. When you’re done, check with qhost and make sure there aren’t dashes in columns $4, $6, and $8. Dashes mean the host isn’t participating.

If you want to disable a compute node so that no jobs are sent to it do something like this:

[c][~]$ sudo SGE_ROOT=/gridware/sge/ qconf -de c25
root@c25 removed "c25" from execution host list

In this case c25 was my master node and when jobs were submitted to it, bad things happened.

Commands

The following commands are central to Sun Grid Engine administration:

qconf - Add, delete, and modify the current Grid Engine configuration. For more information, see the qconf(1) man page.

qhost - View current status of the available Grid Engine hosts, the queues, and the jobs associated with the queues. For more information, see the qhost(1) man page.

qalter and qsub - Submit jobs. For more information, see the submit(1) man page.

qstat - Show the status of Grid Engine jobs and queues. For more information, see the qstat(1) man page. Note that qstat doesn’t necessarily show all jobs. There was a case where some hung jobs were only visible if you specified the user:

qstat -u xed

To see all users:

qstat -u '*'

But not:

qstat -u'*'

Some other good things to know:

qstat -t -u '*'

This shows which execution nodes the job is running on.

qstat -s prsh -u $USER

Shows jobs that are pending, running, suspended, or holding. Can use any combination of these.

qstat -ext

Shows "extended" information. Shows the cpu time which could be interesting to find heavy cpu users.

How do you know if some nodes are not working? This is how (the "a" and "u" state flags mean alarm and unreachable/unknown, i.e. the qmaster has lost contact with that node’s execd):

qstat -f | grep au

Log in to them and do a sudo /etc/init.d/sgeexecd.c stop followed by a sudo /etc/init.d/sgeexecd.c start. A restart option would make too much sense!

Go figure.
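
A rough way to automate that round of restarts (this just pulls the host names out of the "au" lines; adjust for however you actually get root on the nodes):

    for H in $(qstat -f | grep au | cut -d@ -f2 | awk '{print $1}' | sort -u); do
        echo $H
        ssh $H sudo /etc/init.d/sgeexecd.c stop
        ssh $H sudo /etc/init.d/sgeexecd.c start
    done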

Another thing is queue error states which basically shut the node down. This can be investigated with something like:

qstat -f -explain E
qstat -j $JOBID -explain E
qacct -j $JOBID

These error states can be cleared with:

sudo SGE_ROOT=/gridware/sge qmod -c '*'

qdel - Note that for stuck jobs, this needs more firepower. Use the force option to kill them:

qdel -f JOBIDNUMBER

Or in the event of a complete clusterf… try this:

sudo SGE_ROOT=/gridware/sge/ qdel -f -u "*"

This really should kill all jobs.

qquota - List each resource quota that is being used at least once or that defines a static limit. For more information, see the qquota(1) man page.

Disable Nodes

If nodes are acting badly they can cause jobs to fail. While sorting out any problems it can be best to disable the nodes. This is done with:

sudo SGE_ROOT=/gridware/sge qmod -d all.q@c22

In this command the "all" (maybe "all.q") is the name of the queue. To re-enable the node use the same command but with the option -e.

Use qstat -f to see which nodes have the disabled state.

Running

The SGE_ROOT variable seems important. In our case it should be:

    # bash
    export SGE_ROOT=/gridware/sge
    # tcsh
    setenv SGE_ROOT /gridware/sge
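
The installer also generates settings files that set SGE_ROOT (plus PATH and MANPATH) for you; sourcing one of those, or wiring it into /etc/profile.d if that’s how you manage things, saves everyone the trouble:

    # bash users:
    . /gridware/sge/default/common/settings.sh
    # tcsh users:
    source /gridware/sge/default/common/settings.csh
    # or system-wide (assuming /etc/profile.d is honored on your machines):
    echo '. /gridware/sge/default/common/settings.sh' | sudo tee /etc/profile.d/sge.sh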

Basically I did this as a test:

$ for J in `seq 1 200`; do qsub testjob ; done

Where testjob is:

    #!/bin/bash
    #$ -S /bin/bash
    # Merge stderr into stdout and put the job output files under /tmp/xed:
    #$ -j y -o /tmp/xed
    FILE="/cfs/xed/queuetestdir/results"
    # Sleep a random 5 to 20 seconds (two random bytes, modulo the range):
    T=$(( 5+(`od -An -N2 -i /dev/urandom` )%(20-5+1) ))
    #T=1
    sleep $T
    D=`date`
    H=`hostname`
    I="Job input: $1"
    S="Job ran for $T seconds."
    #echo ${OUT} >> $FILE
    printf "=========================\n%s\n%s\n%s\n%s\n" "$D" "$H" "$I" "$S" >> $FILE

Here’s a job that MN uses:

    cd /home/mn/tmp/DUD_decoys/T1_2A_E_Rot/cox2
    /pro/icm/icm/icmng /pro/icm/icm/_dockScan \
        from=1 to=100 -E confs=3 thorough=1 vlsDUD \
        >& d3_2001_LOG &
    wait

MPI

This is another can of worms. Consider these nice man page passages:

    man sge_pe
    pe_name
        The name of the parallel environment as defined for pe_name in
        sge_types(1).  To be used in the qsub(1) -pe switch.

    man sge_types
    pe_name
        A "pe_name" is the name of a Sun Grid Engine parallel environment
        described in sge_pe(5).

What a nightmare. The best information about this seems to be this much more reasonable website.
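
For the record, the actual sequence is short once you untangle the circular definitions. A minimal sketch (the PE name mpi and the slot counts are just examples):

    # define a parallel environment (opens an editor, like the other qconf -a* commands)
    qconf -ap mpi
    # attach it to a queue by adding "mpi" to the pe_list line
    qconf -mq all.q
    # submit a job asking for 8 slots in that PE
    qsub -pe mpi 8 myjob.sh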

Avoiding SGE

The Network Queueing System (NQS) seems to be an ancient GPL queueing system written for/with NASA. No idea how useful/useless it is today.

GNUBatch looks interesting.