The simplest way to run millions of jobs on hundreds of compute nodes without overloading them.
Overview
Xeduler is a simple job scheduler. It is designed to:
- have no exotic dependencies
- be very small
- not waste CPU resources
- be completely comprehensible
- be simple to configure
- leverage existing user expertise for interfacing
- be freely available and publicly licensed
Xeduler schedules jobs simplistically (FIFO) and dispatches them to multiple resources. Because this is all it does, an administrator can deploy and completely understand the scheduling system with very little effort. Also, by focusing on such a limited scope, it preserves the user’s choice of tools for everything else, including the choice to use nothing else at all.
There are many things that Xeduler does not do. Xeduler does not do intelligent, NP-hard bin-packing scheduling; besides being very hard to do, it is very hard to make broadly applicable. Xeduler does not provide health monitoring services for resources. Why should it? Maybe you don’t need that. Maybe your resources are very exotic and you want a custom monitoring system. Whatever your monitoring requirements are, Xeduler makes it easy to integrate them. Xeduler does not do load balancing. Again, requirements for that could be so varied that extreme complexity would be necessary to cover all possibilities, but Xeduler is designed so that it would be quite simple to incorporate a load balancing system. Xeduler does not provide redundant systems to remotely execute jobs somewhere else. For that, use SSH or RSH or OpenMosix or whatever you like. Those facilities exist, work fine, and don’t need replacing.
Background
Among other things, I am in the business of helping users of Linux clusters get good results from them. While establishing infrastructure for a cluster, the issue of job scheduling came up. What resources are available to solve this problem?
Note: This reflects my thoughts on the topic from 2005. Maybe things are better now.
OpenPBS
This is what was being used at the time. Upon looking into it, I was thoroughly annoyed with OpenPBS’s apparent lack of openness. I have also been unimpressed with PBS’s usability from a user perspective. I found that I was writing scripts to interpret what commands like qstat were clumsily telling me and overcoming other such annoyances. When I tried to obtain a copy of the nominally open OpenPBS, I was challenged with a form and an incredibly persistent hard sell for their commercial product. As if that wasn’t unfriendly enough, I soon found that I was on the PBS mailing list and the password to download the software came a week later!
Sun Grid Engine
(Now known as Oracle Grid Engine.) This promising project seemed to have an endless list of dependencies. For example, it asked for a working installation of the Berkeley Database. I don’t really understand how to "get" this software. My package manager knows nothing about it. Searches point to Oracle which says "Berkeley DB is…easy to deploy…As a C library it can be installed and configured along with your application." So why did they make me hunt it down? (OpenLDAP doesn’t make such a hassle out of it.) I endeavored to get it installed and then the install scripts got confused about where my Java lived. That requirement was taken for granted as obvious. After some struggles trying to install Java, and sensing more such annoyances (like the special build system that maybe isn’t so special to Java people), I gave up. I was also completely intimidated by the look of the set up and configuration procedures, and the extent to which this package had to be installed on everything that would be affiliated with it.
Years later, SGE is the scheduling system I have used the most. Its interface is pretty much the same rotten morass as PBS. Its configuration is a complete pain. I’ll leave the topic with a reference to my SGE notes and this massive understatement from the SGE documentation:
"The Sun Grid Engine software provides administrators with a
tremendous amount of flexibility in configuring the cluster to
meet their needs. Unfortunately, with so much flexibility, it
sometimes can be challenging to know just what configuration
options are best for any given set of needs."
Condor
This was recommended to me and I had hopes that it would work out, but right away it had uncomfortable licensing issues and source code that was disturbingly difficult to find. It seems that Condor’s license has since been considerably improved, but there’s still an irritating form to fill out, just like with PBS, and that’s just not the way I like to do business. I started to install it with a script, and then again by hand following some web-based documentation (the two approaches being mutually exclusive). Ultimately I was asked for the fully qualified names of the various nodes. This surprised me since I was planning on using local-only IP addresses that wouldn’t be accessible from the Internet. I can’t believe I’m the only person with that kind of strategy in mind. So, not knowing how to let the system know that I didn’t have any resolvable machine names to give it, I gave up here too.
GNUBatch
Here’s one which I’ve only recently discovered. If there’s one style of software I’m comfortable with it’s GNU. I expect ./configure && make && make install and happy times. I tried to install GNUBatch and was thwarted. I’ll probably try again, but I was pretty disappointed by the hassle required to get going.
The Problem
These seem like OK products from an outsider’s perspective and if you can make use of them, great. For me, it was so much of an ordeal that I began to wonder if it might actually be easier to write my own dispatcher/batch scheduler. I use a lot of software and very, very rarely do I ever think, hmm, I should just reimplement this myself. First, there are usually dozens of possible solutions to a problem out there, and second, it’s usually a very serious undertaking to replicate the functionality of a substantial piece of software.
As it became clear what a substantial effort would be required just to use someone else’s scheduler, I started to wonder why it was so difficult. I feel there is possibly superfluous complexity in at least two areas. First, these packages do a lot more than I needed done. Second, because of the first reason, there is a lot of overhead in creating custom facilities (user interface commands, databases, and so on).
My concept was to simplify the problem and break apart components that were not necessarily related so that they could be treated as smaller problems. I decided to focus exclusively on the batch scheduling aspect for now. Issues like health monitoring, disaster recovery, and resource optimization are all interesting and possibly useful. On the other hand, they are possibly not. My system does none of that, but rather provides for subsequent software to do these things on an as-needed basis.
For me, the problem is that there are a "large" number of jobs that must be run on a collection of computation resources. The jobs must be dispatched in a way that does not overwhelm the resources. That’s it. An additional problem I wanted to solve was to eliminate a huge amount of the complexity of installation, deployment, and usage.
My Solution
Considering the problem to be just one of simply and sensibly dispatching jobs, I was able to take advantage of some creative tactics. I realized that every day I use a job dispatching program. This program is natively installed in practically all Unix operating systems, and beyond. This job dispatcher is Bash. I like programming Bash and it seems to do a good job at doing exactly what it was designed to do - dispatch jobs. So while that may seem crazy, it does seem effective.
Using Bash as the platform to create my scheduler produces many excellent and interesting advantages. First, everyone has Bash and if one doesn’t, it’s generally easy to get. Certainly most Linux cluster users would be in a very unusual situation to not have immediate access to it. Second, Bash is well-known and reasonably understood by people having anything to do with Linux clusters. My system tries to leverage the existing untapped power of a native Linux environment.
Another advantage of using Bash is that platform problems are no longer a problem. One can use Bash on 64-bit architectures, under Cygwin, even on a Zaurus - basically anywhere Bash runs. That’s a lot of flexibility.
Not only did I choose to use Bash, I also wanted to leverage the existing familiar facilities that a typical Unix system provides for dealing with various aspects of the system. Consequently there are not really any special commands that need to be learned in order to interact with the system. The scheduler is controlled by issuing normal commands that Unix users are already familiar with.
How It Is Used
The scheduling problem is fundamentally one of assigning jobs to resources. This implies a list or queue of jobs and a list of resources. Both of these lists are implemented using an ordinary file system. Any file system will do. If you’re particularly concerned about speed, the file system can be implemented in a RAM disk, which most Linux distributions provide (e.g. /dev/shm/).
The job queue is simply an ordinary directory which contains files which contain jobs. There can be one job per job file in the queue or there can be gigabytes of jobs per file in the queue, one job per line. To submit a job, simply move a job file into the job queue directory, or you can create it on the fly (echo /bin/jobtorun > JOB--a_sample_job). Job files begin with the JOB-- prefix; what follows that is up to you. The priority of the jobs is in the same order that the ls command will list them (this is locale dependent), which ensures predictable results. If a file contains multiple jobs, they are prioritized from first to last. To change a job file’s priority in the queue, simply rename it to something higher in the ls order (using the standard mv, generally).
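A hypothetical submission session might look like this (assuming the queue directory is ./QUEUE; the job file names and commands are invented for illustration):

$ echo '/usr/local/bin/simulate --run 7' > QUEUE/JOB--50-simrun   # submit one job
$ echo '/usr/local/bin/simulate --run 8' > QUEUE/JOB--60-simrun   # submit another
$ mv QUEUE/JOB--60-simrun QUEUE/JOB--40-simrun                    # promote it above JOB--50 in ls order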
The resource list is also simply a directory with meaningful files in it. The resource list directory contains one file per resource. In this case a resource could be a particular machine or a CPU on a machine or even just an arbitrary share on a machine if, for example, you want to keep a machine loaded with 11 jobs at all times for some reason.
You must have a QUEUE directory and a RESOURCES directory. These locations are set as variables (Qdir and Rdir) at the top of the program.
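For illustration, the top of the script might set them like this (the variable names come from the program; the paths here are made up):

Qdir=/dev/shm/xeduler/QUEUE      # directory where job files are deposited
Rdir=/dev/shm/xeduler/RESOURCES  # directory holding the resource state files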
When the RESOURCES directory is listed, it could look something like this:
$ ls -l
-rw-r--r-- 1 xed users 751 Dec 19 16:35 BUSY--10.0.0.44-spare_server
-rw-r--r-- 1 xed users 751 Dec 19 16:35 BUSY--raven02
-rw-r--r-- 1 xed users 0 Dec 17 17:57 DOWN--10.0.0.13-archive_server
-rw-r--r-- 1 xed users 0 Dec 17 17:57 DOWN--raven13
-rw-r--r-- 1 xed users 0 Dec 17 17:57 DOWN--robin13
-rw-r--r-- 1 xed users 751 Dec 19 16:35 FREE--raven01
-rw-r--r-- 1 xed users 751 Dec 19 16:35 FREE--raven04
-rw-r--r-- 1 xed users 771 Dec 19 16:35 FREE--raven05-cpu1
-rw-r--r-- 1 xed users 771 Dec 19 16:35 FREE--raven05-cpu2
-rw-r--r-- 1 xed users 751 Dec 19 16:34 FREE--robin01
-rw-r--r-- 1 xed users 751 Dec 19 16:34 FREE--robin02
-rw-r--r-- 1 xed users 0 Dec 19 16:34 STOP--raven02
This shows many machines or nodes in the resources list in different states. The first state is DOWN. This indicates that a resource is not to be used; perhaps there will be maintenance on that machine, or just selective avoidance. The next status is BUSY. This means that a job has been dispatched to this machine and Xeduler is waiting to hear back from it. The third state is FREE, and these are the machines to which jobs are sent as they arrive in the job queue. The FREE machines are chosen in the order they appear in the ls list.
The last state is STOP. In the sample listing, raven02 is BUSY and it also has a STOP entry. The STOP feature allows the user to pass along the intention that a resource be taken DOWN after it finishes being BUSY. Basically the user just creates a file that matches a BUSY resource but with STOP in place of the state (touch STOP--raven02), and when raven02 is finished with its job, it will go DOWN instead of taking more jobs. You could simply rename a FREE node to DOWN, but it’s best to use STOP since you may not know when a job might show up and the node go BUSY just after you renamed it. The chance of this happening is very small, but the STOP feature precludes this problem.
To restore a DOWN machine to FREE again, simply renaming it works fine. You can’t confuse the scheduler by adding resources, just by taking them away.
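For example, assuming the resource list lives in ./RESOURCES and using a node from the listing above:

$ mv RESOURCES/DOWN--raven13 RESOURCES/FREE--raven13   # raven13 accepts jobs again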
The STOP feature could be used by a separate program which monitors the health of the nodes. If a node stops working the way you want (it stops responding, its CPU usage goes very high, whatever), you can have the monitor program touch a STOP-- file.
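A minimal sketch of such an external monitor, assuming the resource directory is ./RESOURCES and that a failed ping is your retirement criterion (all of this is illustrative, not part of Xeduler):

#!/bin/bash
# Hypothetical health monitor - retires unreachable nodes via STOP files.
shopt -s nullglob
cd RESOURCES || exit 1
for f in FREE--* BUSY--*; do
    name=${f#*--}                 # e.g. raven05-cpu1
    node=${name%%-*}              # machine name part, e.g. raven05
    if ! ping -c1 -W2 "$node" >/dev/null 2>&1; then
        touch "STOP--$name"       # ask Xeduler to retire this resource
    fi
done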
Another clever time to use STOP is in a cron job. Perhaps the office staff’s desktop machines do nothing all night. One could set a cron job to do something like this:
30 19 * * 1,2,3,4,5 mv DOWN--10.0.0.124-conf_rm_mac FREE--10.0.0.124-conf_rm_mac
30 5 * * 1,2,3,4,5 touch STOP--10.0.0.124-conf_rm_mac
This should cause a node to be down between 05h30 and 19h30 on weekdays and available the rest of the time. (As written, these entries assume the resource directory is the working directory; in practice you would use full paths.) You can see that by leveraging the existing power of typical Unix systems you can get a lot of very nice features for essentially no extra complexity.
The format of the node filenames is somewhat important. The first part of the name must be one of the state codes (which can all be customized) such as DOWN--, STOP--, etc. Then comes the machine name that your remote shell program will use. Since listing order is important, you may want to prioritize resources by having special listings in the /etc/hosts file to look up. Finally, anything after and including a dash is considered user information only. Here you can say more descriptive things about the resource. So in the example, raven05 has 2 CPUs that should be loaded. The two entries of raven05 mean that two jobs will be sent to the name raven05. The -cpuX part is informational.
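Setting up resources is therefore just a matter of creating files. A small sketch, using names from the example listing and assuming a ./RESOURCES directory:

$ cd RESOURCES
$ touch FREE--raven05-cpu1 FREE--raven05-cpu2   # dual-CPU node: will receive two jobs
$ touch FREE--10.0.0.44-spare_server            # the machine name can be an IP address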
To stop Xeduler, you can kill it in the normal way, but this might be too abrupt if there are jobs running. If a file name containing the word quit (case-insensitive) is found in the resource directory, then no new jobs are dispatched and Xeduler stops. Currently running jobs continue on their own until they are complete.
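So a graceful shutdown is just (again assuming a ./RESOURCES directory):

$ touch RESOURCES/quit   # dispatch nothing new; running jobs finish on their own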
How It Works
Xeduler tackles the problem of what to send where, that’s it. By not meddling in the affairs of other machines, the complexity of the system is reduced to an absolute minimum. By being so focused in scope it is my hope that it will be possible to be very creative with this tool. Xeduler has no idea what a job is really doing. It doesn’t really care where it goes. Unlike other systems, it is not in constant contact with the machine’s running jobs. This has its limitations for sure, but it reduces the complexity enormously. Other scheduling systems work like a horse race - dedicated jockeys follow and guide the real workhorses the whole way; Xeduler is like racing pigeons - you let the jobs go and you hope they come back.
The huge advantage of the "pigeon" system, of course, is that you can utilize any machine you have access to without specifically preparing that machine to work as part of a cluster. It’s onerous enough to have to set up everything any scheduler requires; to repeat that feat for every node minimally doubles that effort. It also severely discourages heterogeneity. Xeduler, on the other hand, can take advantage of any resource you can log into without further configuration annoyance.
Every time Xeduler discovers a job that can and should be run on an available resource, it prepares a little wrapper script that should be able to accomplish that. This script is the actual contents of a BUSY file. You can read the contents of the BUSY files to understand what’s going on, but it’s probably unwise to modify them while the jobs are underway.
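Conceptually, a BUSY file might contain something along these lines. This is a sketch of the idea only, not the actual generated script; the remote command, $Rdir, and $xeduler_pid are stand-ins:

#!/bin/bash
# Conceptual dispatch wrapper - the real generated wrapper differs.
ssh raven01 '/bin/jobtorun'                      # run the job on the remote resource
mv "$Rdir/BUSY--raven01" "$Rdir/FREE--raven01"   # hand the resource back
kill -CONT "$xeduler_pid" 2>/dev/null            # wake the scheduler if it is sleeping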
When there are no jobs and there are resources available, Xeduler does poll to see if any jobs were deposited. This is relatively low impact and only incurred when there are resources available. When the system determines that all the resources are full, it sends itself a SIGSTOP and effectively goes to sleep. Why this odd strategy of falling asleep when things are at their busiest? First of all, the user may well be using the scheduling machine as a compute node too, and we don’t want to delay the computation with scheduling polling. What if more jobs are deposited while the scheduler is sleeping? It turns out it doesn’t matter, as there is nowhere for them to go anyway.
The trick is that the wrapper program which dispatched the job is, upon completion, set up to check whether Xeduler is sleeping. If so, it wakes it up. Now there is obviously a newly freed resource and the next job can be assigned. This means that the scheduler overhead is absolutely nothing during full-capacity operation and that newly freed resources are captured and utilized immediately.
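The whole sleep/wake handshake amounts to two signals (a sketch of the idea, not a quote from the source; $sched_pid is a stand-in for however the wrapper learns the scheduler’s PID):

kill -STOP $$             # in the scheduler: all resources BUSY, suspend myself
kill -CONT "$sched_pid"   # in a finishing wrapper: a resource just freed, resume the scheduler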
Because I was so bewildered by the other systems I tried to use, I intended to make Xeduler very easy to understand in its entirety. The source code that does the actual work is shockingly small, only 2951 bytes. The program as distributed adds 5kB of illustrative comments in the code with the intention that a normal administrator of a Linux cluster should be able to read and understand the entire source code in less time than it takes to read this page up to this point. You are encouraged to do that and empower yourself with the ability to utilize Xeduler’s capabilities to the limits of your skill and imagination.
Download
The complete source code should be easy to obtain. It is a single Bash text file suitable for reading.
Current Limitations
Xeduler was developed to solve a very specific problem which I felt was easier to solve from scratch than to install a more comprehensive package that did more than I needed. Consequently, there are plenty of missing things that would be nice to have.
Some of these missing things should be separate programs and perhaps I’ll get around to implementing them. The two big ones are health monitoring and load balancing. By creating separate facilities to handle both of these tasks, Xeduler can still be used for what it’s good at - job scheduling. The interface which these advanced features would use is as easy as it gets.
Another area that could be easily improved with separate software is fault recovery. If you send a job out and the node it’s running on dies, then what happens? As far as Xeduler is concerned, that’s not a scheduling problem strictly speaking. If the machine where Xeduler itself is running dies, then that is more of a scheduling problem and there are many possible tactics to deal with that. The fact that everything is naturally saved to a file system means that when the machine is rebooted, it would be fairly straightforward to reconstruct what was where and check up on that. Again, best left to some other program that specializes in such a thing.
Xeduler’s main limitation right now, which should be fixed, is one of permissions. I created this to be run by one user and that’s what it does. But it would be trivial to add the capability to have it run as root and dispatch jobs with an effective user ID of the owner of the job file. If that’s interesting to you, let me know and I’ll add it.
Another improvement which could be made to Xeduler itself is in configurability. There is already a ton of flexibility in what you can do, especially considering how easy it is to do simple things. However, if you had situations where certain jobs should be favored on certain machines, or some users can only use certain nodes, or others can only use the cluster at certain times, and so on, sorting that out would take a complex system. I feel it was correct to design and actually create the simple system first so that users with a simple problem can have a simple solution. Simplicity was the big objective here.
The final limitation that should be mentioned is performance. While Xeduler works well for many workloads and is quite clever and elegant, it does use Bash. This means that if you’re trying to dispatch thousands (or perhaps even dozens) of jobs per second, Bash could be the limiting factor that begins to cause trouble. Perhaps one day I’ll implement the same functionality in a C program, which should eliminate the performance issues.