The simplest way to run millions of jobs on hundreds of compute nodes without overloading them.
Overview
Xeduler is a simple job scheduler. It is designed to:
- have no exotic dependencies
- be very small
- not waste CPU resources
- be completely comprehensible
- be simple to configure
- leverage existing user expertise for interfacing
- be freely available and publicly licensed
Xeduler schedules jobs simplistically (FIFO) and dispatches them to multiple resources. Because this is all it does, an administrator can deploy and completely understand the scheduling system with very little effort. Also, by focusing on such a limited scope, it preserves the user’s choice of tools for everything else, including the choice to use nothing else at all.
There are many things that Xeduler does not do. Xeduler does not do intelligent, NP-hard bin-packing scheduling; besides being very hard to do, it is very hard to make broadly applicable. Xeduler does not provide health monitoring services for resources. Why should it? Maybe you don’t need that. Maybe your resources are very exotic and you want a custom monitoring system. Whatever your monitoring requirements are, Xeduler makes it easy to integrate them. Xeduler does not do load balancing. Again, requirements for that could be so varied that extreme complexity would be necessary to cover all possibilities, but Xeduler is designed so that it would be quite simple to incorporate a load balancing system. Xeduler does not provide redundant systems to remotely execute jobs somewhere else. For that, use SSH or RSH or OpenMosix or whatever you like. Those facilities exist, work fine, and don’t need replacing.
Background
Among other things, I am in the business of helping users of Linux clusters get good results from them. While establishing infrastructure for a cluster, the issue of job scheduling came up. What resources are available to solve this problem?
Note: This reflects my thoughts on the topic from 2005. Maybe things are better now.
OpenPBS
This is what was being used at the time. Upon looking into it, I was thoroughly annoyed with OpenPBS’s apparent lack of openness. I have also been unimpressed with PBS’s usability from a user perspective. I found that I was writing scripts to interpret what commands like qstat were clumsily telling me and overcoming other such annoyances. When I tried to obtain a copy of the nominally open OpenPBS, I was challenged with a form and an incredibly persistent hard sell for their commercial product. As if that wasn’t unfriendly enough, I soon found that I was on the PBS mailing list and the password to download the software came a week later!
Sun Grid Engine
(Now known as Oracle Grid Engine.) This promising project seemed to have an endless list of dependencies. For example, it asked for a working installation of the Berkeley Database. I don’t really understand how to "get" this software. My package manager knows nothing about it. Searches point to Oracle which says "Berkeley DB is…easy to deploy…As a C library it can be installed and configured along with your application." So why did they make me hunt it down? (OpenLDAP doesn’t make such a hassle out of it.) I endeavored to get it installed and then the install scripts got confused about where my Java lived. That requirement was taken for granted as obvious. After some struggles trying to install Java, and sensing more such annoyances (like the special build system that maybe isn’t so special to Java people), I gave up. I was also completely intimidated by the look of the set up and configuration procedures, and the extent to which this package had to be installed on everything that would be affiliated with it.
Years later, SGE is the scheduling system I have used the most. Its interface is pretty much the same rotten morass as PBS. Its configuration is a complete pain. I’ll leave the topic with a reference to my SGE notes and this massive understatement from the SGE documentation:
"The Sun Grid Engine software provides administrators with a
tremendous amount of flexibility in configuring the cluster to
meet their needs. Unfortunately, with so much flexibility, it
sometimes can be challenging to know just what configuration
options are best for any given set of needs."
Condor
This was recommended to me and I had hopes that it would work out, but right away it had uncomfortable licensing issues and source code that was disturbingly difficult to find. It seems that Condor’s license has since been considerably improved, but there’s still an irritating form to fill out, just like with PBS, and that’s just not the way I like to do business. I started to install it with a script, and then again by hand following some web-based documentation (the two approaches being mutually exclusive). Ultimately I was asked for the fully qualified names of the various nodes. This surprised me since I was planning on using local-only IP addresses that wouldn’t be accessible from the Internet. I can’t believe I’m the only person with that kind of strategy in mind. So, not knowing how to let the system know that I didn’t have any resolvable machine names to give it, I gave up here too.
GNUBatch
Here’s one which I’ve only recently discovered. If there’s one style of software I’m comfortable with it’s GNU. I expect ./configure && make && make install and happy times. I tried to install GNUBatch and was thwarted. I’ll probably try again, but I was pretty disappointed by the hassle required to get going.
The Problem
These seem like OK products from an outsider’s perspective and if you can make use of them, great. For me, it was so much of an ordeal that I began to wonder if it might actually be easier to write my own dispatcher/batch scheduler. I use a lot of software and very, very rarely do I ever think, hmm, I should just reimplement this myself. First, there are usually dozens of possible solutions to a problem out there, and second, it’s usually a very serious undertaking to replicate the functionality of a substantial piece of software.
As it became clear what a substantial effort would be required just to use someone else’s scheduler, I started to wonder why it was so difficult. I feel there is possibly superfluous complexity in at least two areas. First, these packages do a lot more than I needed done. Second, because of the first reason, there is a lot of overhead in creating custom facilities (user interface commands, databases, and so on).
My concept was to simplify the problem and break apart components that were not necessarily related so that they could be treated as smaller problems. I decided to focus exclusively on the batch scheduling aspect for now. Issues like health monitoring, disaster recovery, and resource optimization are all interesting and possibly useful. On the other hand, they are possibly not. My system does none of that, but rather provides for subsequent software to do these things on an as-needed basis.
For me, the problem is that there are a "large" number of jobs that must be run on a collection of computation resources. The jobs must be dispatched in a way that does not overwhelm the resources. That’s it. An additional problem I wanted to solve was to eliminate a huge amount of the complexity of installation, deployment, and usage.
My Solution
Considering the problem to be just one of simply and sensibly dispatching jobs, I was able to take advantage of some creative tactics. I realized that every day I use a job dispatching program. This program is natively installed in practically all Unix operating systems, and beyond. This job dispatcher is Bash. I like programming Bash and it seems to do a good job at doing exactly what it was designed to do - dispatch jobs. So while that may seem crazy, it does seem effective.
Using Bash as the platform to create my scheduler produces many excellent and interesting advantages. First, everyone has Bash and if one doesn’t, it’s generally easy to get. Certainly most Linux cluster users would be in a very unusual situation to not have immediate access to it. Second, Bash is well-known and reasonably understood by people having anything to do with Linux clusters. My system tries to leverage the existing untapped power of a native Linux environment.
Another advantage of using Bash is that platform problems are no longer a problem. One can use Bash on 64-bit architectures, under Cygwin, even on a Zaurus - basically anywhere Bash runs. That’s a lot of flexibility.
Not only did I choose to use Bash, I also wanted to leverage the existing familiar facilities that a typical Unix system provides for dealing with various aspects of the system. Consequently there are not really any special commands that need to be learned in order to interact with the system. The scheduler is controlled by issuing normal commands that Unix users are already familiar with.
How It Is Used
The scheduling problem is fundamentally one of assigning jobs to resources. This implies a list or queue of jobs and a list of resources. Both of these lists are implemented using an ordinary file system. Any file system will do. If you’re particularly concerned about speed, the file system can be implemented in a RAM disk, which most Linux distributions provide (e.g. /dev/shm/).
The job queue is simply an ordinary directory which contains files which contain jobs. There can be one job per job file in the queue or there can be gigabytes of jobs per file in the queue, one job per line. To submit a job, simply move a job file into the job queue directory, or you can create it on the fly (echo /bin/jobtorun > JOB--a_sample_job). Job files begin with the JOB-- prefix; what follows that is up to you. The priority of the jobs is in the same order that the ls command will list them (this is locale dependent), which ensures predictable results. If a file contains multiple jobs, they are prioritized from first to last. To change a job file’s priority in the queue, simply rename it to something higher in the ls order (using the standard mv, generally).
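A hypothetical submission session might look like this (assuming the queue directory is ./QUEUE; the job file names and commands are invented for illustration):

$ echo '/usr/local/bin/simulate --run 7' > QUEUE/JOB--50-simrun   # submit one job
$ echo '/usr/local/bin/simulate --run 8' > QUEUE/JOB--60-simrun   # submit another
$ mv QUEUE/JOB--60-simrun QUEUE/JOB--40-simrun                    # promote it above JOB--50 in ls order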
The resource list is also simply a directory with meaningful files in it. The resource list directory contains one file per resource. In this case a resource could be a particular machine or a CPU on a machine or even just an arbitrary share on a machine if, for example, you want to keep a machine loaded with 11 jobs at all times for some reason.
You must have a QUEUE directory and a RESOURCES directory. These locations are set as variables (Qdir and Rdir) at the top of the program.
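For illustration, the top of the script might set them like this (the variable names come from the program; the paths here are made up):

Qdir=/dev/shm/xeduler/QUEUE      # directory where job files are deposited
Rdir=/dev/shm/xeduler/RESOURCES  # directory holding the resource state files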
When the RESOURCES directory is listed, it could look something like this:
$ ls -l
-rw-r--r-- 1 xed users 751 Dec 19 16:35 BUSY--10.0.0.44-spare_server
-rw-r--r-- 1 xed users 751 Dec 19 16:35 BUSY--raven02
-rw-r--r-- 1 xed users 0 Dec 17 17:57 DOWN--10.0.0.13-archive_server
-rw-r--r-- 1 xed users 0 Dec 17 17:57 DOWN--raven13
-rw-r--r-- 1 xed users 0 Dec 17 17:57 DOWN--robin13
-rw-r--r-- 1 xed users 751 Dec 19 16:35 FREE--raven01
-rw-r--r-- 1 xed users 751 Dec 19 16:35 FREE--raven04
-rw-r--r-- 1 xed users 771 Dec 19 16:35 FREE--raven05-cpu1
-rw-r--r-- 1 xed users 771 Dec 19 16:35 FREE--raven05-cpu2
-rw-r--r-- 1 xed users 751 Dec 19 16:34 FREE--robin01
-rw-r--r-- 1 xed users 751 Dec 19 16:34 FREE--robin02
-rw-r--r-- 1 xed users 0 Dec 19 16:34 STOP--raven02
This shows many machines or nodes in the resources list in different states. The first state is DOWN. This indicates that a resource is not to be used; perhaps there will be maintenance on that machine, or just selective avoidance. The next status is BUSY. This means that a job has been dispatched to this machine and Xeduler is waiting to hear back from it. The third state is FREE, and these are the machines to which jobs are sent as they arrive in the job queue. The FREE machines are chosen in the order they appear in the ls list.
The last state is STOP. In the sample listing, raven02 is BUSY and it also has a STOP entry. The STOP feature allows the user to pass along the intention that a resource be taken DOWN after it finishes being BUSY. Basically the user just creates a file that matches a BUSY resource but with STOP in place of the state (touch STOP--raven02), and when raven02 is finished with its job, it will go DOWN instead of taking more jobs. You could simply rename a FREE node to DOWN, but it’s best to use STOP since you may not know when a job might show up and the node go BUSY just after you renamed it. The chance of this happening is very small, but the STOP feature precludes this problem.
To restore a DOWN machine to FREE again, simply renaming it works fine. You can’t confuse the scheduler by adding resources, just by taking them away.
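For example, assuming the resource list lives in ./RESOURCES and using a node from the listing above:

$ mv RESOURCES/DOWN--raven13 RESOURCES/FREE--raven13   # raven13 accepts jobs again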
The STOP feature could be used by a separate program which monitors the health of the nodes. If a node stops working the way you want (it stops responding, its CPU usage goes very high, whatever), you can have the monitor program touch a STOP-- file.
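A minimal sketch of such an external monitor, assuming the resource directory is ./RESOURCES and that a failed ping is your retirement criterion (all of this is illustrative, not part of Xeduler):

#!/bin/bash
# Hypothetical health monitor - retires unreachable nodes via STOP files.
shopt -s nullglob
cd RESOURCES || exit 1
for f in FREE--* BUSY--*; do
    name=${f#*--}                 # e.g. raven05-cpu1
    node=${name%%-*}              # machine name part, e.g. raven05
    if ! ping -c1 -W2 "$node" >/dev/null 2>&1; then
        touch "STOP--$name"       # ask Xeduler to retire this resource
    fi
done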
Another clever time to use STOP is in a cron job. Perhaps the office staff’s desktop machines do nothing all night. One could set a cron job to do something like this:
30 19 * * 1,2,3,4,5 mv DOWN--10.0.0.124-conf_rm_mac FREE--10.0.0.124-conf_rm_mac
30 5 * * 1,2,3,4,5 touch STOP--10.0.0.124-conf_rm_mac
This should cause a node to be down between 05h30 and 19h30 on weekdays and available the rest of the time. (As written, these entries assume the resource directory is the working directory; in practice you would use full paths.) You can see that by leveraging the existing power of typical Unix systems you can get a lot of very nice features for essentially no extra complexity.
The format of the node filenames is somewhat important. The first part of the name must be one of the state codes (which can all be customized) such as DOWN--, STOP--, etc. Then comes the machine name that your remote shell program will use. Since listing order is important, you may want to prioritize resources by having special listings in the /etc/hosts file to look up. Finally, anything after and including a dash is considered user information only. Here you can say more descriptive things about the resource. So in the example, raven05 has 2 CPUs that should be loaded. The two entries of raven05 mean that two jobs will be sent to the name raven05. The -cpuX part is informational.
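Setting up resources is therefore just a matter of creating files. A small sketch, using names from the example listing and assuming a ./RESOURCES directory:

$ cd RESOURCES
$ touch FREE--raven05-cpu1 FREE--raven05-cpu2   # dual-CPU node: will receive two jobs
$ touch FREE--10.0.0.44-spare_server            # the machine name can be an IP address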
To stop Xeduler, you can kill it in the normal way, but this might be too abrupt if there are jobs running. If a file name containing the word quit (case-insensitive) is found in the resource directory, then no new jobs are dispatched and Xeduler stops. Currently running jobs continue on their own until they are complete.
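So a graceful shutdown is just (again assuming a ./RESOURCES directory):

$ touch RESOURCES/quit   # dispatch nothing new; running jobs finish on their own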
How It Works
Xeduler tackles the problem of what to send where, that’s it. By not meddling in the affairs of other machines, the complexity of the system is reduced to an absolute minimum. By being so focused in scope it is my hope that it will be possible to be very creative with this tool. Xeduler has no idea what a job is really doing. It doesn’t really care where it goes. Unlike other systems, it is not in constant contact with the machine’s running jobs. This has its limitations for sure, but it reduces the complexity enormously. Other scheduling systems work like a horse race - dedicated jockeys follow and guide the real workhorses the whole way; Xeduler is like racing pigeons - you let the jobs go and you hope they come back.
The huge advantage of the "pigeon" system, of course, is that you can utilize any machine you have access to without specifically preparing that machine to work as part of a cluster. It’s onerous enough to have to set up everything any scheduler requires; to repeat that feat for every node minimally doubles that effort. It also severely discourages heterogeneity. Xeduler, on the other hand, can take advantage of any resource you can log into without further configuration annoyance.
Every time Xeduler discovers a job that can and should be run on an available resource, it prepares a little wrapper script that should be able to accomplish that. This script is the actual contents of a BUSY file. You can read the contents of the BUSY files to understand what’s going on, but it’s probably unwise to modify them while the jobs are underway.
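Conceptually, a BUSY file might contain something along these lines. This is a sketch of the idea only, not the actual generated script; the remote command, $Rdir, and $xeduler_pid are stand-ins:

#!/bin/bash
# Conceptual dispatch wrapper - the real generated wrapper differs.
ssh raven01 '/bin/jobtorun'                      # run the job on the remote resource
mv "$Rdir/BUSY--raven01" "$Rdir/FREE--raven01"   # hand the resource back
kill -CONT "$xeduler_pid" 2>/dev/null            # wake the scheduler if it is sleeping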
When there are no jobs and there are resources available, Xeduler does poll to see if any jobs were deposited. This is relatively low impact and only incurred when there are resources available. When the system determines that all the resources are full, it sends itself a SIGSTOP and effectively goes to sleep. Why this odd strategy of falling asleep when things are at their busiest? First of all, the user may well be using the scheduling machine as a compute node too, and we don’t want to delay the computation with scheduling polling. What if more jobs are deposited while the scheduler is sleeping? It turns out it doesn’t matter, as there is nowhere for them to go anyway.
The trick is that the wrapper program which dispatched the job is, upon completion, set up to check whether Xeduler is sleeping. If so, it wakes it up. Now there is obviously a newly freed resource and the next job can be assigned. This means that the scheduler overhead is absolutely nothing during full-capacity operation and that newly freed resources are captured and utilized immediately.
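The whole sleep/wake handshake amounts to two signals (a sketch of the idea, not a quote from the source; $sched_pid is a stand-in for however the wrapper learns the scheduler’s PID):

kill -STOP $$             # in the scheduler: all resources BUSY, suspend myself
kill -CONT "$sched_pid"   # in a finishing wrapper: a resource just freed, resume the scheduler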
Because I was so bewildered by the other systems I tried to use, I intended to make Xeduler very easy to understand in its entirety. The source code that does the actual work is shockingly small, only 2951 bytes. The program as distributed adds 5kB of illustrative comments in the code with the intention that a normal administrator of a Linux cluster should be able to read and understand the entire source code in less time than it takes to read this page up to this point. You are encouraged to do that and empower yourself with the ability to utilize Xeduler’s capabilities to the limits of your skill and imagination.
Download
The complete source code should be easy to obtain. It is a single Bash text file suitable for reading.
Current Limitations
Xeduler was developed to solve a very specific problem which I felt was easier to solve from scratch than to install a more comprehensive package that did more than I needed. Consequently, there are plenty of missing things that would be nice to have.
Some of these missing things should be separate programs and perhaps I’ll get around to implementing them. The two big ones are health monitoring and load balancing. By creating separate facilities to handle both of these tasks, Xeduler can still be used for what it’s good at - job scheduling. The interface which these advanced features would use is as easy as it gets.
Another area that could be easily improved with separate software is fault recovery. If you send a job out and the node it’s running on dies, then what happens? As far as Xeduler is concerned, that’s not a scheduling problem strictly speaking. If the machine where Xeduler itself is running dies, then that is more of a scheduling problem and there are many possible tactics to deal with that. The fact that everything is naturally saved to a file system means that when the machine is rebooted, it would be fairly straightforward to reconstruct what was where and check up on that. Again, best left to some other program that specializes in such a thing.
Xeduler’s main limitation right now, which should be fixed, is one of permissions. I created this to be run by one user and that’s what it does. But it would be trivial to add the capability to have it run as root and dispatch jobs with an effective user ID of the owner of the job file. If that’s interesting to you, let me know and I’ll add it.
Another improvement which could be made to Xeduler itself is in configurability. There is already a ton of flexibility in what you can do, especially considering how easy it is to do simple things. However, if you had situations where certain jobs should be favored on certain machines, or some users can only use certain nodes, or others can only use the cluster at certain times, and so on, sorting that out would take a complex system. I feel it was correct to design and actually create the simple system first so that users with a simple problem can have a simple solution. Simplicity was the big objective here.
The final limitation that should be mentioned is performance. While Xeduler works well for many workloads and is quite clever and elegant, it does use Bash. This means that if you’re trying to dispatch thousands (or perhaps even dozens) of jobs per second, Bash could be the limiting factor that begins to cause trouble. Perhaps one day I’ll implement the same functionality in a C program, which should eliminate the performance issues.