SWIG (Simplified Wrapper and Interface Generator) Notes

The SWIG website is a good starting point: it explains what SWIG is and includes a nice tutorial.

SWIG lets software developers take core code written in C or C++ and make it natively available from other, more convenient languages.

SWIG can automatically generate modules for the following programming languages:

AllegroCL
C# Mono
C# .NET
CFFI
CHICKEN
CLISP
D
Go language
Guile
Java
Lua
MzScheme/Racket
OCaml
Octave
Perl
PHP
Python
R
Ruby
Tcl/Tk

An Example

Python Version

First let’s imagine a computationally expensive problem. Say we have a large amount of text in which we need to find instances of alliteration. For our simplified purposes, this means we give a file name and get back the number of times a word is followed by another word that starts with the same letter (and if you look closely, runs of consecutive matches earn bonus points: the second match in a run adds 2, the third adds 3, and so on). The first thing I might try is to write a Python program:

alliteration.py

#!/usr/bin/python
import sys

def filesalliteration(filename):
    state= 0
    old= 0
    n= 0
    t= 0
    f= open(filename,'r')
    while 1:
        c= f.read(1)
        if not c:
            break
        if c == '\n':
            state= 0
        elif c == ' ':
            state= 0
        elif c == "\t":
            state= 0
        else:
            if state == 0:
                state= 1
                if c == old:
                    n+=1
                    t+=n
                else:
                    old= c
                    n= 0
    f.close()
    return t

if __name__ == '__main__':
    for fn in sys.argv[1:]:
        a= filesalliteration(fn)
        print "[%s]:%d"%(fn,a)

This is what this program looks like when run:

$ ./alliteration.py samples/159*
[samples/1591.txt.utf8]:1693
[samples/1598.txt.utf8]:1278

Running alliteration.py on the whole sample directory looks like this:

$ time ./alliteration.py samples/* | awk 'BEGIN{FS=":"}{T+=$2;N++}END{printf "Total %d in %d files. Time:",T,N}'
Total 83440 in 27 files. Time:6.683

Note that I have summarized the output with Awk. The real thing to notice here is the time (in seconds).
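
To make the scoring rule concrete, here is a small sketch that applies the same state machine to an in-memory string instead of a file. The function name alliteration_score is hypothetical (it is not part of alliteration.py), and the snippet is written so that it should also run under Python 3:

```python
def alliteration_score(text):
    """Score a string with the rule filesalliteration() applies to a file."""
    in_word = False   # plays the role of `state` in alliteration.py
    old = None        # first letter of the previous word
    n = 0             # length of the current run of matches
    t = 0             # running total
    for c in text:
        if c in ' \t\n':
            in_word = False
        elif not in_word:
            # c is the first letter of a new word
            in_word = True
            if c == old:
                n += 1
                t += n      # longer runs earn progressively more
            else:
                old = c
                n = 0
    return t

print(alliteration_score("big bad bear"))   # 1 + 2 = 3
print(alliteration_score("a b b c c"))      # 1 + 1 = 2
print(alliteration_score("Big bad"))        # 0, the comparison is case-sensitive
```

The second and third words of "big bad bear" each continue a run, so they contribute 1 and 2 points respectively.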

C Version

Let’s say the Python version is just unacceptably slow. I can now write a C version of the program. Here’s what that looks like:

alliteration.c

/* alliteration.c */
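#include <stdio.h>

/* Prototype, so main() can call the function defined further down. */
int filesalliteration(char *fn);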

int main(int argc, char * argv[]) {
    int count;
    int allis;
    if (argc > 1) {
        for (count = 1; count < argc; count++) {
            allis= filesalliteration(argv[count]);
            printf("[%s]:%d\n", argv[count],allis);
        }
    }
}

int filesalliteration(char *fn){
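    int c;  /* int, not char, so that EOF is distinguishable from any byte value */
    int old=0, n=0, t=0, state=0;
    FILE *thefile = fopen( fn, "r" );
    if (thefile == NULL) {  /* skip unreadable files instead of crashing in getc() */
        perror(fn);
        return 0;
    }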
    while ((c= getc(thefile)) != EOF) {
        switch(c) {
        case '\n' :
            state= 0;
            break;
        case ' ' :
            state= 0;
            break;
        case '\t':
            state= 0;
            break;
        default:
            if (state == 0) {
                state= 1;
                if (c == old) {
                    n++;
                    t+=n;
                }
                else {
                    old= c;
                    n= 0;
                }
            }
            break;
        }
    }
    fclose(thefile);
    return t;
}

Notice it’s very similar to the Python program (most C programs aren’t so lucky!). We have to compile this program before running it, which looks like this:

$ gcc -o alliteration alliteration.c
$ time ./alliteration samples/* | awk 'BEGIN{FS=":"}{T+=$2;N++}END{printf "Total %d in %d files. Time:",T,N}'
Total 83440 in 27 files. Time:0.184

Notice that the time dropped from about 6.7 seconds for the Python program to about 0.2 seconds for the C version. But C is a fussy language, and it is not always fun or quick for developing bigger bodies of code.

SWIG

Now let’s explore what SWIG can do.

Interface File

The first thing we must do is write an interface file that tells SWIG what to work on. In this case that would look like this:

alliteration.i

%module allit
%{ extern int filesalliteration(char *fn); %}
extern int filesalliteration(char *fn);

The details only get complex if your requirements are complex. The simple explanation is that every function the module should expose must be declared here. SWIG can also read these declarations straight from a C header file with its %include directive. For a proper explanation of interface files, see the SWIG documentation.

Running SWIG

Now run SWIG itself to prepare the module code and the wrapper code.

$ swig -python alliteration.i

Compiling The Wrapper

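SWIG writes the wrapper code to alliteration_wrap.c (along with the Python module allit.py), and the wrapper must be compiled. The compiler needs the Python development headers, so make sure they are installed (yum install python-devel, apt-get install python-dev, etc.) and point -I at the header directory for your Python version. Now compile the wrapper: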

$ gcc -fPIC -c alliteration.c alliteration_wrap.c -I/usr/include/python2.4
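
The include path above is specific to one machine. As a sketch, assuming a Python new enough to ship the sysconfig module (2.7, or 3.2 and later; the Python 2.4 shown here predates it, and distutils.sysconfig.get_python_inc() plays the same role there), the interpreter can report its own header directory:

```python
import sysconfig

# Directory that holds Python.h for the interpreter running this script.
include_dir = sysconfig.get_paths()["include"]

# Assemble the same compile command as above, with the detected path.
compile_cmd = ["gcc", "-fPIC", "-c",
               "alliteration.c", "alliteration_wrap.c",
               "-I" + include_dir]
print(" ".join(compile_cmd))
```

Run it with the same interpreter that will later import the module, since the wrapper has to be compiled against that interpreter's headers.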

Linking The Module Into A Library

Now there is an object file for your C program (alliteration.o) and one for the wrapper (alliteration_wrap.o). These must be combined into a shared object library (_allit.so) that the generated Python module imports. Here’s how to use the linker to assemble this shared library:

$ ld -shared alliteration.o alliteration_wrap.o -o _allit.so

Testing in Python

If all went well, you should now be able to use a functional Python module. Let’s try it out directly:

$ python
Python 2.4.3 (#1, Sep 21 2011, 19:55:41)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import allit
>>> allit.filesalliteration('samples/1598.txt.utf8')
1278

Seems to work. Now our Python program can look like this:

alliteration+c.py

#!/usr/bin/python
import allit
import sys

if __name__ == '__main__':
    for fn in sys.argv[1:]:
        a= allit.filesalliteration(fn)
        print "[%s]:%d"%(fn,a)

Results

Now we can compare the performance of these various strategies:

$ time ./alliteration.py samples/* | awk 'BEGIN{FS=":"}{T+=$2;N++}END{printf "Total %d in %d files. Time:",T,N}'
Total 83440 in 27 files. Time:6.871
$ time ./alliteration samples/* | awk 'BEGIN{FS=":"}{T+=$2;N++}END{printf "Total %d in %d files. Time:",T,N}'
Total 83440 in 27 files. Time:0.177
$ time ./alliteration+c.py samples/* | awk 'BEGIN{FS=":"}{T+=$2;N++}END{printf "Total %d in %d files. Time:",T,N}'
Total 83440 in 27 files. Time:0.193

You can see that native C is only slightly faster than the Python program that calls the SWIG-wrapped C code, which in turn is a huge improvement over pure Python. The nice thing about SWIG is that if you have some fundamentally effective C code, you can make useful modules for many languages pretty much automatically and get a lot more people to use and take an interest in your software.
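
As a quick worked check of those numbers, the ratios can be computed directly from the three timings above. These are single runs, so the results are only rough; the variable names are just for illustration:

```python
# Wall-clock times in seconds, copied from the runs above.
timings = {
    "alliteration.py": 6.871,      # pure Python
    "alliteration": 0.177,         # native C
    "alliteration+c.py": 0.193,    # Python calling the SWIG module
}

pure_python = timings["alliteration.py"]
native_c = timings["alliteration"]
swig_module = timings["alliteration+c.py"]

speedup_swig = pure_python / swig_module
speedup_c = pure_python / native_c
overhead = (swig_module - native_c) / native_c

print("SWIG module vs pure Python: %.1fx faster" % speedup_swig)
print("Native C vs pure Python:    %.1fx faster" % speedup_c)
print("SWIG module vs native C:    %.0f%% slower" % (overhead * 100))
```

On these figures the SWIG-backed script is roughly 36 times faster than pure Python, while costing about 9 percent more than the standalone C binary.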

System Call Alternative

One question that arises for simple cases like this example is: why not just have Python run the compiled C program as an external command? That would be a Python program that looks like this:

alliteration+sh.py

#!/usr/bin/python
import sys
import os
if __name__ == '__main__':
    for fn in sys.argv[1:]:
        for out in os.popen('./alliteration '+fn).readlines():
            print out.strip()

Besides being a potential security issue, since possibly untrusted input (the file name) ends up in a shell command line, the performance is not very competitive (in this example at least):

$ time ./alliteration+sh.py samples/* | awk 'BEGIN{FS=":"}{T+=$2;N++}END{printf "Total %d in %d files. Time:",T,N}'
Total 83440 in 27 files. Time:0.323
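
The injection risk mentioned above comes from building a shell command string out of the file name. Here is a minimal sketch of one way around it, using the standard subprocess module with an argument list so that no shell is involved. The helper run_command is hypothetical, and the demonstration spawns the Python interpreter itself as the child process so that it runs anywhere:

```python
import subprocess
import sys

def run_command(argv):
    """Run argv (a list of strings) without a shell; return its stdout as text."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out.decode("utf-8", "replace")

# A file name that would be dangerous if pasted into a shell command line.
hostile = "x.txt; echo injected"

# The child simply echoes its first argument back. Because no shell parses
# the string, the ';' is passed through as ordinary text.
echoed = run_command([sys.executable, "-c",
                      "import sys; print(sys.argv[1])", hostile])
print(echoed.strip())
```

In alliteration+sh.py the call would become run_command(['./alliteration', fn]). This addresses only the security concern; each file still requires starting a new process, so the timing is unlikely to improve.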

Unix One Liner!

I wondered how doing this operation with classic Unix tools would compare. Here is the entire process expressed as a single Unix pipeline:

$ time cat samples/* | tr " " "\n" | grep -v "^$" | sed "s/^\(.\).*/\1/" | awk '{if(L==$1){N++;C+=N}else{L=$1;N=0}}END{printf"Total %d. Time:",C}'
Total 83440. Time:5.440