I do work for a computational structural biology lab and one related topic that I’ve become somewhat interested in is protein secondary structure. The super quick summary is that your body (and all life that we know of) has DNA encoded as a blueprint for what to build. You would think that by looking at the DNA we could tell what sort of 3D physical biomolecular machines would result. But we can’t! It’s a huge unresolved problem.

To understand secondary structure, imagine a skein of yarn where the yarn can change colors every millimeter so that its position can be encoded in binary. In simpler terms, you can tell what exact part of the yarn you’re looking at anywhere along it. Now you want to know, for mm number N, will this be part of the yet unknit sweater on the outside or the inside? Note that this is a lot easier than asking which part of the sweater (sleeve, neck, etc.) it will show up on. You could imagine something easy where it looks like this: in, in, in, out, out, in, in, in, out, out, etc. Protein secondary structure is a bit like this problem, only imagine the yarn after a cat has thoroughly tangled it and we want to know, at mm N, is it in a big knot? Or is it in a dangling loop? But you can see that if a section was just in a big loop, it might still be in a loop (statistically), or after enough of that, it could be time to expect a transition to a big knot.

With proteins, the yarn is not homogeneous. Each little segment is a single amino acid in the polypeptide chain. There are 21 different flavors of amino acid if you count selenocysteine alongside the 20 standard ones. In my sweater example, there is an inside that touches your body and an outside. In protein secondary structures, there are three major categories: alpha helices (think old telephone handset cords), beta sheets (the main ingredient of silk, to loosely stitch together my sweater analogy), and what I like to call "other". There are other classification schemes that are fancier, but let’s start simple.

The problem then is this. Starting with an amino acid sequence (converting from DNA to amino acid is not complicated), what sorts of loops and turns can be expected from each part of the chain even though it is unknown how those loops will be arranged in the big picture? (Sweater? Hat? Mittens? Don’t know that.)
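To back up the claim that going from DNA to amino acids is not complicated, here is a minimal sketch of the lookup using the standard genetic code. It assumes a clean coding strand that is already in the right reading frame and contains only A, C, G and T, which real genomic data does not guarantee (introns, reverse strands and frame selection are all ignored). The names `CODON_TABLE` and `translate` are just illustrative.

```python
# Standard genetic code packed in the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon -> amino acid lookup table.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding strand into one-letter amino acid codes.

    Translation stops at the first stop codon; a trailing partial codon
    is ignored.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # A made-up toy sequence, not taken from any real gene.
    print(translate("ATGGAAGAATGGTAA"))  # MEEW
```

The hard part of the problem is everything that comes after this step.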

Can it be done? I felt like the answer should be yes. It seemed to me like this was a similar problem to machine translation of human languages like Spanish to English. In fact, here is a much closer analogy. Imagine you had a huge body of text and some linguistics researchers had laboriously annotated every part of speech of every word, where "P" is pronoun, "V" is verb, "A" is adjective, "R" is preposition, etc. For example.


If I had a huge quantity of such annotated sentences, could I train a computer to tell me what the part of speech code is at position N of a new novel sentence? The fact that machine translation exists tells us that the answer is probably yes since this problem seems easier. Here is what the training data looks like for protein secondary structure. The top line is the amino acid (e.g. "E" is glutamate, the main stuff of MSG). The bottom line is the secondary structure codes. "H" is alpha helix, "G" is 3/10 helix, "I" is pi helix, "E" is beta strand, "B" is beta-bridge, "_" is coil, etc.
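As a rough sketch of how such paired lines could be prepared for a model, the snippet below one-hot encodes the amino acid letters and collapses the detailed structure codes into the three major categories (helix, sheet, other). The mapping used here (H, G, I to helix; E, B to sheet; everything else to coil) is one common convention and an assumption on my part, as is restricting the alphabet to the 20 standard one-letter codes. The function names and the toy sequence in the demo are invented for illustration.

```python
# Assumption: only the 20 standard amino acid one-letter codes appear.
AMINO_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Assumed 8-state to 3-state reduction; anything unlisted becomes coil "C".
THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(structure: str) -> str:
    """Collapse detailed secondary structure codes to H, E, or C."""
    return "".join(THREE_STATE.get(code, "C") for code in structure)


def one_hot(sequence: str) -> list[list[int]]:
    """Turn each amino acid letter into a 20-element 0/1 vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ALPHABET)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ALPHABET)
        row[index[aa]] = 1
        rows.append(row)
    return rows


def training_pairs(sequence: str, structure: str) -> list[tuple[str, str]]:
    """Pair every residue with its reduced structure label."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must be the same length")
    return list(zip(sequence, reduce_to_three_states(structure)))


if __name__ == "__main__":
    # Invented toy data, not a real protein.
    for residue, label in training_pairs("MEEWLKKA", "__HHHHGG"):
        print(residue, label)
```

Each position in the chain then becomes one (input vector, label) pair, which is the same shape as the part of speech tagging problem.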


It brings up an information theory question. Did evolution devise something much more horrifically complicated to encode semantics in our genomes and proteomes than our cultural evolution did with our spoken language? I’m not ruling it out, but I’m also not immediately thinking of a reason why that would be likely to be true.

In researching this topic I discovered that I wasn’t the first person to suspect that this problem could be solved better with machine learning than other approaches. Indeed, here is a paper from 1993 which attempts to apply neural networks to this problem. Of course the tiny little model that was used was adorably pathetic by today’s standards. It is the most basic neural net architecture right out of an introductory textbook. I suspect that such an architecture will not make any sense. And this paper only trained on about 126 sequences. That’s really not enough for any sensible thing. By comparison, I have 1400 ready to go right now with between 50 and 70 AA (selected to minimize length variation). I can easily get 100k more from the PDB. Instead of two tiny layers, I can have a dozen with 1000s of neurons each to really give my model a good shot at global feature detection.

Over the subsequent years, many other papers have been published and the complexity and accuracies have steadily gone up. Although I hardly understand all the technicalities of machine learning, when I first heard about recurrent neural networks I thought it sounded like a pretty good technique for the secondary structure problem. And sure enough, it has been tried (last year) and it seems to be the way to go. That paper does seem like they really went crazy with the model complexity. I am curious to know if that is really necessary, but I’m still working out how to implement this myself. And find time to!

That’s the setup. Why do I think RNNs might be useful? They tend to do well with matching input sequences to correct output sequences and that’s the exact requirement. They allow the model to remember features from earlier in the sequence which may be useful later on. And, well, they seem kind of magical.

The basic concept of a recurrent neural network is that as the system processes input, the outputs are fed back into the system. The system is tuned to balance this process. First, it selectively accepts new valuable information from the inputs. Second it discards old information that seems to be ineffective. This is a massive simplification. There is fractal complexity at every turn and the tuning of the information gates can be quite baroque involving quite intricate circuitry. This incredibly well-crafted and clear lesson on the esoterica of RNNs is worth looking over just to see how to teach a difficult topic.
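To make the feedback idea concrete, here is a minimal sketch of the forward pass of a plain ("vanilla") recurrent network in pure Python. Note that this is the ungated version: it has none of the accept/discard gates described above, which are what LSTM-style cells add on top. The weights are random and untrained, so the outputs are meaningless; the point is only to show the hidden state being carried from one position to the next and one prediction coming out per input. All names and sizes are illustrative assumptions.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(inputs, w_xh, w_hh, w_hy):
    """Run a vanilla RNN over a sequence, one output distribution per step."""
    hidden = [0.0] * len(w_hh)  # the "memory", initially empty
    outputs = []
    for x in inputs:
        # New hidden state mixes the current input with the previous state.
        mixed = [a + b for a, b in zip(matvec(w_xh, x), matvec(w_hh, hidden))]
        hidden = [math.tanh(m) for m in mixed]
        outputs.append(softmax(matvec(w_hy, hidden)))
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    input_size, hidden_size, output_size = 4, 8, 3
    w_xh = make_matrix(hidden_size, input_size, rng)
    w_hh = make_matrix(hidden_size, hidden_size, rng)
    w_hy = make_matrix(output_size, hidden_size, rng)
    # Five one-hot inputs standing in for five positions in a chain.
    sequence = [[1 if i == t % input_size else 0 for i in range(input_size)]
                for t in range(5)]
    for step, probs in enumerate(rnn_forward(sequence, w_xh, w_hh, w_hy)):
        print(step, [round(p, 3) for p in probs])
```

For the secondary structure problem, the inputs would be encoded amino acids and the three outputs would be helix, sheet and other.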

More research uncovered another exceptional resource. Andrej Karpathy, a Stanford CS PhD student at the time, produced this fantastic article, The Unreasonable Effectiveness of Recurrent Neural Networks. He backs up that sentiment with some amazing examples of what a (relatively) simple RNN can do. For example, he trains on the works of Shakespeare and is able to synthesize ersatz Shakespeare which I can barely discern from the real thing.

He synthesizes Wikipedia content, C code, and LaTeX from a math book that all looks shockingly real. The amazing thing about this is that it isn’t combining words or phrases or other high level tokens. This is generating plausible content based on characters.
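To show what "based on characters" means in practice, here is a deliberately crude stand-in. It is not an RNN and not Karpathy's code: it is a simple bigram counter that only looks at the single previous character. What it shares with the real thing is the generation loop, where the model gives a distribution over the next character, one character is sampled, appended, and the process repeats. An RNN replaces the lookup table with a learned hidden state that can remember much further back. The function names are made up for this sketch.

```python
import random
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, seed_char, length, rng):
    """Sample text one character at a time from the bigram counts."""
    out = [seed_char]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # this character was never followed by anything
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog. " * 20
    model = bigram_counts(corpus)
    print(generate(model, "t", 80, random.Random(0)))
```

Even this toy produces vaguely word-shaped output, which gives some feel for how much further a model with real memory can go.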

Not only does Karpathy have the goods to do miracles with RNNs and then write lucidly about it, he provides source code to try it yourself. I must stop here and acknowledge that there are a lot of idiots in the world and this includes an astonishingly high percentage of computer programmers. I am typically pretty horrified with the code of other people, but I want to point out that I am capable of appreciating brilliant code when it makes its fleeting appearances. And Karpathy is a genius. Check out this brilliant program. Short, clear, and requiring no annoying hipster dependencies. Not only that, but even I got it working!

I was able to train it on the corpus of all my blog posts, about 500kB. When the program starts this is what comes out.

^Xc.d1M-^YM-^BZ:^U,M-^\zj`Vp,t\l)S(Ft@Z6Oq'1lvdQG5[M-^O}Q?-Us $

As you can see this is essentially pure noise. The guesses about what letter comes next are essentially meaningless. This could be the beginning of something trained in Chinese.

But quickly, about 10 seconds later, this starts to appear.

g ]enecaatarpo nhtt3rle7arg r-wsor$
s AtM-Bu[-U/Ik : =s cg7srhpijeh)shsis s I7-. tess,=si#Yholef Lhoyg$
s feopMpcserhefBnrel=Is:n (Ne vWhte-P woonnbuk]H^CtTwot.che$
mile.e44ondviheioK_Wo $
ual te ct77lited me 3ywto-:_slOwh l aw _bM-^@A [onaavy hesbik,t slt tilbesssm
lfe7 bandnd_ptos$
2>yo i f cttaiw1ee0s $
atcrR saD$
no s aulvhel+nnabtucilscip ycas$
othedomdei.eni tP 5aord i nsson pionc ely or.o=minn _ilr me bouroeil niulasum$
ibgat tonle msrnplaomyde fyqh lh__=op tosFaorses linl toen/awertiltohyaainl $

Clearly this already is not Chinese. Dutch maybe? After leaving it all night, I scraped off the last megabyte of output and grepped the word "Microsoft".

j0 cing, I devaving navery of this.no of to completit Microsoft
ig lot an a him astoplest to idn't that on Microsoft which
machinastly, thot have beint.
hat toisien mewebles in a duch whokn your by Microsoft cisumenced
metwiously Microsoft and maker, you andly all can it ideally and un
This that's Microsoft that Pymase the everywny would see famed from
Microsoft" is say
log/2206==/xe-Zy.com/max-forhinct-radare/peg[Miczor menter thans, is
work Microsoft
gyeard that may this the of Microsoft not preater that oped
== wwa be you Microsoft. Bat Mich mishon; the transiling a be of the
interrentever Ararent have centloven bict and to me of Fur
Anjengor,  Heruend or about of - 10 or bit swould when, Microsoft
intereringer hat, benowary too
Microsoft "dolen/[book starre. My and
of at create Microsoft owight the drive smal(2.1332052/toxmLM//APre
that is worghow, swad the treasy
with so Microsoft]. I bear on justion, Gen, take off a web that I was
right will trickelt attemut how cutonced
worager some of conton do the presic learnise" in Microsoft a security
is such,
Microsoft shed dad verious g
haur * Cryside fill or quit us/angity to somerrowal_disive tays Goumen
and the spire watt on they can it Microsoft and the worme saprint as
beatir what bade itefend), Or Incless to do you've di
Microsoft annarattich
them)! I was ablest a that. It was need to Microsoft and use
n soidty Microsoft place othel of my besiverded-ranely Waz suc,
(lavents. It dilligid I've

As you can see it correctly noticed that I have a special interest in Microsoft. But consider that it didn’t just find Microsoft in the corpus. It learned about that by reading it. It learned to spell it and capitalize it. The rest of the text is pretty nonsensical, but it looks ok. From a distance it seems more representative of English than lorem ipsum. Certainly if I were planning generic page layouts of my blog, I’d use this text.

Another interesting feature this model picked up was that I like to link to Wikipedia. Here are some examples.

https://en.wikipedia.org/wikile=m/Fuctine/disting--ports." Herestive,
wikipedia.org/wiki/Dowemer-tho-[Ad3s. Pais I read to supperation of
most any midefining staccion

 s in every to riss "veres dasf extorty the stroge to insoftheiging
 bely aysned by I'm pomelarn drea to actually for this shill
 https://en.wikipedia.org/wiki-tomm[Bal tin's you've a open in a if

https://en.wikipedia.org/wiki/Todinefullws.com/pitn.com/Se=ftm* be
engy. I emalle


sisuate] ackird wwats mabion to at're)ds al and

https://en.wikipedia.org/wiki/Ponters-bubst.com/Mivi1.html[resebulary up

"https://en.wikipedia.org/wiki/gure(Lity definiteter
_________________________dopecoPrywgrmlesm=0/Vige: If yess, perprial




This is pretty astonishing. As you can see it gets the beginning right reliably, but when the specifics are required, you can see where the system runs into its limitations. It often even tries to follow the URL with the bracketed anchor text (which is the AsciiDoc syntax I use).

All good fun. Maybe one day I’ll figure out how to apply this powerful magic to the secondary structure problem. Others are more ambitious. The modern on-line assistants like Siri and Alexa are using RNNs, among other tricks. The idea behind an RNN chat bot is that the input is treated like a sequence and over time an output is learned. This works best on a limited domain situation (resolving simple tech support problems, for example). If you train it on a zillion requests made by customers and what the corresponding sequence of the call center’s response was, that, apparently, is kind of sufficient for the model to synthesize new answers to similar problems. Pretty freaky really.