Anagrams

:date: 2023-07-24 15:50 :tags:

This post has many purposes.

It illustrates an interesting time in history where LLMs are very strong and competitive with humans at certain tricky technical topics, but it's still a close race.

It will demonstrate a use case where some very esoteric regular expression magic is actually practical and useful. (Normal people can safely skip this part!)

And it will show you how you can use your computer's default unix tricks to solve anagrams.

I always read Merriam-Webster's Word Of The Day article and it often ends with an anagram puzzle. I can usually solve it just by the other clues but when I can't I like to use a simple bit of unix magic to give me a hint. On Linux (maybe Apples?) you can search for a regular expression like this.

grep '^[rtie]\+$' /usr/share/dict/words

In this case the anagram I'm looking for is something with the letters r, t, i, and e. This command will look through my system's word list and find words that contain one or more of these characters. I can further refine it by specifying I want 4 character words only with ^[rtie]\{4\}$. For sure the syntax is not pretty but it is succinct and effective. The problem with this exact strategy is that we will get not only correct words like rite, tier, and tire, but we will also get incorrect words like tree. However this strategy is usually sufficient to spot the solution and the puzzle is solved.

Sometimes however the puzzle turns out to be trickier. Today's was looking for a synonym for "evince" unscrambled from the following letters "tmiesafn". (I'm going to give the answer below, so solve it yourself now if you're keen to.) Using the strategy above gives me a command like this.

grep '^[tmiesafn]\{8\}$' /usr/share/dict/words

This produces 71 matching words e.g. "feminist", "satanism" etc. However only one of the matches uses these letters without duplicates and that can be difficult to spot. Is there a way to improve this and have it do what I really want?

For such low stakes tasks that are easily verifiable, you should now be starting to ask yourself if this is something a modern LLM AI assistant can help with. In the past it would not have been worth the effort for me to come up with a better solution to such a trivial problem but today, why not? There are a lot of problems like this which in the past just weren't worth messing with but today are.

So I gave claude.ai a crack at it. And it instantly responded with 5 complete and entirely different strategies. All completely wrong. So wrong, that I could see that they were wrong without further checking. (Things like "pipe to uniq" which is a solution to some problems but definitely not this one.) How about ChatGPT? Starting with 3.5 it also stumbled and produced something that seemed plausible but smelled wrong to me. I went ahead and tested it and sure enough it was wrong. As I pointed out the problems, it devolved into a more appropriate position of humility admitting defeat.

GPT-4 was a different story. Its answer not only sounded confident and convincing, it was also correct. Here is the condensed solution.

WORDS=/usr/share/dict/words  # Your word dictionary path.
A=tmiesafn                   # The input anagram.
grep -P "^$(echo $A | sed -r 's/(.)/(?=.*\1)/g').{${#A}}$" $WORDS

Running this produced the correct solution which is "manifest". And normal people can appreciate that these AI agents are getting more clever than clever experts at stuff like this and stop reading here and have a nice day.

If you'd like to know more about what exactly this crazy regular expression is all about and learn about a situation when some of the most esoteric features of modern regexp engines are not just "nice to have" (or overkill!) but essential, read on.

What's happening here is that the characters of the anagram are being broken apart and sed is used to dress up each character with something like this: t becomes (?=.*t) and m becomes (?=.*m) and so on. This creates a regular expression that fully written out looks like this (which also works).

grep -P "^(?=.*t)(?=.*m)(?=.*i)(?=.*e)(?=.*s)(?=.*a)(?=.*f)(?=.*n).{${#A}}$" $WORDS

It was very clever of GPT-4 to know how to create this kind of regexp from the input string using sed, but what is this and why is it useful here?

This syntax is using a "positive lookahead" token for each character. It ensures that the expression contains one (and only one) of each letter. It's hard to explain completely because I personally can not think of any other compelling use case for it. I can't get GPT-4 to think of any either. We can both think of examples, sure, but it's hard to find an example that really needs this. By far the best example I've ever seen is this anagram solution!

The other thing to note — and why I totally ignore this kind of regular expression feature creep — is that the -P invokes PCRE (Perl Compatible Regular Expressions). This is a variant with fancier features. But since programs like sed, awk, and vim (and many others) don't bother, it makes learning this less worthwhile. But GPT-4 learned it, learned it really well, and did not forget about it when the perfect time to deploy it arose.

The take away here is that we're at a time where a lot of your small technical problems that you know (or suspect) are solvable in principle but annoyingly difficult in practice should now be reappraised to see if modern AI can effortlessly help. Some of these problems will just magically disappear.

Note however that anagrams are not completely solved generally by the solution above! Words that have multiples of the same letter fail with GPT-4's strategy. When I asked for possible solutions it struggled to produce an elegant command line approach. Like I say, we're at a transition time where it could go either way.