Tuesday, November 13, 2012

Codes and Keys

You know what's cool? Cryptography. Cryptography is SO DAMN COOL. Cryptography is cool, I think, because it sits in this really neat spot between math and language and computer science. And, of course, if you're like me, you like playing with ideas of information and all that. The idea that a text could be full of information utterly unreadable to most people is fascinating. I don't remember not being able to read, because I learned to read very early. But there's something enchanting about the familiar symbols of literacy--word, space, letter--twisted into something incomprehensible to you. Simple cryptograms, the kind you get in newspaper puzzle columns, have these disparate patterns that some part of the mind latches onto, trying to twist the letters back into the words you know and love--is that "ads" a "the"? Would that explain the "adsks" later on in the sentence?

Of course, if you think the idea of turning information into apparent gibberish and back is interesting when you have the key, I think you probably can't help but be enchanted by the idea of breaking codes. I recently learned to pick locks; the psychological drive is no different. Actually it's not that different from scientific curiosity. This is a pattern, there is information here--how can I tease it out? Pushing pins, analyzing letter frequencies, solving equations: these are just more ways of me asserting my mind on the unknown or (for locks) disallowed. Anything I don't understand and anywhere I can't go are limitations, and what can you do with a limitation but try to think your way around it?

Enough philosophizing, let's talk crypto.

Disclaimer: I don't actually know a whole lot about cryptography, I'll admit. I have read Neal Stephenson's Cryptonomicon, and that's probably the most advanced text I've gotten on the subject. I once read Dan Brown's Digital Fortress but I think that counts negative. Cryptographers: if I misuse terminology in this blog post, I apologize. This is certainly an example of me just fooling around.

My recent interest in cryptography started in my intro-level Botany class on Monday. (We all need our Gen Eds.) I was bored, and the professor was talking about evolution. I don't know if it was evolutionary algorithms or genetic codes or what, but I started thinking about codes. On the back of last week's exam, I scrawled out a simple rotating cipher. For those who don't know, what you do is write out the message, then write the key underneath it. Then, you "add" the letters together, numbering them 1 to 26. So A + B = C because 1 + 2 = 3. A + I = J because 1 + 9 = 10. If you go over 26, you divide by 26 and take the remainder. (Math people will recognize this as "mod 26" arithmetic.) The message "MESSAGE" encoded via the key "ABC" would look like:

MESSAGE
ABCABCA
NGVTCJF

So "MESSAGE" becomes "NGVTCJF." This code has some good properties. A simple cipher, where you just exchange each letter with one other letter, can be broken easily by picking out common words, or just counting the letters--"E" will appear the most. But in our code, "E" maps not to one letter but to several--so that attack can't be done as simply.

Once I started thinking about this I thought several things, kind of all at once.

1) It would be really easy to teach a computer to encrypt and decrypt messages this way, given a key.
2) It wouldn't be that much harder to teach a computer to break these codes, perhaps. More on the method 

One more complication: I decided to write this all in Python. This is a complication because I didn't really know any Python until I decided this--I had installed it once and flipped through the beginning of a tutorial, but had never motivated myself to actually learn it. My default programming language is Fortran 90, which is what I use to do research. (yes, yes, Fortran. Laugh, why don't you?) Thinking through how to do this in Fortran, I realized it might be time to pick up a more modern language. I've attached all my code--if it looks hilariously awkward or weird, please blame my inexperience with the language. I have no doubt it could be waaaay more elegant.

The first thing I needed to write was a dictionary. I needed to define a map from letters to numbers. But wait. There's more to writing than letters, right? I mean, for one, there are uppercase and lower letters. Also, spaces, periods, commas, semicolons, dashes...you get the idea. I opted for simplicity--all letters are treated as lowercase. I added periods, commas, dashes, question marks, and spaces. Anything that wasn't in those categories got translated to a space. Once those were numbered, I was done--I could take a number and get its corresponding character, and vice versa. I could also put in a string of characters--a word, a paragraph, a page--and get a list of numbers out, and vice versa. That's good! Computers are better at working with numbers than letters, so I needed to be able to convert that. (All these functions are in dictionary.py, if you want to see the code.)

Next up, the actual encryption program. This part is pretty simple, actually, once the dictionary was done. Take in a string. Convert it to a list of numbers. Take a key. Convert it to a list of numbers. Add them together, meaning you get a new list of numbers. Turn that into a string. Boom, encryption. Make that addition a subtraction, and it's decryption. Let's see an example. I'll encrypt the first line of the Gettysburg address with the key "lincoln"


>>> encrypt.encrypttext("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.","lincoln")

Which outputs:

'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlyg'

Since I have the key, this is easily decrypted:

>>> encrypt.decrypttext(_,"lincoln")
'four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.'


Notice that this is now in all-lowercase. This is very fixable--all I have to do is expand the dictionary--but I'm not worried about it right now.

Ok, well, that's cool I guess, if you like nerdy shit. (I do.) But suppose we stole this encrypted message off of one of our enemies, and we want to break it? We could try to see what the most common characters are--the problem is that what means "e" in one place means something different in another, so our letter frequencies won't be right.

Here, we apply a bit of insight. "E" doesn't map to the same letter every time--but every seventh "E" will, since "Lincoln" has seven letters. In fact, if we looked only at every seventh letter, then we'd be adding "l" every time--and that is a very simple cipher to solve, no more difficult than a newspaper cryptogram. While we won't have any particular words we can pick out, we don't actually need our encoded text (ciphertext) to have any meaning--we just need it to display standard English letter frequencies. Since English letters don't have a strong pattern every seventh letter, a "subtext" (my word--I don't know the real crypto word for this) of every seventh character works fine.

So, here's what I came up with: suppose I know the key length. Suppose I know the key is a seven-character string. Then I look at every seventh character. The most common letter repeated must stand for the most common character--which is actually not "E", but space, sorry for the deceit. Let's say the most common letter is "cipher-space." Cipher-space could be any character in my dictionary, but I know that it is formed from

"Space" + [first letter of the key].
I subtract
Cipher-space - Space = [first letter of key]

And I've found the first letter of the key! To find the second letter, I just add a letter of offset, so I'm looking at the second, ninth, sixteenth...etc. I repeat this until I'm done with the whole key.

Ok, well, that is a lot of nice theory. Does it work? To find out I had to first write a program that would generate letter frequencies from a string. That was neat to program but not particularly interesting, cryptographically, so I won't talk about it. Then, I wrote a program that implemented exactly the procedure I detailed above. Let's see how it does on our earlier example.


'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlyg'
>>> breakcipher.findkey(_,7)
'liscoan'

(The "_" symbol refers to the previous output--so it is just plugging in the ciphertext) Well, that didn't quite work--notably, it did ALMOST work. Here's a problem with our method, then: it doesn't work on short amounts of text. A seven-character key means that our message got cut down to sevenths, and then we tried to do statistics on that. Understandably, that's prone to error. Suppose we try the entire Address?

>>> encrypt.fencrypt("gettysburg.txt","gettysburg.txt","lincoln")
>>> open("gettysburg.txt").read()
'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlygfkp?cmcmmcapmpvtcupqkq.bokt?mnvnnvbqybfl hhbgb vyomyvpbsm bcsn h.cct,yemqaknybmpo vzvmu?kpzvpgwbrohnprkazhqgrtpl,rfkkplvmn?ytkm.fd?rghegnl phzgck,yhnbu?rl,mdo bwmlhwpyoh,hn ul,myo?ik rbvldphpq.pm wmfsovnibgnlm.w vwz.kwsbcsn hskswqhhnunlmqq.czk p.bk,rm.tneskszzmvvzaphej?kupzrbuldphbjst ktvxs-m pnvn ul,mpo vzvmowru hykepikqbbw-mltbqupbsm bttb q.inl.oh?t?.r?hbjo mcmmuvzcwlmf?kbsqa?lioa,jbwymlhycarr?hag,-rhhegnnnyh.qckqplveo rkglbfpmni.b,zbkk,pbpp?ibgnjlk rbql.kv,vnsnwt,ynjlk,ukbkt?wcprgm prbp?nbmmosyjktvxwytki.fnorlljbfs,k.btdrtwmqbvp pemjobrkk,pbpp?ibgrkv emho?mlj,xsk,azmr?z kx,ys?m wmcromzzmfs  lkb?n upheqawqk vnzkyt,bnsk.z,r-ny,?hyq,rm?mzg.mr?hejo mcmmuoemsm gkkoa,mkckplvmpsbr?hsqarr hejo m pr.novohugapikqbbw-mqw bd-m prbztdtvt-n?n prtkkbzhognoroqpccpqkprtskbzhbjskcynvpw-uplmy??xk ukqsm pr.ncuzhsqdru hugapmsidgn ua.mho?m-wmp?myehnfel.nmq?ntbkqabalbsm btz k?abczmmmmjs?rklrfwnn mqbczm prbu?rl,mvo-xkzroot.tvtbppszzrbd-mjgmvvlbkn q.kbsmagns,yw grkqpiqbfpm ixgnt.nzrcbpqklrx? vzvmv?kbsibbqlc-mmh??mcpvevkbsmgbuldphbjskyl.bbtaywhzgo-c?mmqtkqp-,vwz.kglbcsn hegnsr?mmjwruwbmts-,w-rbcsn hbjs-rklrcrkasiynny, hucepmoqrfnt.k-nk,kljhbjo m pvunyn q,pkkcylrtnr,oemuvlywhucepmlh.gfkotzbjnzskn gso,xhlanl.ohbjo mrwdgayzpvbb?qm prb-p,.tr-nmgk,ugn.rzxygkkszzmvvpm.m,rzpjk.uczwmywbb-p t.ubt?,xhbjskrlzbjji'
>>> breakcipher.findkey(_,7)
'lincoln'

THERE WE GO! Awesome, awesome. We took a scrambled text, knowing only the length of the key, and we pulled out the key via frequency analysis. Pretty sweet, imo, but I think we could do better. I mean, why would we know how long the key is? Let's get real here. 

Ok, ok. Well. How do we do this, then? Suppose we thought we knew the key length. We could decrypt it, then, we know that. And then we could check whether we had it right. Obviously, we could do this by just looking at it--but the point of this is to make computers do all our work, isn't it? (It is.) What's the difference between a real, decrypted text and an incorrectly decrypted string of gibberish? Well, actually, we already have the answer: letter frequency! Gibberish doesn't use e's more than q's, or vice versa. Those are properties of English language, and that's what we're trying to end up with.

So, here's our algorithm: first try a single-character key, and determine what letter frequencies that would give. If the difference of that and normal English letter frequencies (drawn from some reference text, I used Project Gutenberg's Beowulf, but obviously there are lots of possibilities for this) is too much, try a two character key. Then a three character. Etc. We do this until the key gets up to some particular proportion of the total length of the text.

There are two arbitrary parameters here: first, how far off of letter frequency is normal letter frequency? I picked a total 5% deviation, but there's not strong science behind it other than "this works." I tried 1% but found that too stringent. Another question is: how long do we try? At this point my algorithm stops when the key is one-fifth of the size of the message. That's waaaaaaay too big, because that only gives us five characters in each subtext. So far, it stops before it gets that high because it solves the problem. But obviously there's no guarantee of that, and a (much) lower stopping point would be a better idea, but I don't know what. Sounds like a good time to do some math.

Again, that's all theory. Let's test it.

'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlygfkp?cmcmmcapmpvtcupqkq.bokt?mnvnnvbqybfl hhbgb vyomyvpbsm bcsn h.cct,yemqaknybmpo vzvmu?kpzvpgwbrohnprkazhqgrtpl,rfkkplvmn?ytkm.fd?rghegnl phzgck,yhnbu?rl,mdo bwmlhwpyoh,hn ul,myo?ik rbvldphpq.pm wmfsovnibgnlm.w vwz.kwsbcsn hskswqhhnunlmqq.czk p.bk,rm.tneskszzmvvzaphej?kupzrbuldphbjst ktvxs-m pnvn ul,mpo vzvmowru hykepikqbbw-mltbqupbsm bttb q.inl.oh?t?.r?hbjo mcmmuvzcwlmf?kbsqa?lioa,jbwymlhycarr?hag,-rhhegnnnyh.qckqplveo rkglbfpmni.b,zbkk,pbpp?ibgnjlk rbql.kv,vnsnwt,ynjlk,ukbkt?wcprgm prbp?nbmmosyjktvxwytki.fnorlljbfs,k.btdrtwmqbvp pemjobrkk,pbpp?ibgrkv emho?mlj,xsk,azmr?z kx,ys?m wmcromzzmfs  lkb?n upheqawqk vnzkyt,bnsk.z,r-ny,?hyq,rm?mzg.mr?hejo mcmmuoemsm gkkoa,mkckplvmpsbr?hsqarr hejo m pr.novohugapikqbbw-mqw bd-m prbztdtvt-n?n prtkkbzhognoroqpccpqkprtskbzhbjskcynvpw-uplmy??xk ukqsm pr.ncuzhsqdru hugapmsidgn ua.mho?m-wmp?myehnfel.nmq?ntbkqabalbsm btz k?abczmmmmjs?rklrfwnn mqbczm prbu?rl,mvo-xkzroot.tvtbppszzrbd-mjgmvvlbkn q.kbsmagns,yw grkqpiqbfpm ixgnt.nzrcbpqklrx? vzvmv?kbsibbqlc-mmh??mcpvevkbsmgbuldphbjskyl.bbtaywhzgo-c?mmqtkqp-,vwz.kglbcsn hegnsr?mmjwruwbmts-,w-rbcsn hbjs-rklrcrkasiynny, hucepmoqrfnt.k-nk,kljhbjo m pvunyn q,pkkcylrtnr,oemuvlywhucepmlh.gfkotzbjnzskn gso,xhlanl.ohbjo mrwdgayzpvbb?qm prb-p,.tr-nmgk,ugn.rzxygkkszzmvvpm.m,rzpjk.uczwmywbb-p t.ubt?,xhbjskrlzbjji'
>>> breakcipher.decode(_)
'lincoln'

Success! I have functions now to encrypt, decrypt, and use frequency analysis to forcibly extract a key. Given a message which I know has been encrypted a particular way, I am one short script and a few seconds of waiting away from getting the actual secret message. Not too shabby for a couple night of screwing around in Python, I'd say.

Anyone interested in my (probably sloppy and amateurish, definitely uncommented) code can find it here: 

http://dl.dropbox.com/u/32594622/simplecrypto.tar

If you actually read all this, I salute you. This has been an episode of "Zach screws around with computers", and I had a lot of fun with it. If nothing else it kept my mind occupied through a lecture or two. If you can suggest some good websites/books about crypto, I'd appreciate that, and if you have comments or questions, I'd be happy to answer them.

No comments:

Post a Comment