Thursday, January 31, 2013

Sky-entific Method

What's black and blue and red in-between, never touched but always seen?

The sky.

Someone gave me this riddle earlier today, in the sort of coincidence that would make me extremely skeptical if it weren't me that it happened to. Because I'd been working on this post, I was able to answer pretty quickly. Thanks to this post, I was not eaten and instead used the magic ring I found to escape.

So, the sky is black at night, blue in the day, red between the two. Why? This is a classic children's question, and most of us probably know the answer. In increasing order of scientific sophistication, the answers could be:

"Magic."

"Because of light." (my roommate said this when I asked)

"The air scatters blue light more than the other colors, so the sky looks blue."

"Short wavelengths (like blue) are scattered more by the air, so those wavelengths are what we see."

"Light in the air undergoes Rayleigh scattering, which scatters short wavelengths more than long ones."

You might note that I allowed an entire level of scientific sophistication for knowing the name of the scattering process. This is unfair, since knowing a name is worth absolutely nothing. Really, all the additional points are from being able to mentally hyperlink to the information in that Wikipedia article. If you tell me "Rayleigh scattering" and I say "what's that?" and you don't know, you don't actually know anything about it and then I'll get mad at you.

In any case, Rayleigh scattering. It's a process that occurs when light encounters particles that are much smaller than its wavelength. Visible light is on the order of 500 nanometers. A molecule of nitrogen (N2) is a couple of angstroms across, and an angstrom is .1 nanometers. Since nitrogen makes up about three-quarters of the atmosphere, it looks like Rayleigh scattering is just the right process for our purposes. When light encounters these particles it "scatters"--that is, it gets redirected into different directions, by an amount that depends on what kind of light it is. The sky has any color at all because--even though the only direct source of light is the sun--scattered light comes down to you from every direction in the sky.

Here's the problem, or the puzzle at least. This is the formula for Rayleigh scattering (I_0 is the incoming intensity, θ the scattering angle, R the distance from the particle, n its index of refraction, and d its diameter):

I = I_0 * (1 + cos^2 θ) / (2R^2) * (2π/λ)^4 * ((n^2 - 1)/(n^2 + 2))^2 * (d/2)^6
I want you to ignore everything about this except the λ part. λ means wavelength--the distance between two peaks in the light wave. This is an inverse relationship, so as λ increases, less light is scattered. Longer-wavelength light is red/infrared/microwave/radio wave. Shorter-wavelength light is blue/violet/ultraviolet/X-ray and so on. But it's not just λ--it's λ to the 4th power. If λ doubles, the amount of light scattered drops to 1/(2^4) = one sixteenth. That's a very strong relationship, and it's why the short wavelengths aren't just scattered a little better than the long ones. They dominate.
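To make that concrete, here's a quick back-of-the-envelope check (a little Python sketch, not the Mathematica that produced the figures below): how much more strongly is 400 nm light scattered than 700 nm light, with everything else in the formula held fixed?

ratio = (700.0 / 400.0) ** 4      # the lambda^-4 factor, everything else held fixed
print(round(ratio, 1))            # about 9.4 -- the shorter wavelength wins by nearly a factor of ten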

If short wavelengths scatter better than long ones, why is the sky blue? That is, shouldn't it be violet, since violet has an even shorter wavelength and so scatters even better? This problem is based on a discussion I had with some of my astronomer friends, which was in turn sparked by this xkcd.

The most prominent theory is that violet light is scattered better than blue light, but that our eyes are not as sensitive to violet as blue, so we don't see all that scattered violet light. (I say "violet" rather than "purple" because I think they're technically different. I am not an artist or color scientist though. "Violet" is also a more fun word than "purple.") We also had a competing theory, which was that the difference is simply that there's a lot more blue light than purple to start with in sunlight, so when it gets scattered--even if it doesn't get scattered as much--we still end up with a blue color overall. Can we answer this question...WITH SCIENCE?

Well, I hope so, since we're not going to get this answer any other way. I did try to meditate on it but never got my brain clear enough to run Mathematica in my head.

Our first step: what sort of light comes out of the sun? The sun pumps out a lot of energy, in all different frequencies (frequency and wavelength are in one-to-one correspondence, so if I have a particular frequency, I also have a particular wavelength, just so we're not confused). You might think this would be a really complicated problem. After all, why does the sun shine? Nuclear fusion and stuff, right? Are we going to have to work out all the characteristic emissions of the nuclear processes in the sun, or what?

Thankfully, no. Physicists have a law that dictates how most objects radiate when they are hot. That is, an ideal system ("black body") with nonzero temperature (i.e., it has energy) will give off photons across a whole range of energies. Obviously, the hotter it is, the more likely higher-energy (shorter wavelength/higher frequency) photons are. The exact science of how we calculate this is a bit too in-depth for me to derive here, but it was a major result in early quantum theory. In fact, it was the first theoretical result that suggested such a thing as a "quantum" (smallest possible piece) of light. (Before this, the math predicted an infinite amount of high-energy light. We call this the "ultraviolet catastrophe," which would be an excellent band name.) The formula is this:

B(λ, T) = (2hc^2 / λ^5) * 1 / (e^(hc/(λkT)) - 1)
Quick run-down here: h is Planck's constant (Max Planck discovered this formula), λ is still wavelength, c is the speed of light, and k is the Boltzmann constant, which is always used when you need to go from temperature to energy. This is a tremendously versatile law, by the way--plug in human body temperature and you'll find that you emit in the far infrared, hence the use of infrared light for night-vision.
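If you want to play with this yourself, here's a minimal sketch of Planck's law in Python (the linked notebook does this in Mathematica; my constants are rounded, so the last digits may wobble). The sanity check is the body-temperature example just mentioned.

import numpy as np

h = 6.626e-34    # Planck's constant (J s)
c = 2.998e8      # speed of light (m/s)
k = 1.381e-23    # Boltzmann constant (J/K)

def planck(lam, T):
    # black-body spectral radiance per unit wavelength at temperature T
    return (2.0 * h * c**2 / lam**5) / np.expm1(h * c / (lam * k * T))

# sanity check: a roughly 310 K human should peak in the far infrared
lam = np.linspace(1e-6, 50e-6, 100000)
print(lam[np.argmax(planck(lam, 310.0))] * 1e6)   # about 9.3 microns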

We're going to use this to figure out the spectrum of the sun. How hot is the sun? Wikipedia tells me it is about 5778 K. (Temperature in thermodynamics = always in kelvins. It's also not "degrees kelvin" but that's fine, I make that mistake all the time too.) I plug that into Planck's Law and this is what part of that graph looks like:




The y-axis has units of--deep breath--watts per square meter per steradian per unit wavelength. We're not going to worry about that, though, which I'll talk about in a second. What's important is that this is a pretty good match for the actual spectrum of the sun. You'll note that it's at its highest (this is only a part of the function, of course, but the black-body radiation only peaks once) at about 500nm, or a little higher. In fact, the "visible spectrum" is roughly 400nm - 700nm. Your eyes are fantastically well adapted to the particular star they evolved around; aliens testing your color sensitivity could probably guess the temperature of the sun. So this is a good match for everyone's favorite star.

Now the question becomes, what about the scattering intensity? Well, I posted the formula for Rayleigh scattering above. Can I just put my numbers through that?

It turns out that that's actually more complicated than we need. To do that, we'd need to calculate the actual intensity of sunlight falling on Earth--our black-body spectrum accounts for the energy coming out of the sun per unit area of the sun's surface, which is really different--and then we'd have to define other parameters, like the angles involved, our distance from the scattering (...how far are you from the sky?), etc.

But consider that all of these things are a matter of multiplication. That is, I'd multiply by the surface area of the sun. Then scale by the fraction that falls on Earth. Then multiply by the molecular polarizability of N2. Then multiply by...etc. What will all of this do to the graph?

The answer is: very little. It'll change the dimensions of the y-axis. What was once "1" might become "2." Or "3." Or "4.232343...", but none of these actually matter. The question we're actually concerned with only needs the shape of the graph to stay the same, and changing the scaling (or the units) of the y-axis will not actually affect that. The peak will still be the peak.

So here's what I did: I took our black-body spectrum, call it B(λ). And I made a new plot, of B(λ)/λ^4. There is some constant in front that we'd have to multiply to get a real, quantitative intensity. But, we don't really want that. We just want to know the shape of the spectrum. Sometimes the most valuable scientific instinct is knowing what not to worry about.
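For anyone who'd rather not fire up Mathematica, here's a rough Python equivalent of what the notebook does at this step (my constants and wavelength grid are approximate, so expect the peaks to land within a nanometer or so of the quoted values).

import numpy as np

h, c, k = 6.626e-34, 2.998e8, 1.381e-23
T = 5778.0                                   # solar surface temperature (K)

lam = np.linspace(150e-9, 1500e-9, 200000)   # wavelengths (m)
B = (2.0 * h * c**2 / lam**5) / np.expm1(h * c / (lam * k * T))   # black-body spectrum
scattered = B / lam**4                       # Rayleigh-weighted spectrum, arbitrary overall scale

print(lam[np.argmax(B)] * 1e9)               # roughly 500 nm: the Planck peak
print(lam[np.argmax(scattered)] * 1e9)       # about 276 nm: shifted well into the ultraviolet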



We see here the peak is shifted very far back--Mathematica tells me that it's now at 276.45 nm, well into the ultraviolet. In reality, much of that may be mopped up by the ozone layer; I don't actually know. However, it seems that the visible spectrum will indeed be dominated by the very low wavelengths. That is, it will be more violet than blue. I feel that this lends considerable credence to the theory that the sky is "actually" violet, in that its spectrum peaks there. If you saw only the light near 400nm, you would see violet--but your eyes don't respond to that range as strongly, so (hypothesis:) that's why the sky is blue.

But we don't have to stop there. If this is the actual spectrum, shouldn't it correspond to the color of the sky? Can we check that? (Scientific method note: got a theory? Make a prediction, and test it. That's pretty much the whole thing.)

That's a fascinating question. Color vision is exceptionally interesting, and it's important to remember that your eyes are not a spectrometer. You don't see a spectral curve, or the peak color of a spectral curve. The basic way this works is that you have three kinds of light-sensitive retinal cells in your eye, called cones: S, M, and L. Each of these responds to different wavelengths, with peaks in the blue, green, and red respectively. However, importantly, they respond in a curve--red might peak at 575nm, but it responds pretty well to 550nm and even 525nm. This is what the response curves look like, from Wikipedia:


[Figure from Wikipedia, "Cone-fundamentals-with-srgb-spectrum.svg": the S, M, and L cone response curves across the visible spectrum.]

These are not mathematical functions, and I can't work with "pictures on Wikipedia" formulaically. But I'm modeling a phenomenon, so I'm going to pick some mathematical functions and pretend those are the response curves! Science! (Science recipe: make stuff up and if it doesn't work, make up different stuff until it does.) To me, those look like bell curves. Or Gaussian distributions, or normal distributions, or whatever you want to call them. So I made my own curves with the same peaks, similar widths, and I played around with the amplitudes so they'd all have a maximum value of 1.0.
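Here's a sketch of that construction in Python. The peak positions and widths below are eyeballed guesses, not the real cone fundamentals; the point is just three bell curves, each with a maximum value of 1.0.

import numpy as np

def bell(lam, center, width):
    # Gaussian with peak value 1.0 at lam == center
    return np.exp(-0.5 * ((lam - center) / width)**2)

lam = np.linspace(380e-9, 750e-9, 2000)   # the visible range, roughly
S = bell(lam, 445e-9, 25e-9)              # "blue" cone  (peak and width eyeballed)
M = bell(lam, 540e-9, 35e-9)              # "green" cone
L = bell(lam, 570e-9, 40e-9)              # "red" cone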



Why, you might ask, would you do this, Zach? And I would say that this is the key to converting a spectrum of light into a perceived color. And you might still ask why I would do that, but I suspect you wouldn't be reading if you were prone to that sort of feeling. But seriously, this is the key. What I did was multiply the spectrum by the sensitivity curve. I think a diagram will make this most clear:



The black line here is the portion of the scattered spectrum that falls in the visible range, and the green line is the sensitivity curve for green that we constructed earlier. The purple line is formed by multiplying the two together. If there's not much scattered light at a wavelength but the cone is very sensitive there, that wavelength still contributes some response, and vice versa. I take the area under the purple line and I call that the "green response." If I repeat this with the other two cones, I have a guess at how they will respond to a given spectrum.

Now, the Red-Green-Blue color model expresses every color as a mixture of those three base colors. I calculate the largest value the spectrum reaches on the visible range, and I ask what the response would be if the spectrum sat at that value across the entire visible range. This gives me a maximum possible response for each cone. (I need to do the "largest value" thing so I can have a scale for the problem.) Then, I divide the real response by that maximum one, and I have my RGB value for that color. There's a bit of finagling here. RGB is a computer color model, and not necessarily exactly how the biology of the eye actually works. Again, we're throwing science at the wall and seeing what sticks.
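Here's that whole multiply-integrate-normalize step as a sketch, continuing from the cone curves above (it reuses lam, S, M, and L and rebuilds the scattered spectrum on that same visible-range grid; treat the normalization as my reading of the procedure, not gospel).

import numpy as np

# reuses lam, S, M, L from the cone-curve sketch above
h, c, k, T = 6.626e-34, 2.998e8, 1.381e-23, 5778.0
B = (2.0 * h * c**2 / lam**5) / np.expm1(h * c / (lam * k * T))
scattered = B / lam**4                        # Rayleigh-weighted spectrum on the visible range

def response(cone):
    # area under spectrum x sensitivity: how strongly this cone fires
    return np.trapz(scattered * cone, lam)

def max_response(cone):
    # response to a flat spectrum pinned at the spectrum's largest visible value
    return np.trapz(scattered.max() * cone, lam)

rgb = [response(cone) / max_response(cone) for cone in (L, M, S)]   # red, green, blue in [0, 1]
print([int(round(255 * x)) for x in rgb])     # one member of the family of sky colors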

Finally, I have one more problem that's worth addressing. Remember how we threw away the intensity scaling at the beginning, because it was too difficult? Well, it's back to bite us now. You see, this tells us the ratio of red to green to blue--but it doesn't tell us anything about their actual values per se. For example, I can't tell the difference between 255,255,255 (white) and 0,0,0 (black). What I'm lacking here is brightness, essentially.

In some sense this is a failure of the method. But it can actually still be useful. What we have here now is a range of colors, instead of just one, because I can always scale my base ratio to get a new color with the same ratio. If I have 1,2,3 then I also have 50,100,150, for instance.

Enough talk. I'd like to show you the colors I got for our spectrum. This is a range of five of them, from dark to light.



:)

Looks like the sky to me. You can even see the gradient, which is noticeable during the day. Sky near the sun has more light in general and looks like the lighter blue; sky far from the sun is lit only by scattered light, has less light in total, and so looks like the darker circles.

Can I recap? Can I talk about how awesome this is for a second? What'd we know at the start? We knew how hot the sun was. We knew the dependence (not even the exact factors) on wavelength of Rayleigh scattering. And we knew how our eyes responded to light. And yet, when I wake up, and go outside tomorrow morning, I'll be able to look up and say..."yeah, that makes sense." I find the extrapolation of reality from these sorts of starting variables and logic to be probably the coolest thing in the whole world.

Some caveats here. I was always a believer in the biological explanation for why the sky is blue--that is, I thought it would come down to receptors and not an increased amount of blue light in the starting spectrum. Along the way, you can see that I made many approximations and guesses. (And I don't even show you all the things I tried that just did not work at all.) With this much tweaking, it's possible my expectations and desires influenced the result, which is why it's important that I've told you what I did and why I did it at every step.

Finally, just because I came up with an explanation doesn't mean that it's the right one. Maybe my ozone comment earlier is quite important--maybe in reality there's a totally different scattered spectrum because the ozone absorbs too much short-wavelength light before it is scattered. (You can look this up if you want. Exercise for the reader.) If you have a criticism or an idea of how it could be done, I would absolutely love to hear it. Mostly I'd love to know that someone read it closely enough to critique it, but scientific curiosity is also welcomed and all that.

When I have children and they ask why the sky is blue, I'm going to knock that one out of the park.

As always, the computer file used in computing these figures is available publicly. You will need Mathematica to run it. I was using Mathematica 8, but I know nothing about its backwards compatibility. (Or forwards compatibility? My school doesn't have Mathematica 9 yet.)

http://dl.dropbox.com/u/32594622/solarspectrum.nb

Sunday, November 18, 2012

Math In The Time of Cholera

I got enough nice comments on the last post that when I was struck by a piece of curiosity I decided to go for another. To the extent that this blog has a purpose, it is essentially going to be my tinkering, which may or may not involve programming, but will (I imagine) pretty much always involve some form of mathematical modeling.

This week, I've decided that what's cool is evolution. In particular, antibiotic resistance. We all know that antibiotics, as they are (over)used, eventually fail to work. As bacteria are exposed to them, they evolve around them, learning to survive the deluge of drugs. Although I understand that this is a serious health problem for many, somehow I can't help but feel a smidgen of respect for these microorganisms, or at least for the machinery of life that enables their adaptation. As a great chaos theorist once said, "Life finds a way."

Disclaimer: Last time, I put up a disclaimer saying I didn't know anything, really, about cryptography. I would like to similarly note that I'm quite ignorant with respect to microbiology. At least with cryptography, I was staying in the realm of my academic training, which is highly mathematical. What I'm doing here is mathematical, but it's really a biological application, and the math is just how I approach it, because that's what I'm comfortable with. I worry that here I enter a long tradition of physical scientists who think they know what they're talking about. Hopefully I can get away with it by being self aware: I am not a doctor, I am not pre-med, I am not in the life sciences. The analysis and programs that follow are ridiculously simplistic and took me only a few hours, and are not to be taken seriously by anyone, including me. Feel free to point out all of the failings in this analysis.

So, anyway, we all know what antibiotic resistance is. When I examined the model I had in my head of antibiotic resistance, though, I discovered that it didn't really hold up, probably because I constructed most of it when I was ten. In my head, I imagined two kinds of bacteria--"resistant" and "vulnerable." When exposed to antibiotics, the resistant ones lived and the vulnerable ones died. Afterwards, only the resistant ones were around to produce offspring,  so the population as a whole would be resistant.

The flaws in this model come about when I considered the frequent advice that we hear: always finish your antibiotic regimen, because that helps prevent resistant bacteria from arising. This doesn't make a whole lot of sense in the context of my previous mental model. If resistant bacteria are all already resistant, then taking more antibiotics (or antibiotics for longer) won't affect them. In fact, arguably, you'd just be applying more selective pressure to the population, leading future generations to be even more resistant. 

The problem arises because I conceived of bacteria as falling into two categories, a binary distinction between "dies when exposed to antibiotic" and "lives when exposed to antibiotic." Perhaps, instead, it would make more sense to talk about probabilities. After all, often in toxicity we hear about "LD50" -- the lethal dose that kills 50% of a population. No antibiotic is 100%, right? Even bleach, alcohol, and other chemicals so strong that bacteria just straight-up do not evolve around them advertise "kills 99.9% of germs!"

In that last paragraph, I switched from talking about words ("resistant" and "not resistant") to talking about numbers ("how resistant?") which is my personal sign that I could analyze this for real. My goal was this: I wanted to watch a virtual population of bacteria evolve antibiotic resistance. 

The first part of any mathematical modeling is abstraction. For example, if I calculate the trajectory of a baseball, I will treat the baseball as if it were a point mass. I will ignore air resistance and how fast it spins in flight and the gravitational effects of the moon. Of course, these things can factor into an analysis, if I need a very detailed or comprehensive answer--but in general, why worry about them? Similarly, I don't need to do a real simulation of a human body full of bacteria, or to worry about how the antibiotics get there, or anything like that. I want to abstract out so all I have to worry about is the actual resistance, since that's what I'm interested in.

To do this, I organized my program like so: I have a list, called Petri. Petri is a petri dish, except obviously not a real one, because doing biotech in my bedroom is a good way to give myself a terrible disease by accident. Petri is a list of objects which are just called "cells." (An aside, to any programmers reading: this was my first-ever foray into the object-oriented world, with "cell" the first class I've ever defined. I had read about the concept and so I had some idea of what I was doing, but still. Milestone.) A cell has two associated variables. One variable is "alive," which is either true or false and tells me whether the cell is alive, naturally. The other variable is "resistance." Resistance goes from .05 to .95, and represents the chance that the cell survives a dose of antibiotic.
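A minimal sketch of that setup in Python (the names are mine, the real code is in the tarball at the end, and the starting resistances here are placeholders until the distribution discussion below):

import random

class Cell:
    def __init__(self, resistance):
        self.alive = True                 # is this cell still alive?
        self.resistance = resistance      # chance of surviving one dose, between .05 and .95

DISH_SIZE = 1000
petri = [Cell(random.uniform(0.05, 0.95)) for _ in range(DISH_SIZE)]   # placeholder start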

Resistance is an exceptionally abstract statistic, and I'm somewhat proud of how much information it carries in it. Everything about the interaction of the cell with the outside chemical is contained in that number. If you took a larger volume antibiotics, all that would do is change resistance. If you took them more frequently, I would just have to apply a dose more frequently. It doesn't matter how the antibiotic works, or how it's distributed, or anything. One thing I have thought of that it doesn't account for is this: suppose all the drug does is stop replication, rather than killing? That bacterium still takes up space on my virtual petri dish, but doesn't replicate (presumably eventually dying of, I dunno, entropy). To work that in, I'd have to allow for bacteria that don't die but also don't replicate, and so continue to take up space for a little while, etc. An interesting future extension, maybe.

Ok, so now that I have my list of cells, what do I do with it? Well, I can "dose" the dish, which makes every bacterium draw a random number between 0 and 1; if that number is higher than its resistance, it dies. I can also call a function called "generation." Generation means that every bacterium that is still alive makes a copy of itself into one of the empty spaces. (The number of total spaces is fixed and initially every space is full of bacteria, so the population is limited by its starting total.) I also want to be able to keep track of how the population is doing, so I have a function that counts the number of remaining live cells and another that calculates the average resistance of all the cells, which is interesting because this should increase as doses continue.
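In code, those operations might look something like this--again a sketch built on the Cell/petri setup above, not the actual program:

def dose(petri):
    # every live cell survives a dose with probability equal to its resistance
    for cell in petri:
        if cell.alive and random.random() > cell.resistance:
            cell.alive = False

def generation(petri):
    # each survivor copies itself into one empty (dead) slot, while slots last
    empty_slots = [i for i, c in enumerate(petri) if not c.alive]
    random.shuffle(empty_slots)
    survivors = [c for c in petri if c.alive]
    for parent, slot in zip(survivors, empty_slots):
        petri[slot] = Cell(parent.resistance)

def live_count(petri):
    return sum(1 for c in petri if c.alive)

def average_resistance(petri):
    live = [c.resistance for c in petri if c.alive]
    return sum(live) / len(live) if live else 0.0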

There are several arbitrary aspects of the simulation I'd like to talk about, because there are parameters that I essentially pull out of thin air. One of these is the generations/dose ratio. How many generations pass between doses? A thousand? Two? Ten? Manipulating this turns out to be very interesting, more on that later. Another is the initial resistance distribution. I have to pick a resistance for these guys to start with, after all, before any selection can occur. Also, they'll have to have different resistances, because otherwise there is no natural selection.

For the trials I'm about to show you, the generation/dose ratio was 2/1. That is, my program counts time in steps, t = 0, 1, 2..., and there is a new generation every step, and a dose of antibiotic applied at every even step. Again, this is arbitrary. I also started with an initial resistance distributed exponentially, with the average resistance at .15 or so. Also, there are 1000 spots in my petri dish. (This is arbitrary too, but it needs to be large enough that statistics works, but small enough that the program doesn't take too long to run. It is, therefore, limited mainly by patience, and more would be better.)
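Putting those numbers together, the main loop is about this simple (a sketch: random.expovariate(1/.15) gives an exponential distribution with mean .15, which I clip into the allowed resistance range, and the 200 steps is just an illustrative run length):

petri = [Cell(min(0.95, max(0.05, random.expovariate(1 / 0.15))))
         for _ in range(1000)]                       # exponential start, mean about .15

history = []
for t in range(200):
    generation(petri)                                # a new generation every step
    if t % 2 == 0:
        dose(petri)                                  # a dose every other step: 2 generations per dose
    history.append((live_count(petri), average_resistance(petri)))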

Anyway, here are the results. First, let's look at a chart of how many cells were alive:

Notice what happens here. First, the population drops extremely quickly. It makes a few small rebounds and is driven further down--and then there's a turnaround, and the bacteria win from there on out, their growth outpacing the antibiotic. Eventually they reach saturation, and every dose of the antibiotic does indeed drive them down--but their reproduction is fast enough to recover before another dose comes. Notably, they are consistently driven down to about six hundred or so. Let's look at what happened to the average resistance during the simulation:

First of all, it obviously rose quite a bit--remember, our starting average was on the order of .1, and we ended around the .6 mark. That aligns perfectly with the survival rates we saw at the end of the simulation. But why didn't it go higher? Well, the fact is that the simulation only contains the resistance values it started with, and it appears that eventually every surviving cell shared a common ancestor with that resistance, so it stops there.

That reveals a flaw in the program that goes with another flaw I had already noticed and decided not to tell you about. First: no new traits emerge in this. Natural selection operates on the resistances that are already present, but if there was no cell with resistance .9 in the sample, nothing will ever generate a resistance that high. Second: I don't think I have a very good distribution. Every biological trait I can think of tends to be distributed normally--on a bell curve, that is. But if I had a bell curve centered on .1, I might not have any cells of high resistance at all, and then my colony could never evolve it, so I cheated and used an exponential distribution.

Fixing one of these problems requires fixing the others, so here's what I'm going to do: First, I am replacing the exponential distribution with a bell curve centered on .15, with a standard deviation of .03. (On a normal distribution, 99.7% of samples lie within three standard deviations, so virtually all of our bacteria will lie between .06 and .24.) Second, I am introducing a new stat that each cell carries called "mutability." A cell's mutability ranges from 0 to .1, and when that cell reproduces, its daughter cell's resistance is modified by a random percentage of its mutability in either direction. That is, a cell with mutability .1 and resistance .2 will have daughters equally likely to be anywhere from .1 to .3. (I could make this a bell curve as well, I suppose, but it's not as crucial.) This will allow the population to produce resistances not present in the initial sample as the offspring may mutate to have better resistance than their parents.
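Here's what those two changes look like as a sketch (the clamping choices are my guesses, and the generation() function from the earlier sketch would now build daughters with daughter(parent) instead of copying resistance directly):

import random

class Cell:                                           # as before, plus mutability
    def __init__(self, resistance, mutability):
        self.alive = True
        self.resistance = min(0.95, max(0.0, resistance))   # cap at .95: no perfect resistance
        self.mutability = mutability                        # between 0 and .1

def starting_cell():
    # bell curve centered on .15 with standard deviation .03
    return Cell(random.gauss(0.15, 0.03), random.uniform(0.0, 0.1))

def daughter(parent):
    # the daughter's resistance is nudged by up to +/- the parent's mutability
    shift = random.uniform(-parent.mutability, parent.mutability)
    return Cell(parent.resistance + shift, parent.mutability)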

Now, with our new simulations, here are the outputs:


Oh man. Ok, I kind of make a life out of geeking out about this stuff, but still, you gotta admit this is cool. First, look at the cell counts. Much like the others, it drops precipitously, recovers, reaches saturation, and oscillates as dosing continues. Unlike the others, it doesn't get stuck. Every time the dose applies, another round of natural selection occurs, until eventually the population is maximally resistant--which I set at .95, because it didn't seem all that realistic to allow perfect resistance. (In development, before I put that limit in, the resistance would go to 1.) Especially cool is the average resistance chart, because we started with a very tightly clustered distribution, but it doesn't take very long before every member of the population is more resistant than the entire original generation, which I think is pretty sweet.

Alright, until now, I've only shown you depressing graphs in which disease is nearly beaten back and then roars to a maximally-resistant dominance over feeble medicine. Luckily for you and me as humans, this is not inevitable. The variable I'm going to tweak is the generations/dose ratio, and I explored this by trial and error. If we flip it--two doses per generation--the bacteria die out in less than a dozen generations. I wanted to find something close to the tipping point, so I experimented a bit more, and it turns out we were never far from it--if there are three generations for every two doses, instead of four, the antibiotic wins. This only takes twenty-something generations, because there is no dramatic turnaround--only the cold hard victory of medical science:


Note that the resistance still goes up--the bacteria are, in fact, evolving--but because we finish our antibiotics they die and their resistance never escapes. If you stopped taking antibiotics, you would indeed potentially birth a new population of more-resistant (if probably not totally-resistant) bacteria. Repeat this a few times and you could well end up with maximally-resistant germs, at which point you are well and truly screwed.

Since I'm so big on the power of mathematical abstraction, let me point something out that represents the other side, that is, the flaws in this. My "doses" and my "generations" are quite abstract--in real life, antibiotics have a continuous effect if you're taking them constantly, and bacteria don't all divide in lockstep. This is fine for my extremely-simplistic model, but it means that ideas like "doses per generation" don't necessarily translate into real-life terms very well. In other words, don't read "doses per generation" as "pills per hour." Still, the basic concept--you need enough drugs to overcome their reproduction, and you need to stay on them long enough to actually wipe out the population--is there.

I want to point out that there are several other interesting things I could manipulate here--one would be starting resistance, one would be mutability range. Biological modeling is a real field, obviously, and I don't really have time this afternoon to reinvent it. There is one other experiment I do really want to run, though, and it didn't occur to me until I was typing this. I want to make mutability, uh, mutable. Before, a cell was exactly as mutable as its parent--now, I'm going to modify that number by saying that each cell's mutability can change...by its mutability. So mutability could as much as double, or it could drop all the way to zero. I don't know what will happen here exactly, but I have a hunch I'm highly interested in.
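Concretely, the only change to the sketch above is in how daughters are made (whether to clamp mutability back into the 0 to .1 range is my guess):

def daughter(parent):
    # as before, but now mutability itself drifts by up to +/- the parent's mutability
    new_resistance = parent.resistance + random.uniform(-parent.mutability, parent.mutability)
    new_mutability = parent.mutability + random.uniform(-parent.mutability, parent.mutability)
    new_mutability = min(0.1, max(0.0, new_mutability))
    return Cell(new_resistance, new_mutability)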

Neat. Super neat. Check it out: mutability at first increased, which makes sense because when you're under attack by terrible deadly chemicals it makes sense to be flexible. These cells found it favorable to be changeable, in other words. But at some point, the cells were as resistant as they were going to get, and at that point, mutations became a detriment. If there's nowhere to go but down, you're better off staying still. So mutability fell and continued falling for the rest of the simulation. (This is what I expected and so it makes me feel really cool, though I could easily lie about that. You'll have to trust me.) Again, we see the power of abstraction: real mutability presumably depends on, I dunno, DNA transcription error correction and stuff that I have no idea about. But because I looked at behavior (rather than mechanisms of behavior) I can still see these trends even though my "cells" are nothing like the real thing.

Natural selection: ain't life grand? That's kind of how I feel at the end of this project. I wanted to watch simple digital evolution, and I did. I hope you enjoyed coming along for the ride, if you've read this far.

Once again I'm attaching a tarball of my Python code. It is not very fast. (I can see ways to optimize it but don't really need to at this point, because, ah, I'm done with it for right now.) It is probably not very good. But, here it is, if you are interested.

http://dl.dropbox.com/u/32594622/antibio.tgz

wait a minute don't i have a test tomorrow

Tuesday, November 13, 2012

Codes and Keys

You know what's cool? Cryptography. Cryptography is SO DAMN COOL. Cryptography is cool, I think, because it sits in this really neat spot between math and language and computer science. And, of course, if you're like me, you like playing with ideas of information and all that. The idea that a text could be full of information utterly unreadable to most people is fascinating. I don't remember not being able to read, because I learned to read very early. But there's something enchanting about the familiar symbols of literacy--word, space, letter--twisted into something incomprehensible to you. Simple cryptograms, the kind you get in newspaper puzzle columns, have these disparate patterns that some part of the mind latches onto, trying to twist the letters back into the words you know and love--is that "ads" a "the"? Would that explain the "adsks" later on in the sentence?

Of course, if you think the idea of turning information into apparent gibberish and back is interesting when you have the key, I think you probably can't help but be enchanted by the idea of breaking codes. I recently learned to pick locks; the psychological drive is no different. Actually it's not that different from scientific curiosity. This is a pattern, there is information here--how can I tease it out? Pushing pins, analyzing letter frequencies, solving equations: these are just more ways of me asserting my mind on the unknown or (for locks) disallowed. Anything I don't understand and anywhere I can't go are limitations, and what can you do with a limitation but try to think your way around it?

Enough philosophizing, let's talk crypto.

Disclaimer: I don't actually know a whole lot about cryptography, I'll admit. I have read Neal Stephenson's Cryptonomicon, and that's probably the most advanced text I've gotten on the subject. I once read Dan Brown's Digital Fortress but I think that counts negative. Cryptographers: if I misuse terminology in this blog post, I apologize. This is certainly an example of me just fooling around.

My recent interest in cryptography started in my intro-level Botany class on Monday. (We all need our Gen Eds.) I was bored, and the professor was talking about evolution. I don't know if it was evolutionary algorithms or genetic codes or what, but I started thinking about codes. On the back of last week's exam, I scrawled out a simple rotating cipher. For those who don't know, what you do is write out the message, then write the key underneath it. Then, you "add" the letters together, numbering them 1 to 26. So A + B = C because 1 + 2 = 3. A + I = J because 1 + 9 = 10. If you go over 26, you divide by 26 and take the remainder. (Math people will recognize this as "mod 26" arithmetic.) The message "MESSAGE" encoded via the key "ABC" would look like:

MESSAGE
ABCABCA
NGVTCJF

So "MESSAGE" becomes "NGVTCJF." This code has some good properties. A simple cipher, where you just exchange each letter with one other letter, can be broken easily by picking out common words, or just counting the letters--"E" will appear the most. But in our code, "E" maps not to one letter but to several--so that attack can't be done as simply.

Once I started thinking about this I thought several things, kind of all at once.

1) It would be really easy to teach a computer to encrypt and decrypt messages this way, given a key.
2) It wouldn't be that much harder to teach a computer to break these codes, perhaps. More on the method below.

One more complication: I decided to write this all in Python. This is a complication because I didn't really know any Python until I decided this--I had installed it once and flipped through the beginning of a tutorial, but had never motivated myself to actually learn it. My default programming language is Fortran 90, which is what I use to do research. (yes, yes, Fortran. Laugh, why don't you?) Thinking through how to do this in Fortran, I realized it might be time to pick up a more modern language. I've attached all my code--if it looks hilariously awkward or weird, please blame my inexperience with the language. I have no doubt it could be waaaay more elegant.

The first thing I needed to write was a dictionary. I needed to define a map from letters to numbers. But wait. There's more to writing than letters, right? I mean, for one, there are uppercase and lowercase letters. Also, spaces, periods, commas, semicolons, dashes...you get the idea. I opted for simplicity--all letters are treated as lowercase. I added periods, commas, dashes, question marks, and spaces. Anything that wasn't in those categories got translated to a space. Once those were numbered, I was done--I could take a number and get its corresponding character, and vice versa. I could also put in a string of characters--a word, a paragraph, a page--and get a list of numbers out, and vice versa. That's good! Computers are better at working with numbers than letters, so I needed to be able to convert between them. (All these functions are in dictionary.py, if you want to see the code.)
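Something like this, in other words (the ordering of the characters is my own guess, so the numbers won't match the real dictionary.py, but the idea is the same):

# my stand-in for dictionary.py: letters plus a little punctuation, everything else -> space
ALPHABET = "abcdefghijklmnopqrstuvwxyz.,-? "

def text_to_numbers(text):
    return [ALPHABET.index(ch) if ch in ALPHABET else ALPHABET.index(" ")
            for ch in text.lower()]

def numbers_to_text(numbers):
    return "".join(ALPHABET[n % len(ALPHABET)] for n in numbers)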

Next up, the actual encryption program. This part is pretty simple, actually, once the dictionary was done. Take in a string. Convert it to a list of numbers. Take a key. Convert it to a list of numbers. Add them together, meaning you get a new list of numbers. Turn that into a string. Boom, encryption. Make that addition a subtraction, and it's decryption. Let's see an example. I'll encrypt the first line of the Gettysburg address with the key "lincoln"


>>> encrypt.encrypttext("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.","lincoln")

Which outputs:

'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlyg'

Since I have the key, this is easily decrypted:

>>> encrypt.decrypttext(_,"lincoln")
'four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.'


Notice that this is now in all-lowercase. This is very fixable--all I have to do is expand the dictionary--but I'm not worried about it right now.
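For reference, here's the whole encrypt/decrypt idea as a sketch on top of that stand-in dictionary (it reuses ALPHABET, text_to_numbers, and numbers_to_text from above; since my character numbering is a guess, the ciphertext it produces won't match the one printed above):

def encrypt_numbers(message, key):
    # add the key (cycled) to the message, wrapping around the alphabet size
    n = len(ALPHABET)
    return [(m + key[i % len(key)]) % n for i, m in enumerate(message)]

def decrypt_numbers(cipher, key):
    n = len(ALPHABET)
    return [(c - key[i % len(key)]) % n for i, c in enumerate(cipher)]

plain = text_to_numbers("four score and seven years ago")
key = text_to_numbers("lincoln")
cipher = encrypt_numbers(plain, key)
print(numbers_to_text(decrypt_numbers(cipher, key)))   # round-trips back to the plaintext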

Ok, well, that's cool I guess, if you like nerdy shit. (I do.) But suppose we stole this encrypted message off of one of our enemies, and we want to break it? We could try to see what the most common characters are--the problem is that what means "e" in one place means something different in another, so our letter frequencies won't be right.

Here, we apply a bit of insight. "E" doesn't map to the same letter every time--but every seventh "E" will, since "Lincoln" has seven letters. In fact, if we looked only at every seventh letter, then we'd be adding "l" every time--and that is a very simple cipher to solve, no more difficult than a newspaper cryptogram. While we won't have any particular words we can pick out, we don't actually need our encoded text (ciphertext) to have any meaning--we just need it to display standard English letter frequencies. Since English letters don't have a strong pattern every seventh letter, a "subtext" (my word--I don't know the real crypto word for this) of every seventh character works fine.

So, here's what I came up with: suppose I know the key length. Suppose I know the key is a seven-character string. Then I look at every seventh character. The most common letter in that subtext must stand for the most common plaintext character--which is actually not "E" but the space, sorry for the deceit. Let's say the most common letter is "cipher-space." Cipher-space could be any character in my dictionary, but I know that it is formed from

"Space" + [first letter of the key].
I subtract
Cipher-space - Space = [first letter of key]

And I've found the first letter of the key! To find the second letter, I just add a letter of offset, so I'm looking at the second, ninth, sixteenth...etc. I repeat this until I'm done with the whole key.
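As a sketch (again on top of the stand-in dictionary above, and assuming the space really is the most common plaintext character), the key-recovery loop is short:

from collections import Counter

def find_key(cipher, key_length):
    # for each offset, the most common cipher character must be "space plus that key letter"
    space = ALPHABET.index(" ")
    key = []
    for offset in range(key_length):
        subtext = cipher[offset::key_length]
        most_common = Counter(subtext).most_common(1)[0][0]
        key.append((most_common - space) % len(ALPHABET))
    return numbers_to_text(key)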

Ok, well, that is a lot of nice theory. Does it work? To find out I had to first write a program that would generate letter frequencies from a string. That was neat to program but not particularly interesting, cryptographically, so I won't talk about it. Then, I wrote a program that implemented exactly the procedure I detailed above. Let's see how it does on our earlier example.


'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlyg'
>>> breakcipher.findkey(_,7)
'liscoan'

(The "_" symbol refers to the previous output--so it is just plugging in the ciphertext) Well, that didn't quite work--notably, it did ALMOST work. Here's a problem with our method, then: it doesn't work on short amounts of text. A seven-character key means that our message got cut down to sevenths, and then we tried to do statistics on that. Understandably, that's prone to error. Suppose we try the entire Address?

>>> encrypt.fencrypt("gettysburg.txt","gettysburg.txt","lincoln")
>>> open("gettysburg.txt").read()
'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlygfkp?cmcmmcapmpvtcupqkq.bokt?mnvnnvbqybfl hhbgb vyomyvpbsm bcsn h.cct,yemqaknybmpo vzvmu?kpzvpgwbrohnprkazhqgrtpl,rfkkplvmn?ytkm.fd?rghegnl phzgck,yhnbu?rl,mdo bwmlhwpyoh,hn ul,myo?ik rbvldphpq.pm wmfsovnibgnlm.w vwz.kwsbcsn hskswqhhnunlmqq.czk p.bk,rm.tneskszzmvvzaphej?kupzrbuldphbjst ktvxs-m pnvn ul,mpo vzvmowru hykepikqbbw-mltbqupbsm bttb q.inl.oh?t?.r?hbjo mcmmuvzcwlmf?kbsqa?lioa,jbwymlhycarr?hag,-rhhegnnnyh.qckqplveo rkglbfpmni.b,zbkk,pbpp?ibgnjlk rbql.kv,vnsnwt,ynjlk,ukbkt?wcprgm prbp?nbmmosyjktvxwytki.fnorlljbfs,k.btdrtwmqbvp pemjobrkk,pbpp?ibgrkv emho?mlj,xsk,azmr?z kx,ys?m wmcromzzmfs  lkb?n upheqawqk vnzkyt,bnsk.z,r-ny,?hyq,rm?mzg.mr?hejo mcmmuoemsm gkkoa,mkckplvmpsbr?hsqarr hejo m pr.novohugapikqbbw-mqw bd-m prbztdtvt-n?n prtkkbzhognoroqpccpqkprtskbzhbjskcynvpw-uplmy??xk ukqsm pr.ncuzhsqdru hugapmsidgn ua.mho?m-wmp?myehnfel.nmq?ntbkqabalbsm btz k?abczmmmmjs?rklrfwnn mqbczm prbu?rl,mvo-xkzroot.tvtbppszzrbd-mjgmvvlbkn q.kbsmagns,yw grkqpiqbfpm ixgnt.nzrcbpqklrx? vzvmv?kbsibbqlc-mmh??mcpvevkbsmgbuldphbjskyl.bbtaywhzgo-c?mmqtkqp-,vwz.kglbcsn hegnsr?mmjwruwbmts-,w-rbcsn hbjs-rklrcrkasiynny, hucepmoqrfnt.k-nk,kljhbjo m pvunyn q,pkkcylrtnr,oemuvlywhucepmlh.gfkotzbjnzskn gso,xhlanl.ohbjo mrwdgayzpvbb?qm prb-p,.tr-nmgk,ugn.rzxygkkszzmvvpm.m,rzpjk.uczwmywbb-p t.ubt?,xhbjskrlzbjji'
>>> breakcipher.findkey(_,7)
'lincoln'

THERE WE GO! Awesome, awesome. We took a scrambled text, knowing only the length of the key, and we pulled out the key via frequency analysis. Pretty sweet, imo, but I think we could do better. I mean, why would we know how long the key is? Let's get real here. 

Ok, ok. Well. How do we do this, then? Suppose we thought we knew the key length. We could decrypt it, then, we know that. And then we could check whether we had it right. Obviously, we could do this by just looking at it--but the point of this is to make computers do all our work, isn't it? (It is.) What's the difference between a real, decrypted text and an incorrectly decrypted string of gibberish? Well, actually, we already have the answer: letter frequency! Gibberish doesn't use e's more than q's, or vice versa. Those are properties of English language, and that's what we're trying to end up with.

So, here's our algorithm: first try a single-character key, and determine what letter frequencies that would give. If the difference of that and normal English letter frequencies (drawn from some reference text, I used Project Gutenberg's Beowulf, but obviously there are lots of possibilities for this) is too much, try a two character key. Then a three character. Etc. We do this until the key gets up to some particular proportion of the total length of the text.

There are two arbitrary parameters here. First: how far off of normal English letter frequency is too far? I picked a total 5% deviation, but there's no strong science behind it other than "this works." I tried 1% but found that too stringent. Another question is: how long do we try? At this point my algorithm stops when the key is one-fifth of the size of the message. That's waaaaaaay too big, because that only gives us five characters in each subtext. So far, it stops before it gets that high because it solves the problem. But obviously there's no guarantee of that, and a (much) lower stopping point would be a better idea, but I don't know what. Sounds like a good time to do some math.
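Put together, the unknown-key-length version might look like this (a sketch built on the pieces above; "reference" is a dict of expected character frequencies pulled from some long text, like the Beowulf mentioned, and the 5% cutoff and the length-over-5 stopping point are the arbitrary choices just described):

def deviation(nums, reference):
    # total absolute difference between observed and reference character frequencies
    counts = Counter(nums)
    total = float(len(nums))
    return sum(abs(counts.get(i, 0) / total - reference.get(i, 0.0))
               for i in range(len(ALPHABET)))

def decode(cipher, reference, threshold=0.05):
    # try longer and longer keys until the decrypted text's frequencies look like English
    for key_length in range(1, max(2, len(cipher) // 5)):
        key = find_key(cipher, key_length)
        plain = decrypt_numbers(cipher, text_to_numbers(key))
        if deviation(plain, reference) < threshold:
            return key
    return None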

Again, that's all theory. Let's test it.

'qwctn-pzzrboyqk.rxsymemntbknrwmqd?mqibjs?akj qdru hsqa ukw.bcsv-hpq, vym.vkknkvrynyn q,pkkpzvpgwbrohvpnwvmm vhhmlvqbrpqtknvsom wmvvpm.z,r?-v q,pn ul,mczwmxm.bo?rkk go rohrsdlygfkp?cmcmmcapmpvtcupqkq.bokt?mnvnnvbqybfl hhbgb vyomyvpbsm bcsn h.cct,yemqaknybmpo vzvmu?kpzvpgwbrohnprkazhqgrtpl,rfkkplvmn?ytkm.fd?rghegnl phzgck,yhnbu?rl,mdo bwmlhwpyoh,hn ul,myo?ik rbvldphpq.pm wmfsovnibgnlm.w vwz.kwsbcsn hskswqhhnunlmqq.czk p.bk,rm.tneskszzmvvzaphej?kupzrbuldphbjst ktvxs-m pnvn ul,mpo vzvmowru hykepikqbbw-mltbqupbsm bttb q.inl.oh?t?.r?hbjo mcmmuvzcwlmf?kbsqa?lioa,jbwymlhycarr?hag,-rhhegnnnyh.qckqplveo rkglbfpmni.b,zbkk,pbpp?ibgnjlk rbql.kv,vnsnwt,ynjlk,ukbkt?wcprgm prbp?nbmmosyjktvxwytki.fnorlljbfs,k.btdrtwmqbvp pemjobrkk,pbpp?ibgrkv emho?mlj,xsk,azmr?z kx,ys?m wmcromzzmfs  lkb?n upheqawqk vnzkyt,bnsk.z,r-ny,?hyq,rm?mzg.mr?hejo mcmmuoemsm gkkoa,mkckplvmpsbr?hsqarr hejo m pr.novohugapikqbbw-mqw bd-m prbztdtvt-n?n prtkkbzhognoroqpccpqkprtskbzhbjskcynvpw-uplmy??xk ukqsm pr.ncuzhsqdru hugapmsidgn ua.mho?m-wmp?myehnfel.nmq?ntbkqabalbsm btz k?abczmmmmjs?rklrfwnn mqbczm prbu?rl,mvo-xkzroot.tvtbppszzrbd-mjgmvvlbkn q.kbsmagns,yw grkqpiqbfpm ixgnt.nzrcbpqklrx? vzvmv?kbsibbqlc-mmh??mcpvevkbsmgbuldphbjskyl.bbtaywhzgo-c?mmqtkqp-,vwz.kglbcsn hegnsr?mmjwruwbmts-,w-rbcsn hbjs-rklrcrkasiynny, hucepmoqrfnt.k-nk,kljhbjo m pvunyn q,pkkcylrtnr,oemuvlywhucepmlh.gfkotzbjnzskn gso,xhlanl.ohbjo mrwdgayzpvbb?qm prb-p,.tr-nmgk,ugn.rzxygkkszzmvvpm.m,rzpjk.uczwmywbb-p t.ubt?,xhbjskrlzbjji'
>>> breakcipher.decode(_)
'lincoln'

Success! I have functions now to encrypt, decrypt, and use frequency analysis to forcibly extract a key. Given a message which I know has been encrypted a particular way, I am one short script and a few seconds of waiting away from getting the actual secret message. Not too shabby for a couple night of screwing around in Python, I'd say.

Anyone interested in my (probably sloppy and amateurish, definitely uncommented) code can find it here: 

http://dl.dropbox.com/u/32594622/simplecrypto.tar

If you actually read all this, I salute you. This has been an episode of "Zach screws around with computers", and I had a lot of fun with it. If nothing else it kept my mind occupied through a lecture or two. If you can suggest some good websites/books about crypto, I'd appreciate that, and if you have comments or questions, I'd be happy to answer them.