Science Tutorials


Classification

Classification is a task where someone gives you a bunch of information, and then a label for each bit of information, and expects you to find what patterns in the original information explain the labels.

A British friend of mine was recently explaining the word "twee" to me, which kind of means overly self-serious, saccharine, and cute. To explain this, she gave me a list of things that are twee, with Belle and Sebastian making the pattern pretty apparent. For the rest of the day, any time I heard a song or saw something, I put it into the category of "too-twee" or "not-too-twee" (that is the question), and she told me when I was right or wrong. This is the task of classification. It's like putting all of the twee things on one pile, all of the non-twee things on another pile, and trying to find some kind of wall that separates them. Here is a picture of Thom Yorke doing just that, like some sort of data analysis Moses:

For music, knowing what side of the wall a new song is on might involve asking the questions, "How fast is the tempo?", "Does the singer close his/her eyes when singing?", or "Do I only listen to this music when I'm having a break up?".
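To make the "wall" idea concrete, here is a minimal sketch of one very simple classifier (nearest centroid), which is only one of many ways to do this. The two features (tempo and a "sad lyrics" score) and every number below are invented for illustration. The wall is the set of points that are equally far from the average twee song and the average non-twee song.

```python
# Hypothetical features per song: (tempo in beats per minute, "sad lyrics" score from 0 to 1).
# All numbers and labels here are invented for illustration.
labeled_songs = [
    ((96.0, 0.8), "twee"),
    ((104.0, 0.9), "twee"),
    ((100.0, 0.7), "twee"),
    ((150.0, 0.2), "not twee"),
    ((160.0, 0.1), "not twee"),
    ((140.0, 0.3), "not twee"),
]

def centroids(examples):
    """Average the feature vectors of each label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            total[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in total)
            for label, total in sums.items()}

def classify(features, centers):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centers, key=lambda label: dist2(centers[label]))

centers = centroids(labeled_songs)
print(classify((98.0, 0.85), centers))   # a slow, sad song
print(classify((155.0, 0.15), centers))  # a fast, cheerful song
```

Because the features are not rescaled, tempo dominates the distance here. A real classifier would probably put the features on comparable scales first.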

Clustering (with applets!)

A related problem to classification is the problem of clustering: given data without labels, come up with labels that find spatial patterns in the data. In this scenario, you don't have any labels, just the songs themselves, and you're supposed to put them into groups that fit them well. This was the problem faced by the person who invented the concept of twee, or by anyone that ever says anything like, "You know there are 2 kinds of people..."

As you might guess, the problem is kind of ill defined. If you were left with a bunch of music, how would you group it? How many groups would you make? These are actually problems that haunt scientists working on this kind of thing. On one extreme, you could see all the data as one big group, just a group of songs. On the other hand, each song is different, so you could really have a group for every song. Obviously these cases are not very good, so the best choice must be somewhere between them. The groups that you come up with are called clusters.

Standard approaches to this problem are the k-means and k-centers algorithms. Essentially, these algorithms start by choosing random songs to be the "center" of each cluster, and then assign each song to the cluster whose center it is closest to. Then a new center is chosen in the middle of each cluster, and the songs are reassigned. This continues until there is no change.
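The loop described above can be sketched in a few lines. This is a minimal k-means version (the new center is the mean of its cluster), not the applet's code. The function name `kmeans`, the `seed` parameter, and the tempo values are all assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iters=100):
    """Cluster tuples of numbers into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k randomly chosen songs
    assignment = None
    for _ in range(max_iters):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:     # nothing changed, so stop
            break
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers

# Hypothetical songs described by a single feature, tempo in beats per minute.
tempos = [(60.0,), (62.0,), (64.0,), (120.0,), (122.0,), (124.0,)]
labels, centers = kmeans(tempos, k=2)
print(labels)
print(sorted(centers))
```

Note that you still have to pick k yourself, which is exactly the "how many groups?" problem from earlier.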

Click here to play with it (in a new window)!

Another approach to this problem (which is relatively new) is to treat the songs like candidates in an election. Now the songs each get to vote, and are each potential candidates. Obviously, no candidate represents your own interests more than yourself, so songs are very likely to each spend all of their vote on themselves, which will result in each song being its own cluster, which is not very helpful.

Have you ever seen the Simpsons episode where the two aliens pretend to be the Democratic and Republican nominees for the US election? After being unmasked, the aliens, unworried, say, "It's a two party system! You'll have to vote for one of us!" Everyone looks terrified, but a man in the crowd finally says, "I do believe I'll vote for a third-party candidate!" to which the aliens laugh and respond, "Go ahead! Throw your vote away!" And, just like in real life, the aliens win and everyone suffers.

The creators of this algorithm used that kind of reasoning. Songs should vote for songs that are not only similar to themselves (and therefore good candidates), but they should also vote for songs that have a chance of winning the election at all. So a good candidate should be similar, but should also have a lot of votes from other people. It's a lot like the way a primary works.
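Here is a rough sketch of how that voting can be written down, following the message-passing updates usually given for affinity propagation. "Responsibility" is how strongly a song prefers a candidate over its best alternative, and "availability" is how much support that candidate already has from everyone else. The data, the damping value, the fixed iteration count, and the choice of the median similarity as each song's preference for itself are all assumptions for this example, and it is not the applet's implementation.

```python
import statistics

def affinity_propagation(S, damping=0.7, iterations=300):
    """S[i][k] is the similarity of i to k; S[k][k] is k's preference for itself."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]   # responsibilities
    A = [[0.0] * n for _ in range(n)]   # availabilities
    for _ in range(iterations):
        # Responsibility: how much better candidate k suits i than i's best alternative.
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Availability: how much positive support candidate k has from other voters.
        for k in range(n):
            support = [max(0.0, R[j][k]) for j in range(n)]
            total = sum(support) - support[k]   # support from everyone except k itself
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Each point ends up voting for the candidate with the best combined score.
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# Hypothetical one-number descriptions of six songs, forming two obvious groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
S = [[-(a - b) ** 2 for b in points] for a in points]
off_diagonal = [S[i][k] for i in range(len(points)) for k in range(len(points)) if i != k]
preference = statistics.median(off_diagonal)
for i in range(len(points)):
    S[i][i] = preference

labels = affinity_propagation(S)
print(labels)   # labels[i] is the index of the song that song i voted for
```

Notice that nobody told the algorithm how many clusters to make. The self-preference plays that role instead: raise it and more songs are willing to vote for themselves.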

Click here to play with it (in a new window)! (careful, it can get kind of slow in my implementation).

Clustering by Passing Messages Between Data Points
Brendan J. Frey, et al. Science 315, 972 (2007)
Read the paper in PDF format

DNA Sequencing

Because DNA is so important, we're very interested in reading its sequence, but unfortunately our eyes are not quite keen enough to read off its little letters.

Fortunately, there are some tricks we can use to get around the fact that our eyes will never be so good.

First of all, DNA has a charge. So if you were to put an electric field across it, the DNA would start to migrate toward one side. And if you put the DNA into jello and then put an electric field across it, bigger pieces would have a harder time snaking through the jello. And surprisingly enough, if you left a bunch of pieces of DNA hanging around in jello, in an electric field, eventually the longer pieces would be at one end (not having moved very much), and the shorter pieces would have moved almost all the way across the jello. Jello Pudding Pops, being both more delicious and a horrible choice for the job, should not be used. So first: electric field, jello. K? K.

Secondly, DNA is like a big zipper. And if life has taught us anything, it's that if you heat things up, zippers tend to come undone. Rawr.

Third, there is a very handy enzyme called DNA polymerase that copies DNA. When pieces of DNA are unzipped, DNA polymerase loves driving along and laying down the complementary bases. If you ask it, "Hey DNA polymerase, want to get ice cream?" it will continue along, uninterested in anything but the prospect of putting down more complementary bases.

Clever scientists, particularly a man named Sanger, exploited these facts to figure out how to help read the sequence from DNA. First you start with a bunch of copies of the DNA you want to read. Then you put in bases (DNA polymerase needs DNA bases to use). For simplicity, imagine that you put in regular G, A, T, and C, but you also put in G's that have a big green fluorescent... thing hanging off of them. And A's that have a red, um... thing. And T's that have a blue, wait it's called a fluorophore. And C's that have a yellow fluorophore. Well these fluorophores are kind of big, and tend to give DNA polymerase trouble.

Now you put in the polymerase. If the original pieces of DNA had the sequence GATTACA (which I still haven't seen, but hear good things), then the polymerase will start to make the complementary piece, CTAATGT. But sometimes polymerase will grab regular G's, A's, T's, and C's, and sometimes it will grab the ones with the big fluorophores hanging off (we'll write these G*, A*, T*, and C*). Well, if you ever grab one with the big fluorophore in it, polymerase can't put any more bases on, and that piece of DNA is done. If there are enough pieces to begin with, you'd expect that this would eventually happen in every location:

GATTACA
CTAATGT*

GATTACA
CTAATG*

GATTACA
CTAAT*

GATTACA
CTAA*

GATTACA
CTA*

GATTACA
CT*

GATTACA
C*

And so if you put these pieces in jello in an electric field, then the last piece above should go all the way across the jello, but the first piece, being so cumbersome, should not. And so the electric field and the jello will put the pieces in sorted order by size. And if you read off the colors that you see, the furthest one, having a C* (again, last above) should glow yellow (telling you there was a C* at the end), and the second one, having a T* should glow blue (telling you there was a T* at the end).

If you read off these colors in order, then you should get the right complementary sequence to the DNA:

Yellow
Blue
Red
Red
Blue
Green
Blue

These are the exact colors that you would expect if the sequence were complemented by CTAATGT (which tells you the original sequence is GATTACA)! This process is called Sanger sequencing, and it earned Sanger his second Nobel Prize. Sweet!
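If you want to check the bookkeeping, the whole trick fits in a short simulation. This is only a toy sketch under the same simplifications as the text: one fragment per stopping position, the complementary strand written in the same left-to-right order as the original, and the gel treated as a perfect sort by length. The function names are made up for the example; the colors are the ones used above.

```python
import random

COMPLEMENT = {"G": "C", "C": "G", "A": "T", "T": "A"}
COLOR = {"G": "green", "A": "red", "T": "blue", "C": "yellow"}  # colors from the text

def complement(seq):
    """Base-by-base complement, in the same order as the input."""
    return "".join(COMPLEMENT[base] for base in seq)

def terminated_fragments(template):
    """Every prefix of the complementary strand; the last base carries the fluorophore."""
    copy = complement(template)
    return [copy[:n] for n in range(1, len(copy) + 1)]

def read_gel(fragments):
    """Sort fragments shortest first (furthest across the jello) and read each end color."""
    return [COLOR[fragment[-1]] for fragment in sorted(fragments, key=len)]

def colors_to_template(colors):
    """Turn the colors back into bases, then complement to recover the original."""
    base_for = {color: base for base, color in COLOR.items()}
    return complement("".join(base_for[color] for color in colors))

fragments = terminated_fragments("GATTACA")
random.Random(0).shuffle(fragments)   # the tube is a jumble; the jello does the sorting
colors = read_gel(fragments)
print(colors)                         # yellow, blue, red, red, blue, green, blue
print(colors_to_template(colors))     # GATTACA
```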

Gene Mapping (with applet!)

Before scientists were sequencing genomes, it was hard to find genes. The best way to go about it was to take lots of an organism --yeast, worms, flies, or mice-- and put them in toxic chemicals or expose them to radiation so that you would mutate some of their genes. This process is called "mutagenesis" and it has two types of effects. The first is somatic mutation, which is like a sunburn. It mutates a cell or group of cells, but cannot be passed to offspring. The second is mutation in the gametes, which will be passed to the offspring. This is much more interesting, because the mutation will now be present in every cell of the offspring, not just one or a few cells.

Once you have a bunch of offspring with mutations, then you can look for things about them that are funny. One of the great things about genetics is that you can look for almost anything here. If you find a fly with curled up wings, then you know you mutated a gene that influences the development of wings. If you find a mouse that gets sick easily, then you may have changed a gene involved in the immune system. Using this process you can find genes and make stocks of organisms that have interesting changes in these genes. Yea!

But then where are the genes? Well even this question is kind of hard to address. Where is the earth, for instance? One valid response is to stamp your foot and yell, "Right here!" but another response might be "Somewhere between Venus and Mars." And so we have to be very specific when we ask where genes are. And a very useful thing to ask (quietly, without yelling) is "Where are these genes relative to each other?" Good question! And thank you for not yelling.

And a good way to find if genes are near each other is to see how they are transmitted to offspring. If my father has two genes for brown hair (BB) and two genes for dark skin (DD) and my mother has two genes for blonde hair (bb) and two genes for light skin (dd), then I must have one of each (BbDd). Now if these genes are on separate chromosomes, then the gene that I pass on to my children for hair color should have nothing to do with the gene that I pass on for skin color. And so if I tell you that I have a daughter whom I give the brown hair gene (B), then you should have no good guess about the skin gene that I give her. And so it is 50/50 that I gave her dark or light skin color, even though I know for certain what hair color gene I gave her.

On the other hand, if these genes are located right next to each other on the same chromosome, then it is nearly impossible to get one without getting the other. Think of the hair color gene as a friend of yours and the skin color gene as your friend's very clingy date. If your friend walks across the room, then almost certainly, so does his/her date. So in this case, if I have a daughter and pass the brown hair gene to her (B), then you are almost certain that I passed on the dark skin gene (D), too. So in this case, if I tell you the allele (the particular version of a gene) I pass on for hair color, you almost certainly know the allele for skin color. Likewise, if I tell you the allele for skin color that I contribute, then you'll almost certainly know the allele for hair color that I pass on.

Conversely, you can run this principle backwards. If I tell you how often these genes are correlated in this way, then you can infer whether they are close or not, and even infer how close they are.
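Running the principle backwards mostly comes down to counting. In the example above, I got B and D together from my father and b and d together from my mother, so any time I pass on a mixed combination (Bd or bD), the two genes were separated. The fraction of mixed combinations is a rough measure of how far apart the genes are. The counts below are made up, and the function name is just for this sketch.

```python
def recombination_fraction(gametes, parental=("BD", "bd")):
    """Fraction of passed-on combinations that differ from the two inherited ones."""
    recombinant = sum(1 for g in gametes if g not in parental)
    return recombinant / len(gametes)

# Made-up counts of what a BbDd parent passed on to 100 offspring.
linked = ["BD"] * 46 + ["bd"] * 44 + ["Bd"] * 5 + ["bD"] * 5        # genes close together
unlinked = ["BD"] * 25 + ["bd"] * 25 + ["Bd"] * 25 + ["bD"] * 25    # separate chromosomes

print(recombination_fraction(linked))    # 0.1
print(recombination_fraction(unlinked))  # 0.5
```

A fraction near 0.5 is the 50/50 case: knowing one allele tells you nothing about the other. A fraction near 0 means the genes are close neighbors. For small values, geneticists conventionally read 1% recombination as roughly one map unit (a centimorgan) of distance.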

Click here to play with the mapping applet!