Wikipedia and Genomics Visualization: Separated At Birth?
One of the most fascinating sessions at our recent Foo Camp was one on visualization, in which a number of different folks ran through some of their visualization projects. There was a lot of mind-blowing stuff, but what particularly struck me was the similarity between a visualization that Fernanda Viegas had done (with Martin Wattenberg) of the change logs in Wikipedia entries and one that Ben Fry had done of Single Nucleotide Polymorphisms, or SNPs (comparisons of the genetic variation between individuals). While the similarities may simply be an artifact of a similar visualization technique, they are striking and thought-provoking, and suggest the organic nature of Wikipedia.
Here's one of the "history flow" images showing how a Wikipedia entry changes over time. (Check out the whole gallery. Some of them are very beautiful.) Interestingly, the image in the gallery that is most similar to the SNP visualization is the one for the entry on abortion, which, being a controversial subject, doesn't just grow over time like many Wikipedia entries, but has many deletions, moves, and other changes. So perhaps controversy forces evolution :-)
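To make the idea of deletions, insertions, and other changes between revisions a little more concrete, here is a minimal sketch using Python's standard difflib module. It is not the history flow technique itself, just a word-level comparison of two revisions; the function name `revision_changes` and the sample sentences are made up for illustration.

```python
from difflib import SequenceMatcher


def revision_changes(old, new):
    """List the non-matching spans between two revisions, word by word.

    Hypothetical helper for illustration only; this is a plain diff,
    not the history flow algorithm.
    """
    old_words, new_words = old.split(), new.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "delete", "insert"
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


# Invented sample revisions, not real Wikipedia text.
rev1 = "the quick brown fox jumps over the lazy dog"
rev2 = "the quick red fox leaps over the lazy dog today"
for change in revision_changes(rev1, rev2):
    print(change)
```

A plain diff like this reports replaced, deleted, and inserted spans but does not recognize a block of text that has moved; spotting moves would need extra bookkeeping on top of it.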
And here's one of Ben's SNP visualizations:
For a bit more detail:
Explaining his visualization, Ben writes: "When comparing the genome of two different people, you'll see single letter changes (called SNPs, pronounced "snips") every few thousand letters. An interesting feature of SNPs is that their ordering has distinct patterns, where sets of consecutive changes are most often found together. There are many methods for looking at this data, so this piece combines several of them into a single visual display. The project is described in greater detail in my dissertation [PDF], starting in chapter four."
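As a rough illustration of the "single letter changes" Ben describes, the sketch below lists the positions at which two already-aligned sequences differ. It assumes the sequences are the same length and already aligned; the function name `snp_positions` and the toy sequences are invented for this example, and real SNPs are far sparser (every few thousand letters, as Ben notes) and are identified across many individuals rather than from a single pair.

```python
def snp_positions(seq_a, seq_b):
    """Return the 0-based positions where two aligned sequences differ.

    Hypothetical helper: assumes equal-length, pre-aligned sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Toy data, made up for illustration.
person_a = "ATCGGATTCA"
person_b = "ATCGTATTCG"
print(snp_positions(person_a, person_b))  # [4, 9]
```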
Fernanda and Martin explain their technique here.
Comments: 7
Here is a simple solution to the Wikipedia storage problem.
First, consider that an xor operation on a two-state Boolean set of strings will give you a maximum compression of the image of the difference between the strings.
The compression issue in Wikipedia is an expansion of this basic problem. Instead of a two-state Boolean, you have, say, a 128*-state (dimension) "space" which requires a similar operation between successive vectors...a 128-dimensional xor operation. That general concept defines the limit of compression that any actual scheme is going to be capable of.
*This number is arbitrary, representing the extent of the old ASCII code. Unicode compressions would need an appropriately larger address space, although the actual data matrices wouldn't be much denser.
Oops...that was a comment intended for the post on the Hutter prize.
Since I can't delete the post, let me add that the famous Tufte graphic of Napoleon's invasion of Russia also resembles this graphic, and some interesting inferences can be drawn from that.
This shallow post needs some harsh criticism... Although visually they are very similar, the axes are too different to deserve the "separated at birth" title. First of all, Ben Fry's SNP plot does not have time on the X axis; rather, it shows the location of the SNPs in the aligned DNA sequences. The transitions summarize the percentage of SNPs linked to which other SNPs, rather than where a block of text ended up at various time points. The two plots would be analogous if the Wiki entries at different time points were treated as blocks of DNA strings that mutated, translocated, and were deleted over time, just like how strings of ATCG have changed over the eons.
The first mistake you can make in trying to understand data patterns through visualizations is not reading the labels on the axes.
Norman --
I wasn't suggesting that they were identical -- in fact, I noted that it might be nothing more than a visualization artifact. But I do believe that the similarity in visualization does in fact suggest something interesting. The level of complexity, the way that similar bits are relocated to slightly different places, etc. suggests an organic nature in Wikipedia. The visualization helps to make that idea concrete.
One of the most important things about visualization is that it helps to spark insight. There are fine-grained insights about detail, and there are broader, almost circumstantial (maybe even incorrect in detail) observations that are interesting at a higher level.
Ben Fry does some really phenomenal data visualization work. If he were to put together a coffee table book of his "beautiful evidence", I'll bet it would sell faster than an overhyped web2.0 socialsharingbookmarkmapsandmusic application.
Tim --
Point taken, and I apologize for the harsh criticism. The two graphs definitely look very similar in their organic nature, albeit they are used very differently. It's very interesting that you pointed out that, at a higher level, these two very different sets of data converged to look so similar while telling different stories.
On another note, Beautiful Evidence by Edward Tufte contains a chapter on the visualization of Napoleon's march toward Moscow; the poster can be seen here:
The transitions summarize the percentage of SNPs linked to which other SNPs, rather than where a block of text ended up at various time points. The two plots would be analogous if the Wiki entries at different time points were treated as blocks of DNA strings that mutated, translocated, and were deleted over time, just like how strings of ATCG have changed over the eons.