Turned up to eleven: Fair and Balanced

Wednesday, April 05, 2006

Known Unknowns
Our beloved Secretary of Defense, the wise and all-knowing Don Rumsfeld, once famously discussed "known knowns, known unknowns, and unknown unknowns" with respect to the well-known phantom WMD in Iraq. His comments were widely mocked at the time, but this is in fact an interesting area of thought, at least for us scientist types. Not so much with respect to phantom WMD or ill-advised foreign wars of choice, but in much more esoteric arenas.

There has been a bit more than a century of microbiological research done so far. A great deal of that work occurred in the pre-genetic era, and the vast majority of microbiology's history predates molecular biology. However, the march of progress and the exponential growth of the science community (as measured by the number of people working in the field as well as by the size of the literature) have led to an upward spiral of information. A new entrant into the field of microbiology (such as a new grad student or young assistant professor...) might well be concerned that there are no interesting problems left to solve.

So, to illustrate the depths of our ignorance, I will point you to a recent paper on lowly Escherichia coli K-12. It isn't just that E. coli is by far the most studied organism on the planet. The K-12 lab strain is also essentially the progenitor of all the lab strains used for molecular biology research (if you ever cloned a gene into E. coli as part of a classroom or research project, the host was almost certainly a derivative of K-12). In this particular paper, the investigators identified some genes that are crucial to biofilm formation, a phenotype that has been studied very heavily in recent years.

Now E. coli K-12 has approximately 4.6 million base pairs of sequence, every one of which has been determined (to see a list of all the microbial genome projects, see this NCBI page). The K-12 sequence was completed in 1997, and a total of 16 E. coli genomes are either complete or in the assembly stage of sequencing. The total number of identified protein-coding genes is 4237. If you click on the link you will go to a table of all the predicted and known proteins in this extraordinarily well studied organism. If you go down that table, you will find, very quickly, "hypothetical protein b0005" and "hypothetical protein b0006". If you go through the entire list, you will find quite literally thousands of "hypothetical" and "putative" proteins. If it is called "putative something," that means that, based on sequence homology, someone has predicted that the sequence encodes a protein with some particular function. If it is labeled "hypothetical protein," that means quite literally that no one has any idea what it does.
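If you would rather count than scroll, here is a minimal Python sketch of the tally described above. It assumes you have already pulled the product names out of the protein table into a list of strings. The function names (classify, tally) are my own invention, and the sample list is a tiny made-up illustration rather than the real table, apart from the two b-numbered entries quoted above.

```python
from collections import Counter


def classify(product: str) -> str:
    """Sort a product name into one of three bins based on its leading word."""
    name = product.strip().lower()
    if name.startswith("hypothetical"):
        return "hypothetical"   # nobody knows what it does
    if name.startswith("putative"):
        return "putative"       # function guessed from sequence homology
    return "characterized"      # everything else (an assumption of this sketch)


def tally(products):
    """Count how many product names fall into each bin."""
    return Counter(classify(p) for p in products)


# Illustrative sample only, NOT the real annotation table.
sample = [
    "threonine synthase",
    "hypothetical protein b0005",
    "hypothetical protein b0006",
    "putative transport protein",
]

counts = tally(sample)
for label in ("characterized", "putative", "hypothetical"):
    print(label, counts[label])
```

Run against the full table, the last two bins are where the unsolved problems live. Note that a simple prefix match like this will miss annotations worded differently (for example "conserved protein, function unknown"), so treat the counts as a lower bound.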

Get to work!!!