Sunday, January 22, 2006

Thinking like a programmer, searching like a fool

A recent news report told the story of a woman who was robbed and lost her entire thesis project because it was on a memory stick in her purse. The sensible part of me cringed when I read that she didn't have her master's thesis backed up somewhere on a hard drive, but the other part was intrigued by the way she solved the problem. She found her thesis in a green dumpster by trying to think like a thief.

I have to admire her method. I can make better guesses when searching for things, too, if I try to think like someone else. Only, in my case, I'm not a graduate student thinking like a thief; I'm a biologist trying to think like a programmer.

This approach doesn't work with everything. Like, why do GenBank records give the mRNA size in basepairs? And why do so many people think "data set" is one word? And why does the chicken cross the road?

Some questions just don't have a good answer, no matter how you think about them.

Nevertheless, it does seem that programmers use one kind of logic and biologists use another. Neither is ideal, but thinking like a programmer can help when you encounter puzzling results with bioinformatics tools. Microbes and biological systems may follow biological logic, but web server applications and software packages follow logic of a different kind.

One of these unexpected moments arrived just the other day. I began my quest at the strangely-named OMIM database by searching with SOD1. (I say "strangely named" since OMIM is a database of human genetic disorders, not just those found in men.) Notice below, in the image, the link, creatively named "Links," on the right-hand side of the page.

Normally, I ignore Links since I use Safari (by default) and clicking Links doesn't do anything. But for the sake of completeness, I thought I should give Links a try, so I opened the page with Firefox, clicked Links, and selected PubMed. Then I sorted by Date and looked at the results.


Those results couldn't be right! The most recent paper was dated 2004. Surely people are still researching ALS and publishing articles on SOD1!

My biologist's intuition was piqued. Naturally, I redid the experiment under slightly different conditions. This time I searched for SOD1 directly from PubMed. I found articles in PubMed that were dated a couple of weeks in the future.

Why were the PubMed links from OMIM so far behind?

Time for another experiment.
What would happen if I chose Links again, but started from a different database?

This time I found SOD1 in the Gene database and selected the PubMed link from the collection in the sidebar. Now, the most recent paper was from Oct. 2005. Still a bit behind, but closer to current than 2004. You can see the results from all three searches below.

As a biologist, I find results like these really kind of disturbing. I expected (erroneously) to get the same results no matter where I started searching. Could this be a problem with link rot? Not link rot at the NCBI! With my biologist hat firmly on my head, I did another experiment to look at Links to a database other than PubMed.

Links to Structures
This time I clicked Links from OMIM and chose the Structure database (MMDB). Only four structures appeared for SOD1. Searching directly from the Structure home page, I found 19 structures. Very worrying. Worse yet, I've used this method! : (

A quick flashback to software testing
Sometimes I get recruited to help out with software testing. These episodes give me the opportunity to read interesting test plans with puzzling questions that ask if the program showed the expected behavior. (Expected by who, I wonder?) Needless to say, it's not always the behavior I expected to see (thinking as a biologist, remember) even though it's often the behavior expected by someone.

That night, I had a strange dream about previewing filters and running lost through large pipes and complicated pipelines. In the morning I decided to tackle the problem by thinking like a programmer.

Maybe the software was working the way it was supposed to. Maybe there was another explanation and I didn't do the experiment that I thought I did.

Could I test that?

Fortunately, yes. Clicking the Details tab in the PubMed results lets you see the search that you really did, not the search you thought you did.

Here's what I really did:

But what did I search for?

It turns out that a straight search from a database home page, like PubMed or MMDB, occurs without filters. That is, you search all the records in the database and find the ones that match your query.

But if you begin searching from a different database, like OMIM, and you select Links, you see the world through a database-colored filter. Instead of searching all the PubMed records for SOD1, you only search the subset of PubMed records that are linked to OMIM.
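Here's a toy sketch of that difference, thinking like a programmer for a moment. Everything in it is made up for illustration: the `Record` class, the five pretend PubMed records, and the `linked_from_omim` flag are my assumptions, not anything from NCBI's actual software. It only shows how the same query can return an older "most recent" paper when it runs against a linked subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A pretend PubMed record (hypothetical, for illustration only)."""
    pmid: int
    title: str
    year: int
    linked_from_omim: bool  # True if an OMIM entry points at this record


# Invented records; the titles and years are not real citations.
PUBMED = [
    Record(1, "SOD1 mutations in familial ALS", 2003, True),
    Record(2, "SOD1 aggregation in motor neurons", 2004, True),
    Record(3, "SOD1 misfolding revisited", 2005, False),
    Record(4, "Chicken road-crossing behavior", 2006, False),
    Record(5, "New SOD1 mouse model", 2006, False),
]


def search(records, term, only_omim_linked=False):
    """Return matching records, newest first.

    With only_omim_linked=True, the query runs against the subset of
    records that OMIM links to, which mimics starting from Links.
    """
    hits = [r for r in records if term.lower() in r.title.lower()]
    if only_omim_linked:
        hits = [r for r in hits if r.linked_from_omim]
    return sorted(hits, key=lambda r: r.year, reverse=True)


direct = search(PUBMED, "SOD1")                            # from the PubMed home page
via_links = search(PUBMED, "SOD1", only_omim_linked=True)  # from OMIM via Links

print("direct:   ", [(r.pmid, r.year) for r in direct])
print("via Links:", [(r.pmid, r.year) for r in via_links])
```

In this toy data set, the direct search turns up a 2006 paper while the search through Links stops at 2004, even though nothing is broken. The filtered results are always a subset of the direct ones.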

When I chose Links from OMIM, I wasn't searching all the records in the PubMed or Structure database for SOD1, as I thought I was. I was searching a filtered subset containing only about 1% of the records from PubMed (147,450 out of 15,896,470).
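For anyone who wants to check my arithmetic, the fraction works out like this. The two counts are the ones quoted above; the variable names are just mine.

```python
# Record counts quoted in the post.
omim_linked_records = 147_450    # PubMed records reachable from OMIM via Links
all_pubmed_records = 15_896_470  # all PubMed records at the time

fraction = omim_linked_records / all_pubmed_records
print(f"Share of PubMed visible through the OMIM filter: {fraction:.2%}")
```

So "1%" is, if anything, a slight overestimate.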

The take home message
Links did behave in what (I assume) was the expected way. (I wish I hadn't sent that bug report! Sorry!)

It just appears that no one has added any PubMed or Structure references to OMIM for the past couple of years.

Did anyone expect that?

