Discovering biology in a digital world (Archives): May 2006

Wednesday, May 24, 2006

Time to fly

"I'm goin' where the sun keeps shinin'
Thru the pourin' rain,
Goin' where the weather suits my clothes.

Bankin' off the northeast wind,
Sailin' on a summer breeze,
Skippin' over the ocean like a stone."

- Fred Neil
Everybody's Talkin (Echoes)

Where am I going?

California?

Well, there too, for a few days, but blogwise, I'm going to ScienceBlogs. And you're welcome to come visit. My new link won't be functional for a few days yet, but I will post it when I'm ready for visitors.

Subject: Announcements

Tangled Bank #54 is up and ready

Yet another great collection of science blogamania is live and ready at Science and Politics.

Hosted by Bora Zivkovic, a hard-writing science blogger extraordinaire, it's Tangled Bank #54!!

Clear some time off your schedule and grab a cup of coffee; there's lots of good reading ahead.

technorati tags: carnivals, Tangled Bank, science

Wednesday, May 17, 2006

Hurrah for Syttende Mai!

Sunday, May 14, 2006

Hard-working birds on Mother's Day

At first, we thought the tent caterpillars were back for the summer.

Then, we saw a bird go into the "tent."

I guess you don't get Mother's Day off until after you've laid your eggs.

Happy Mother's Day to mothers and future mothers.

Subject: Birds

technorati tags: biology, birds

Thursday, May 11, 2006

Part II. Future Shock and Selenocysteine: it's time again to update the databanks

One of the surprises (for me anyway) in discovering the existence of selenocysteine (Part I. Future Shock and Selenocysteine), was the corresponding discovery that it's encoded by UGA. Ordinarily, UGA is a stop codon. If a UGA is in an mRNA sequence, it tells the ribosome that the job is done. It's time to pack up all the tRNAs and elongation factors and move on to the next project.

But in the case of selenocysteine, we have a "work-around." Sometimes UGA stops everything, sometimes the UGA says "put the selenocysteine right here." (Someone in the office joked that the ID g-o-d must be a programmer since he/she is trying to fix bugs.)

How do the ribosomes and tRNAs know whether to stop or go?

They feel the difference.

Seriously, the sequences at the 3' end of an mRNA fold into a special hairpin shape like the one shown here (1, 2, 3). In bacteria, this structure is called a "selenocysteine insertion sequence" or SECIS element. Eucaryotes have similar structures at the 3' ends of mRNAs for selenoproteins.

The RNA in the picture has a rainbow coloring scheme (Red, Orange, Yellow, Green, Blue, Indigo, Violet). Nucleotides at the 5' end are red, nucleotides at the 3' end are violet. You can follow the colors in the RNA backbone to see how the RNA is twisted around into a hairpin shape.

One of the tests that we used to give job applicants was to have them write a short script for translating a DNA sequence in 6 reading frames. We would give them a mouse pad with the genetic code and put them to work.

Selenocysteine makes this problem a whole lot harder.
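
For anyone who hasn't tried that exercise, here is a minimal Python sketch of a six-frame translation using the standard genetic code. The function names and the toy sequence are assumptions made up for illustration; this is not the script from the job test. Note that every TGA comes out as a stop (`*`), so a translator like this will never report a selenocysteine.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one frame codon by codon; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict of the six frames, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy sequence (hypothetical): ATG TGA TAA
frames = six_frame_translation("ATGTGATAA")
for name, protein in frames.items():
    print(name, protein)
```

In frame +1 the toy sequence reads `M**`: the in-frame TGA is reported as a stop, whether or not a ribosome would actually have stopped there.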

Since the recognition feature is a secondary structure, locating the coding sequences for selenocysteines presents an interesting challenge to computational biologists. Finding these sequences requires a bit more than a regular expression.

Kryukov et al. describe an algorithm for doing this type of search (1). They've refined it in the years since this publication, but it seems that the information has yet to percolate through much of the world's bioinformatics community.

And here, I thought I was the only one who seemed to have missed this.

Nope.

Can we find selenocysteine in GenBank?
I started to wonder if the news about selenocysteine had trickled out beyond PubMed articles and into the rest of the NCBI.

Could I find selenoprotein sequences in the Gene database? I thought this would be a good place to start since the data are well curated and there are links to reference protein sequences.

I searched and searched, and lo and behold, I found them.

The sequence above codes for human selenoprotein P. U is the one letter symbol that represents selenocysteine. This protein contains an unusually large number of selenocysteines.

I only looked at a few of the reference protein sequences (labeled NP---) from the Gene database, but they all seemed to have selenocysteines.

So the NCBI Gene Database seems to be caught up, at least for the sequences that I checked out.

Mischief and Misannotations
I followed the links to the Conserved Domain Database. (I'm writing a book on this BTW, and the CDD is really, really cool).

When I got to a summary page, I chose the SelP_C domain (since more U's are on that side of the protein). This gave me a page with a Pfam alignment between my human SelP sequence and some sequences that were chosen for Pfam. (You can take a look at this yourself by clicking the link above. Change the format to Hypertext and click Show Alignment to see the selenocysteines in the query sequence.)

Reading downward, the Pfam sequences in the alignment below are from cow, my query (human), rat, another human sequence, and zebrafish.

Every time my query sequence has a "U," the other sequences have a "c" (purple boxes above).
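
If you'd rather not hunt for those columns by eye, the same check can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the function name is invented and the aligned strings are toy data, not the real SelP alignment. It assumes the sequences are already aligned to the same length, with `-` for gaps.

```python
def residues_opposite_sec(query, subjects):
    """For every alignment column where the query has 'U', collect the
    residue that each subject sequence has in that same column."""
    report = {}
    for column, residue in enumerate(query):
        if residue.upper() == "U":
            report[column] = {name: seq[column] for name, seq in subjects.items()}
    return report


# Toy alignment (made up for illustration only).
toy_query = "MKUAG-UL"
toy_subjects = {
    "cow": "MKCAG-CL",
    "rat": "MRCAGSCL",
}

for column, residues in residues_opposite_sec(toy_query, toy_subjects).items():
    print(column, residues)
```

A column where the query has U and every other sequence has C is a hint worth following up in the database records; it doesn't say by itself whether the C is real biology or a mistranslation.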

This is interesting and odd. Only one of the proteins with the conserved SelP domain has selenocysteine (and it's our human query sequence).

One interpretation, which Kryukov suggested in 2003 (1) (and later regretted, I'm sure), is that through evolution, cysteine was substituted for selenocysteine in organisms like the rat and mouse.

I think the presence of the other human SelP sequence argues for another interpretation, especially since it's an older version of our query.

If we click the gi links to see the database records, we find something else that's interesting.

A note in the first sequence, from the cow, deposited in April 2006, shows that someone knew about the selenocysteines:

[MISCELLANEOUS] The selenocysteines are all encoded by the opal codon, UGA.

But apparently, no one bothered to put them in the amino acid sequence, since there aren't any selenocysteines there.

Maybe they didn't read the note.

Stranger yet, the missing selenocysteines could be rationalized away by arguing that the protein sequence is just a conceptual translation; that is, it was determined by using the standard genetic code. Except that using the standard genetic code would have generated a much shorter sequence, since UGA makes translation stop. So, instead of putting in the correct amino acid, the curators (Swiss-Prot?) typed in the wrong amino acid. Instead of using the U for selenocysteine, they typed a C for cysteine.
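
To see why a strictly standard conceptual translation comes out short, here is a small Python sketch. The function `translate_cds`, its `sec_codons` parameter, and the toy coding sequence are all assumptions invented for this example. The sketch does not decide which UGAs are recoded; that information has to come from the annotation or from a SECIS search.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds, sec_codons=frozenset()):
    """Translate a coding sequence until the first stop codon.

    Codon indexes (0-based) listed in sec_codons are read as
    selenocysteine ('U') when the codon there is TGA.
    """
    protein = []
    for index in range(len(cds) // 3):
        codon = cds[3 * index:3 * index + 3].upper()
        if codon == "TGA" and index in sec_codons:
            protein.append("U")
            continue
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence (hypothetical): ATG AAA TGA GGC TAA
toy_cds = "ATGAAATGAGGCTAA"
print(translate_cds(toy_cds))        # MK   (standard code stops at the TGA)
print(translate_cds(toy_cds, {2}))   # MKUG (TGA at codon 2 read as Sec)
```

Neither result has a C at the UGA position. A full-length sequence with cysteine there can't come from the standard code alone.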

The rat sequence was also updated in April 2006 and we can see that the positions of selenocysteines also seem to be marked in the GenPept record (below), but, funny, there aren't any selenocysteines in the rat sequence.

The other human sequence for SelP and the zebrafish sequence show the same kinds of annotations. Yet neither one contains selenocysteines in the amino acid sequence.

Could the source of the sequences (Pfam) be the source of the problems?
I'm not sure where the problem originates, but if I search the Pfam database at the Sanger Center for selenocysteine, I get a list of 31 proteins that contain it, and again, I get an annotation that indicates that someone is aware of selenocysteine:

SelP is the only known eukaryotic selenoprotein that contains multiple selenocysteine (Sec) residues...

Yet when I do a seed alignment, none of the amino acid sequences contain selenocysteine. Here is one example:

SEPP1_HUMAN/22-250 QDQSSLCKQPPAWSIRDQDPMLNSNGSVTVVALLQASCYLCIL
QASKLEDLRVKLKKEGYSNISYIVVNHQGISSRLKYT

The selenocysteines are missing here, too.

If I take this sequence and do a blastp search at the NCBI, I get quite a few perfect matches. Just like Pfam, there are sequences in GenBank that are not yet fixed.

What is our take home message?
The simple take-home message, of course, is to be aware that the FASTA sequences for selenium-containing proteins are likely to be wrong. If the annotations say there should be selenocysteine and you can't find a "U" in the sequence, it probably hasn't been fixed yet. Those of us who use the data must always be skeptical and read the literature.
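
That first check is easy to automate, at least crudely. The Python sketch below reads FASTA text and lists records whose header mentions a keyword but whose sequence has no U. The function names, the `keyword` parameter, and the toy records are all hypothetical. Matching on the header line is only a rough assumption; the real annotations live in the database records, not in the FASTA definition line.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def missing_selenocysteine(text, keyword="seleno"):
    """Return headers that mention the keyword but whose sequence has no 'U'."""
    return [
        header
        for header, sequence in parse_fasta(text)
        if keyword in header.lower() and "U" not in sequence.upper()
    ]


# Toy records (made up for illustration only).
toy_fasta = """\
>toy1 selenoprotein example
MKUAGL
>toy2 selenoprotein example
MKCA
GL
>toy3 unrelated protein
MKCAGL
"""

print(missing_selenocysteine(toy_fasta))
```

With the toy records, only `toy2` is flagged. A flagged record is a reason to go read the annotation and the literature, not proof of an error.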

The second take-home message concerns process. The acceptance of new ideas in science generally prompts some re-evaluation of older ideas. We evaluate older concepts more critically in the electric light of new ideas. It would be helpful if these processes could be applied more quickly to sequence data and bioinformatics algorithms. These results support the need for scientific curators who can read the literature, add annotations, and even make corrections in amino acid sequences, from time to time.

The amino acid matrices that we use for protein comparisons were updated when more sequences became available for doing alignments. We all switched from using PAM to BLOSUM matrices. Maybe it's time to update the Pfam domains as well.

Selenocysteine exists.

It's time to deal with it and get on with the work.

References:
1. Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigo R, Gladyshev VN. 2003. Characterization of mammalian selenoproteomes. Science. 300:1439-43.

2. Diamond AM. 2004. On the road to selenocysteine. Proc Natl Acad Sci U S A. 101:13395-13396.

3. Yoshizawa S, Rasubala L, Ose T, Kohda D, Fourmy D, Maenaka K. 2005. Structural basis for mRNA recognition by elongation factor SelB. Nat Struct Mol Biol. 12:198-203.

Subject: Doing biology with bioinformatics

technorati tags: bioinformatics, selenocysteine, biochemistry, biology, blast, genetics, genomics, DNA, RNA, Science Education

Carnival time

The carnivals are up and running.

Tangled Bank 53: Go climb a tree! looks at the tree of life from a higher point of view.

I and the bird #23 at birdDC asks the question, is it possible to do birding on the internet?

Enjoy!

Wednesday, May 10, 2006

Part I: Future Shock and Selenocysteine

Future Shock

When I was in high school, we read an intriguing book by Alvin Toffler called "Future Shock."

Now, the book is over 30 years old but some of the predictions Toffler made were uncanny.

One of the ideas Toffler proposed was that people could become overwhelmed and disoriented with the onslaught of new information. My field is a good example. For Geospiza, helping people manage large amounts of new data, while maintaining the old, is our whole raison d’être.

But going back to Toffler, he predicted that the increasing rate of societal change would cause some people to experience symptoms of "Future Shock." One morning you might wake up in a familiar place, but everything would seem a bit different and strange. I'm channeling the ghost of Jim Morrison a bit, but The Doors had the feeling nailed down.

It's never bothered me though, until the other day.

I learned something new that shook one of my core beliefs.

We have a new amino acid in the genetic code.

Sure, go ahead and laugh.

This might seem like an odd thing to be bothered by, but the genetic code was solved in the early 60's. Some things in life are NOT supposed to change. Yeah, there are some variations in translating DNA from different species, and we expect to learn new things from deciphering the genome, but no one expects changes in something as fundamental as the genetic code.

So, it was a bit jarring to find out that now there are 21 amino acids.

And it was even a little reassuring that no one believed me.

My husband kept insisting that this was a post-translational modification or some strange anomaly from archaebacteria.

Naturally, I was forced to hunt down a bunch of abstracts and read them to everyone (I love PubMed!).

Selenocysteine: our 21st amino acid

It's true. The new amino acid is selenocysteine and there are even special tRNAs that can add it during translation. The translation machinery recognizes the UGA stop codon, plus special secondary structures in mRNA, and puts in a selenocysteine instead of stopping.

This amino acid is uncommon, but GenBank has 7904 entries for selenoproteins and 3293 RefSeqs. Many are probably orthologs (the same protein in different organisms) or our favorites, those wonderful "hypothetical proteins," and I think some of the records represent the same sequence, but there's still a fair number to be found (except in Pfam, but more on that in part II).

Selenoproteins are pretty widespread, too. At least 25 selenoproteins are known in humans, and I found papers describing them in mouse, fruit flies, humans, fish, bacteria, and protozoans. Most selenoproteins only contain one selenium and it's positioned at the active site. One selenoprotein contains so many seleniums that this one protein, alone, accounts for half of the selenium in a cell.

I'm not too sure yet, about the function of these proteins. Some of the selenoproteins may be important in redox reactions, one might prevent heavy metal toxicity, and there seems to be some link to cancer, too.

And, guess what?

I wasn't the only one who was taken by surprise. It looks like some of our favorite bioinformaticists and genome annotators missed this one, too.

Stay tuned.

In part II, we look at the infinite loop of information updates and an interesting conclusion drawn from erroneous annotations.

Subject: Doing biology with bioinformatics

technorati tags: biology, bioinformatics, blast, genetics, genomics, DNA, RNA, Science Education

Wednesday, May 03, 2006

Animalcules Volume 1, Issue 7

It's the 4th of May, almost summer time and time to think about microbes.

Some of them are giant, some are just, well, ... unusual.

I guess we'll let the procaryotes go first, since apparently they're working by the clock. And you probably thought it was simple being a single-celled organism without a nucleus. You can read about them in Clocks in Bacteria IV: Clocks in other bacteria, brought to you by none other than Circadiana.

Next, in keeping with the image of pathogens as giant fluffy toys, we have a collection of Hands-on, Fun Microbiology Activities from the ASM. Let's Get Small, Yeast on the Rise, and Fun with Fomites are some of the entertaining activities that you could either try at home, or use to liven up a class with the small fry, or even larger fry. Fun with Fomites examines the wonders of things that grow on your kitchen cutting board, or even the pennies in your pockets. And there are plenty of helpful suggestions for the cognitively impaired, but since this isn't a political sort of blog, I won't go there.

Do your bacteria keep swimming away? Isn't chemotaxis a pain? I remember when researchers studied flagella and bacterial motion by using anti-flagella antibodies to pin the little suckers to a slide and then, they would watch the bacteria twirl around and around with a microscope. Ah, torturing bacteria! The BioCurious have found a better way. "Studying Bacteria with Atomic Force Microscopy" looks far more fun than the old antibody-and-slide method.

Further representing the uncultured world, we have the GMO pundit asking, "Is Studying Soil DNA Any Value to The Australian Farmer Part 1. About 99% of soil bacteria have never been grown." Even if it isn't, my collaborators have got college biology students doing PCR and sequencing dirt, so the farmers may not care, but knowing your soil bacteria will still be important for getting a good grade in general biology.

In the Deep-Sea News, we learn about some lovely cyanobacteria, the Deep phytoplankter Prochlorococcus, a "plain little mite at first site" but very productive in terms of biomass.

Ewen Callaway from Complex Medium contemplates the true meaning of diversity, in Does microbial diversity count? Is it important to preserve the bacteria in a desert oasis?

A carnival with bacteria, of course, would never be complete without our faithful laboratory friend, the mouse of the microbes, the king of our colon, the one and many, E. coli. In E. coli, Shigella, and Creationism, Mike the Mad Biologist even manages to link our fuzzy friend to creationism and Shigella, with an amazing amount of intestinal fortitude.

Now, Mike the Mad also has me worried about cleaning my aquarium. He writes More on That NY Times Article about the dangers of getting Salmonella from your fish tank. And, if that wasn't enough, well, he explains why tummy aches in Australia (Australia, Agriculture, and Antibiotics) might last longer than you'd think.

Who knows what evil bacteria lurk in the hearts of men? The protozoa know. Inspired by the TV, Paul Orwin identifies a microbial influence on pop culture (Microbiology and Pop Culture).

So far, we've let procaryotes have all the fun. This bird has had enough. Maybe it smelled the Campylobacter. Or could it be that this chicken has read Emerging Disease and Zoonoses #13--new swine influenza virus detected and just wants to play it safe? Maybe this rooster watched this short hysterically funny video clip and just isn't willing to play chicken?

Before we go too mad, perhaps we should foam at the mouth to read this story, and accompanying links on Rabies, the Novel, from Bora Zivkovic of Science and Politics.

And for one last, truly viral article, let's contemplate our friend the mosquito, one last whiny time, and read about West Nile Virus, a last friendly parting thought, from our friends at MicrobeWorld.

Check out the schedule for the next episode of Animalcules!

Subject: Microbiology

technorati tags: animalcules, biology, DNA, genomics, microbiology, science education
