Leapfrogging the genome of a flatworm

The tiny planarian flatworm has fascinated scientists for over a century because of its remarkable regenerative capacity. If you cut it into small pieces, each can regrow the missing parts needed to produce an entire animal. Why doesn't the same thing happen in mammals? Researchers hope to find answers through studies of the genetic programs at work in the worms' stem cells. This work requires a full list of planarian genes and a look at the transcriptome, the set of RNA molecules produced in specific types of cells. But obtaining this information has been extremely difficult for planaria and most other species, especially when genome sequences are not available. Now four groups within the MDC's Berlin Institute for Medical Systems Biology (BIMSB) have combined methods in a unique way, providing a shortcut to obtaining the transcriptome of planarian cells. The study, carried out by the groups of Wei Chen, Nikolaus Rajewsky, Christoph Dieterich and Stefan Kempa, appears in the July issue of Genome Research. The scientists say their datasets should give insights into planarian regeneration, and that the same approach can provide a great deal of new information about other organisms as well.

Two strategies carried out in parallel were combined to make the project successful. On the left side: scientists found an efficient way to construct a normalized full-length library of planarian cDNA. On the right: RNAs were extracted, cDNAs were built from them, and a complementary sequencing strategy was used. The combination of the two approaches led to a wealth of data about the molecules encoded by the planarian genome.

"Recently the genome of the planarian S. mediterranea was sequenced, and it was estimated to contain over 20,000 genes," Wei says. "Most of these are theoretical predictions made by a computer, based on comparisons to molecules found in other organisms and our knowledge of gene structure. True protein-encoding genes are transcribed into messenger RNAs that are then translated into proteins, but the majority of these molecules have never been seen in experiments." This means, he says, that their functions in cells are unknown. And some of the predicted genes might be artifacts: stretches of DNA sequence with enough features to fool the computer into thinking that they are real, but which are never actually used to make RNAs or proteins.

The usual path to finding genes and uncovering their functions, he says, requires obtaining a high-quality version of an organism's genome. With sequence information in hand, you can go looking for specific RNA molecules that have been produced in particular types of cells, and proteins made from the RNAs.

But there are several catches. Genomes are sequenced by cutting an organism's DNA into tiny fragments, reading their chemical sequences, and then assembling them into a long, linear script. You can compare it to taking millions of copies of a play by Shakespeare, cutting the text into fragments of random lengths, and then trying to reconstruct the original text by identifying overlapping words at the beginnings and ends of the fragments. The longer the pieces, the easier it is to combine them into a faithful version of the original. But many sequencing methods produce very short strings of text, and organisms often have highly repetitive genomes, as if Shakespeare were to use the same sequences of words over and over throughout the play. These facts make it hard to obtain an accurate version of the genome and then study the transcriptome.
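For readers who like to see the idea in code, the Shakespeare analogy can be sketched in a few lines of Python. This is only a toy illustration of overlap-based assembly, not the software used in the study: the function names `overlap` and `greedy_assemble` and the `min_len` parameter are invented here, and the sketch ignores sequencing errors and fragments that lie entirely inside other fragments.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap.

    Hypothetical helper for illustration only; real assemblers use far
    more elaborate strategies.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: the pieces stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


pieces = ["shall I com", "I compare", "pare thee"]
print(greedy_assemble(pieces))
```

With short fragments and repeated phrases, several merges look equally good and a greedy choice can join the wrong pieces, which is the difficulty described above.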

The BIMSB groups have a range of expertise in sequencing DNA and RNA molecules, as well as identifying proteins found in experiments. As the labs began working together at the MDC, they reasoned that these different types of information could be combined. Since the sequences of RNAs and proteins reflect those of genes, any one type of molecule can provide information about the others. With the sequence of a protein or RNA in hand, scientists can reconstruct and identify the sequence of the gene that produced it.

"These three types of data can be used to support each other and give a real picture of the genome and its products," says Catherine Adamidi, a scientist in Nikolaus' lab, who played a key role in initiating the project. She began organizing and working out the details of a major project with the other groups that would draw the information together into a new, detailed picture of planaria molecules.

Yongbo Wang and other members of Wei Chen's lab developed an efficient strategy to resequence a major fraction of the genome using two sequencing techniques that produced both long and shorter DNA fragments. Xintian You began the complicated job of assembling them into thousands of transcripts. "Usually the assembly process is very challenging using only short sequencing reads," Wei says. "We thought this could be improved by combining the long reads, which are relatively few in number, with a large amount of shorter sequence data."

The high dynamic range of mRNA expression poses another problem for comprehensive mRNA sequencing. "This is a difficult process," Nikolaus says, "because the data is skewed. Cells have thousands or millions of times more copies of some RNAs than others, and this has to be taken into account to establish a set that really reflects the output of the genome." Therefore, in order to sequence as many different RNA transcripts as possible, Yongbo and Catherine applied an efficient method to balance the representation of RNAs of different abundances before carrying out sequencing.

The data alone could not reveal its own quality. To assess the quality of the assembled transcripts carefully, Dominic Gruen and Christoph Dieterich applied several computational strategies. Together with experimental validation, these analyses demonstrated the overall high quality of the data.

Guido Mastrobuoni and Stefan Kempa extracted huge collections of proteins from the cells and studied their sequences. This work revealed protein sequences that confirmed the existence of 4,200 RNAs transcribed from genes. Many of these were new: they hadn't been predicted by earlier, purely "computational" studies of the genome. Confirming the presence of these molecules in stem cells showed that the overall approach provided a valid new method to study the genome.
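The logic of using proteins to confirm transcripts can be shown with a small Python sketch. It is a simplified illustration rather than the groups' actual pipeline: it translates each assembled transcript in its three forward reading frames using the standard genetic code and asks whether any observed peptide appears in the result. The names `translate` and `supported_transcripts`, and the example sequences, are made up for this illustration, and the sketch assumes each peptide matches exactly.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate a DNA/cDNA string starting at offset `frame` (0, 1 or 2)."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Unknown codons (for example, containing N) become 'X'.
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def supported_transcripts(transcripts, peptides):
    """Names of transcripts whose translation contains an observed peptide.

    Hypothetical helper: only forward frames and exact matches are checked.
    """
    return {
        name
        for name, seq in transcripts.items()
        if any(pep in translate(seq, f) for f in range(3) for pep in peptides)
    }


# Made-up example data, not sequences from the study.
transcripts = {"t1": "ATGGCCAAGTGA", "t2": "CCCCCCCCCCCC"}
peptides = ["MAK"]
print(supported_transcripts(transcripts, peptides))
```

A transcript that is matched by an observed peptide has independent evidence that it is really translated into protein, which is the kind of support described above.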

The experiments dramatically expand scientists' knowledge of planarian genes and the molecules that they produce, giving them a stronger foothold as they search for the secrets of planarian regeneration. The groups are freely providing the data to the research community in a public database.

"In addition," Wei says, "this is a robust method that can be applied to other important laboratory organisms whose genomes have not been assembled or thoroughly studied."

Nikolaus, the scientific head of BIMSB, says that the study has another important message. "Our approach within BIMSB is to combine the efforts of groups working on sequencing, wet lab experiments, and computation to solve significant biological problems," he says. "This project shows that you can indeed leap ahead by combining these diverse and complementary approaches in creative new ways."

- Russ Hodge  

Highlight Reference:

Adamidi C, Wang Y, Gruen D, Mastrobuoni G, You X, Tolle D, Dodt M, Mackowiak SD, Gogol-Doering A, Oenal P, Rybak A, Ross E, Alvarado AS, Kempa S, Dieterich C, Rajewsky N, Chen W. De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics. Genome Res. 2011 May 2
