Meyer Lab

RNA Structure and Transcriptome Regulation

Research Overview

Bioinformatics of RNA structure and transcriptome regulation

Introduction and Lay Summary

When the human genome sequence was released more than a decade ago, it came as a surprise to many that the number of protein-coding genes was not radically different from the corresponding gene count of the seemingly more humble nematode Caenorhabditis elegans (C. elegans). The current gene counts (20313 for human (GRCh38.p5) versus 20447 for C. elegans (WBcel235)) are stunningly similar. The gene count by itself is thus a poor measure of the complexity of the corresponding organism.

Another surprising finding in the wake of the human genome sequencing project was the realisation that only a small fraction of the genome (<2%) actually encodes protein information. Many genes seem not to encode any protein product at all (25180 so-called RNA genes (GRCh38.p5)). Moreover, even the primary transcripts of protein-coding genes contain a seemingly disproportionate fraction of non-coding nucleotides (introns and untranslated regions).

The primary products of all activated genes are transcripts (RNA sequences). The functional products of these transcripts are proteins as well as functional RNAs, which constitute key cellular players in any organism. How and when any of these products are generated is a fine-tuned process that depends, for example, on the tissue type and developmental trajectory of each individual cell. As the functional products define the current state of each cell (whether this is a state of disease or health), it is of key importance to understand how the different functional products of the transcriptome are made. Without this knowledge, we not only lack information on why certain products are made, but also have no means of correcting erroneously made products if the cell is in a state of disease. Somewhat surprisingly, however, the molecular mechanisms underlying transcriptome regulation remain largely underexplored.

We hypothesize that RNA structure features and trans RNA-RNA interactions between two different transcripts play decisive functional roles in regulating gene expression at the transcriptome level. To test this hypothesis, we devise new computational methods that allow us to discover new mechanisms of transcriptome regulation based on sequence information alone (e.g. RNA-seq transcriptome data). Owing to the size of today's transcriptome data sets, we can even detect, with significant statistical evidence, subtle mechanisms of transcriptome regulation that would be hard or impossible to detect using the best experimental methods. Our recent analysis of A-to-I RNA editing in the fruit fly is one example.

Beyond the one-dimensional view of transcripts

More often than not, figures in textbooks or on educational web pages illustrate the Central Dogma of Biology by depicting transcripts as linear or wavy sticks inside a eukaryotic cell, with transcription and splicing seemingly happening consecutively. What we know from many dedicated experiments, however, is that processes that alter the primary transcripts (e.g. splicing, RNA editing and RNA structure formation) happen co-transcriptionally, i.e. while the RNA sequence is being transcribed from the genome. Like protein information, information on RNA structure or potential trans RNA-RNA interaction partners can be directly encoded in the transcript itself, which makes these regulatory signals evolutionarily robust. We thus expect that RNA structural features and RNA-RNA interactions are widely used for regulating gene expression at the transcript level.

Modelling RNA structures in vivo

In order to devise computational methods for detecting the RNA structural features that are functionally relevant in vivo, it is worth acknowledging the complexity of the cellular environment and the impact this may have on the structure formation process (see our review paper). By devising the new RNA secondary structure prediction program CoFold, we showed that it is possible to capture the overall effects of the speed and directionality of transcription in vivo, and also confirmed a long-standing hypothesis by Morgan and Higgs from 1996. Our method yields significantly improved predictions, especially for long transcripts (> 200 nt) such as ribosomal RNAs. We already know from one of our earlier in silico studies that the sequences of structured RNAs encode information not only on their final RNA structure, but also on how these RNAs fold in vivo during co-transcriptional folding.

Figure 1: Arc-plot for the HDV ribozyme made using R-Chie. Each arc represents one pair of base-paired alignment columns. Arcs and the alignment at the top show the alternative structure and the active structure; those at the bottom the inhibitory alternative structure. The left legend specifies the percentage of canonical base-pairs for each arc. The right legend colour-codes the nucleotides and specifies the evolutionary evidence supporting each arc.
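To make the left legend of Figure 1 concrete, the following minimal Python sketch computes the percentage of canonical base-pairs for one arc, i.e. for one pair of alignment columns. It is an illustration only and not the R-Chie implementation: the function name `percent_canonical`, the 0-based column indices, the toy alignment and the choice to count gapped sequences as non-canonical are all assumptions made for this example.

```python
# Canonical RNA base-pairs: {G,C}, {A,U} and the wobble pair {G,U}.
CANONICAL_PAIRS = {frozenset("GC"), frozenset("AU"), frozenset("GU")}


def percent_canonical(alignment, i, j):
    """Percentage of sequences with a canonical base-pair at columns i and j.

    `alignment` is a list of equal-length, aligned RNA sequences and i, j are
    0-based column indices (an assumption of this sketch). Sequences with a
    gap or a non-canonical combination at the two columns count as
    non-canonical.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    canonical = sum(
        1
        for seq in alignment
        if frozenset((seq[i].upper(), seq[j].upper())) in CANONICAL_PAIRS
    )
    return 100.0 * canonical / len(alignment)


# Hypothetical toy alignment of three sequences forming a short hairpin.
toy_alignment = ["GGCAAGCC", "GACAAGUC", "GGCAAGAC"]
outer_arc = percent_canonical(toy_alignment, 0, 7)   # G-C in all three sequences
middle_arc = percent_canonical(toy_alignment, 1, 6)  # G-C, A-U, then G-A
```

In the toy alignment, the arc between columns 1 and 6 retains a canonical pair in two of the three sequences even though the nucleotides differ (G-C versus A-U), which is the kind of pattern a comparative approach can treat as evolutionary evidence for an arc.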

It turns out that orthologous transcripts from related organisms also have similar co-transcriptional folding pathways and that distinct transient RNA structure features can be as conserved and functionally relevant as those of the final RNA structure, see [1], [2] and [3]. This has significant implications for many state-of-the-art methods in RNA secondary structure prediction, as these typically assume that any given transcript folds into exactly one functional RNA structure. Transat, a probabilistic method that we developed earlier, aims to address this problem and has allowed us to detect individual, conserved RNA secondary structure features of pseudoknotted structures, riboswitches and transient structures, which are otherwise notoriously difficult to predict.

RNA structure features involved in splicing regulation

Figure 2:
(A) Genomic context of identified editing sites.
(B) Distribution of conversion types for four tissue types.
(C) Percentage of common editing sites between pairs of tissues.
(Bottom) Gene CG5850 is differentially expressed between head (blue) and digestive system (red) and editing and splicing may affect each other. X-axis: exons of the gene, y-axis: number of reads normalized by library size. Arrows show editing sites. The purple box is predicted to be alternatively expressed.

Viral genomes such as Hepatitis C and HIV-1 are known to encode functional RNA structure in protein-coding regions, as one major constraint for their genomes is to remain short. We contributed early on to these studies by showing that these RNA structures can be reliably predicted provided the known protein context is explicitly taken into account, see [1], [2] by us and also [3]. Functional RNA structures overlapping protein-coding regions, however, are not the preserve of viral genomes, but can also regulate the alternative splicing and translation of eukaryotic protein-coding genes, e.g. in Arabidopsis thaliana, mouse and human. In order to explore the link between RNA structure and alternative splicing on a transcriptome-wide scale, we recently analysed tissue-specific high-throughput transcriptome data from the fruit fly. Using a new, probabilistic analysis pipeline that explicitly captures the ADAR requirement for double-stranded regions, we identified around 2000 novel editing sites as well as more than 200 regions where local RNA structure changes due to A-to-I RNA editing are likely to induce corresponding changes in the splicing pattern, see our paper for details.

Figure 3:
(Top) Arc-plot for the highlighted region of the Cip4 gene containing a predicted, conserved RNA secondary structure overlapping RNA editing sites (red arrows) that could influence alternative splicing via structural changes. The left legend colour-codes the nucleotides according to the evidence supporting each arc, see also Figure 1. Figure made using R-Chie. (Bottom) Gene structure of the Cip4 gene with grey box highlighting the structure-containing part at the top.

Trans RNA-RNA interactions regulating the transcriptome

RNAs not only have the potential to form RNA structure, but can also interact with other RNAs in trans. These trans-interactions involve the same simple structural building blocks as RNA structure features, i.e. hydrogen bonds and stacking interactions involving pairs of complementary nucleotides ({G,C}, {A,U} and {G,U}). In terms of evolution, it is much more straightforward to evolve a specific trans RNA-RNA interaction than to come up with a (properly folded) protein that would engage in a similarly specific protein-RNA interaction. We therefore hypothesize that many novel biological classes of trans RNA-RNA interactions (beyond the already well-known classes such as miRNA-mRNA and snoRNA-rRNA) remain to be discovered. We have shown in a range of settings how the comparative, in silico approach can be harnessed to significantly improve upon existing state-of-the-art methods. We thus continue to develop new computational methods that allow us to make discoveries that would otherwise be difficult to make. To this end, we also collaborate with dedicated experimental groups that allow us to generate large-scale transcriptome data sets (which constitute the input to our methods) and that test our high-ranking predictions in dedicated follow-up experiments.
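As a rough illustration of how the complementary nucleotide pairs listed above can mediate a trans interaction, the Python sketch below searches two transcripts for the longest contiguous stretch that could pair in an antiparallel orientation. It is a deliberately simplified toy and not one of our prediction methods: it ignores thermodynamics, bulges, intramolecular structure and evolutionary evidence, and the function name `longest_trans_duplex` and the example sequences are hypothetical.

```python
# Ordered (base in transcript A, base in transcript B) combinations that can
# pair: {G,C}, {A,U} and the wobble pair {G,U}.
PAIRS = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def longest_trans_duplex(rna_a, rna_b):
    """Find the longest gap-free antiparallel duplex between two RNAs.

    Returns (length, start_a, start_b), where start_a and start_b are the
    0-based 5' start positions of the paired stretch in rna_a and rna_b.
    Returns (0, 0, 0) if no nucleotides can pair.
    """
    a = rna_a.upper()
    b_rev = rna_b.upper()[::-1]  # reversing b turns antiparallel pairing into a diagonal
    best = (0, 0, 0)
    prev = [0] * (len(b_rev) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b_rev) + 1)
        for j in range(1, len(b_rev) + 1):
            if (a[i - 1], b_rev[j - 1]) in PAIRS:
                # Extend the duplex that ended at the previous diagonal cell.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    # Position j in the reversed string maps back to
                    # len(b) - j in the original 5'-to-3' coordinates.
                    best = (cur[j], i - cur[j], len(b_rev) - j)
        prev = cur
    return best


# Hypothetical example: GGCUC in the first RNA pairs with GAGCC in the second.
length, start_a, start_b = longest_trans_duplex("AAGGCUCAA", "CCGAGCCCC")
```

The dynamic-programming table is the same one used for a longest-common-substring search, with equality of characters replaced by the ability to pair. A realistic method would score candidate interactions by their energy and conservation rather than by length alone.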