User:ElNando888/CBPP

With Cloud Lab about to launch, I've been thinking about using this powerful new pipeline to attack a problem that has been bothering me for quite some time.

The issue

Put very simply, I have been wondering whether we, EteRNA players, have actually learned the rules that govern how RNA sequences fold, or whether we have merely learned to produce SHAPE data that mimic the targeted secondary structures.

I was first prompted to this thought by an old thread on the GetSatisfaction forum: Non-Canonical Base Pairs and SHAPE: To Boost or Not to Boost?.

Yesterday (I'm writing this at 10:25, 26 April 2013 (UTC)), I found this fascinating scientific paper. The bottom line of the paper is simply that the SHAPE signal:

correlates strongly with cis-Watson-Crick base pairs (the canonical ones)
does not correlate strongly with noncanonical pairings
does correlate strongly with stacking interactions

The team then went ahead and tried (apparently with some success) to define a probability profile based on SHAPE data that better reflects whether bases are actually paired or not. It turns out that the best models are those that take into account not only the SHAPE score of the nucleotide under consideration, but also the scores of its neighboring bases (see the paper for details).
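To make the "neighboring bases" idea concrete, here is a minimal sketch of what a neighbor-aware score could look like. This is not the model from the paper: the function names, the window size, the padding value, and above all the weights and the reactivity numbers are hypothetical, hand-picked for illustration. The only assumption about the chemistry is the usual one, that high SHAPE reactivity tends to indicate an unpaired, flexible base, hence the negative weights.

```python
import math

def window(scores, i, radius=1, pad=0.5):
    """Return reactivities at positions i-radius..i+radius.

    Positions that fall off either end of the sequence are filled
    with `pad` (a hypothetical neutral value).
    """
    return [scores[j] if 0 <= j < len(scores) else pad
            for j in range(i - radius, i + radius + 1)]

def paired_probability(scores, i, weights, bias):
    """Logistic score: estimate P(paired) for base i from its own
    reactivity and the reactivities of its neighbors."""
    feats = window(scores, i, radius=len(weights) // 2)
    z = bias + sum(w * x for w, x in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: NOT fitted to any data, NOT from the paper.
# Order is [left neighbor, base itself, right neighbor].
WEIGHTS = [-1.0, -3.0, -1.0]
BIAS = 2.0

# Made-up normalized reactivities for a 7-nucleotide stretch.
shape = [0.05, 0.10, 0.08, 0.90, 1.20, 0.95, 0.07]
probs = [paired_probability(shape, i, WEIGHTS, BIAS) for i in range(len(shape))]
```

With these toy numbers, the quiet bases at the start come out as probably paired and the reactive stretch in the middle as probably unpaired. In a real profile the weights would of course have to be fitted against solved structures rather than guessed.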

What does all that mean? Simply put, when we look at lab results in EteRNA, we can never tell with confidence whether these shades of blue or yellow are due to actual base pairings or to stacking interactions. It could be one, the other, or a little bit of both.

Edit: come to think of it, what do you think was happening in those long stretches of A's in loops in the first rounds of Cloud Lab?

That said, their experiments were run on only 7 solved structures taken from the PDB, and they used the NMIA reagent.

The project

I say, we (EteRNA) do it, just better :)

With the throughput of Cloud Lab, we should be able to get data for hundreds of solved structures. For starters, we will be getting SHAPE data obtained with the reagent in use here, the other (faster) reagent named 1M7. Its results may or may not differ slightly from those of NMIA.

Also, if I'm not mistaken, we will be getting other chemical mapping data, namely DMS and CMCT. With some luck, those mappings will be orthogonal enough to let us distinguish more clearly between signal coming from base pairings and signal coming from stacking interactions.

Then we would have to come up with a probability profile (which I'm thinking of dubbing CBPP, for Consolidated Base Pairing Probability) that maximizes the predictive power of these combined chemical mappings.
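As a rough idea of how such a profile could be built, here is a minimal sketch that fits a logistic regression on three reactivities per nucleotide (SHAPE, DMS, CMCT) against paired/unpaired labels taken from a solved structure. Everything here is an assumption made for illustration: the names fit_cbpp and cbpp, the choice of logistic regression, the learning rate, and the data, which are made-up toy numbers and not lab results. A real version would also have to deal with reagents that only report on certain bases, which this sketch ignores.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_cbpp(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by plain batch gradient descent.

    rows   : list of [shape, dms, cmct] reactivities, one per nucleotide
    labels : 1 if the nucleotide is paired in the solved structure, else 0
    """
    n_feat = len(rows[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this nucleotide.
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for k in range(n_feat):
                grad_w[k] += err * x[k]
            grad_b += err
        # Average the gradient over all nucleotides and take one step.
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def cbpp(x, w, b):
    """Consolidated probability that a nucleotide is paired."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Made-up toy data (hypothetical, for illustration only).
ROWS = [[0.05, 0.10, 0.00], [0.10, 0.05, 0.10], [0.08, 0.00, 0.05],
        [0.90, 0.80, 0.70], [1.20, 0.90, 1.00], [0.95, 1.10, 0.60]]
LABELS = [1, 1, 1, 0, 0, 0]

w, b = fit_cbpp(ROWS, LABELS)
```

After fitting, cbpp(row, w, b) gives a single number between 0 and 1 per nucleotide. Whether three mappings actually predict better than SHAPE alone is exactly what the Cloud Lab data would have to tell us, by comparing fits with and without the extra columns on structures held out from the fit.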