User:ElNando888/Blog/Designing for the A B lab
The newest lab brings us a step closer to the goal of designing sequences with an actual practical application in the real world, in other words, a possible effect on medicine in general.
Getting there wasn't exactly simple. Maybe it was a bit serendipitous that I grabbed the reins of the Flash coding in the project about a year ago. And the efforts I placed into integrating a different folding engine (Vienna2) was a personal wish, shared by a few other players, though I wasn't asked for it by the team. But it did open the door for subsequent integrations, specifically Nupack, and the capability to properly model multistrand puzzles within the game.
However, this integration is not without flaws. The first of them is speed. While I tried hard to optimize and stabilize the MFE routines, the engine is still globally very slow compared to the Vienna ones. I'll keep trying to improve on that, but I can't make promises, specially considering that in the long run, we don't really want to continue working with the Flash platform...
When it comes to designing lab sequences, there is currently a second and arguably more important problem: base pairing probabilities (dotplots) aren't available for multistrand targets. Yes, if you open the dialog, the applet will display "something". The data shown is utterly useless though, because the applet doesn't currently take the concentrations of the oligos into account in the partition function calculations.
Fixing that problem is unfortunately not as simple as saying "just do it". I won't get here into the gory details as to why, because only software engineers would be able to follow, but in general terms, there are important architectural differences between the way the routines were coded in the Nupack engine (specifically as individual steps instantiated as a series of command-line tools using a bunch of intermediate buffers stored in the form of files, forming analysis/design "pipelines"), and the way we'd like to use them inside the game (a single web-based executable instance using no filesystem). Overall, it makes the task of properly translating the routines to Flash long, heavy and complicated. At this point in time, the team hasn't decided whether I should finalize this particular integration, and if it's deemed necessary to do it (remember, we may eventually not want to invest more time and efforts on a platform we'd like to abandon), what priority to assign to it.
Which leaves us with the current situation where players can only rely on the predicted MFE within the game. However, it is possible to obtain full partition function analysis by using the Nupack server, and I'm going to show you guys the way I analyze my own sequences, with respect to the current lab goals.
Disclaimer: most of you know that I'm a software engineer and used to be a simple player (well, if you didn't, now you do). I am no trained scientist, neither in biochemistry, nor in any other scientific field. So please keep in mind that what I'm presenting below is only as possibly good, and necessarily as bad as my understanding of the Nupack engine and nucleic acids thermodynamics in general.
We first need to specify global parameters. So, we're working with RNA strands, not DNA, and the temperature is the default one 37°C. However, we have 3 different strands in the solution, and we will also indicate a maximum complex size of 3, which simply means that we're telling the engine to ignore cases like a complex with 5 strands (say, 1 design, 2 TB-A and 2 TB-B) all bound together, which we believe is extremely unlikely to occur. Putting this limit in place here saves a lot of CPU time, but be careful with it, in a different context (other sequences with a different goal for instance), this (un)likelihood should be re-evaluated.
We can now start filling in the specifics:
I'll be using the example sequence I submitted in the R2 lab target. A few notes:
- Naming the strands is optional, but recommended
- Pay attention to the concentration units.
- I typically do 5 runs, corresponding to the 4 "key" states featured in the R2/R3 targets, plus a fifth "state" (more on this later). In the first run, I used the concentration values matching the first state of the R3 target, so 0 nM for TB-A and 5 nM for TB-B.
- The concentration specified here for the design itself, must be very low, so use 1 pM.
Time to click the 'Analyze' button. After a while, a results page appears. Before anything else, a small tuning that you only need to do once: click the 'Histogram filters' button, change the inputs like follows:
and click 'Redisplay'. The results are recalculated, and should now include everything.
At the top of the results page, we have the base pairing probabilities plot (click the picture to enlarge it)
Nice, eh? But what exactly are we looking for here? Well, this is state 1 of the R3 puzzle, and thus, our goal is to have the probability of formation of the MS2 hairpin as low as possible.
A simple way to approximate this probability is to simply use the probability of formation of the outer-most pair.
For the considered design, it's 8-26.
Now we just click the 'Download data' button, and we find:
We have our first data point, 4 to go.
Go back to the results page, scroll down to the bottom, and click the 'Change strand concentrations' button:
We enter now 5 nM for both oligos, which corresponds to the state 2 of the R3 puzzle, and we recompute.
Pairing probabilities have changed substantially, and the approximate proportion of MS2 hairpin in the solution becomes:
58.88%, for a state where the expectation is that the MS2 signal is raised.
And so on, until all data points are gathered.
The 4th row in the above table is the "fifth state" I was talking about earlier. It does not correspond to any state visible in the puzzles, but I feel that it gives a good idea of whether the design performs as planned, e.g. exactly at the cutoff ratio of 1/4, the MS2 signal should be about average.
There are a few details that can be tuned in the above procedure. Just to mention one of them, one could argue that design dimers are nearly impossible since these RNA strands are tethered to the polymerase, itself still locked on the DNA template, which is itself sticking to the surface of the microarray chip. It is possible to tune the input parameters to take this effect into account, but it is my experience that the impact on the predictions is quite negligible.
Also, the probability of presence/absence of a MS2 hairpin is not directly predictive of the observed luminescence, and the free energy of binding for the MS2 protein does affect the folding of the RNA. I'm not sure what the experimental protocol will be exactly, but I can't really imagine that the lab folks are going to scan all variables, which in this case are:
- TB-A concentration
- TB-B concentration
- MS2 concentration
In order for them to characterize the behavior of the designs in the (TB-A, TB-B) 2-dimensional space, I assume that they will test like 5~6 concentrations in both directions, which is already 25~36 measurements. I would bet that they won't probe at many different MS2 concentrations, and most likely at only one. How all this will be factored in into the evaluation and scoring is unknown to me.
As a final word, I can't really tell you whether any set of numbers you'll be getting from the above-described procedure are really "good" or terribly "bad". In broad terms, the table above gives some indication of the general expectation. And I just hope this exposé will help some of you getting started.