LabDataMiner

From Eterna Wiki

The Lab Data Miner tool was inspired by prior work on Computationally Selected Elements and Omei's Cloud Lab Data Mining Tool. It lets users search for sequences that have been tried as (sub-)structure designs and synthesized in previous labs. Users can constrain both the search and the display of results.

An initial attempt to create a cross-lab Lab Data Miner within the EternaScript environment proved too slow to be practical, often taking 15-20 minutes to run a single search. Experimentation showed that most of this time was spent loading the lab results data through the database APIs. Since this data changes only infrequently, it is significantly faster to prefetch and lightly preprocess it into a single JavaScript file that loads in a few seconds. The actual data mining queries then run in the user's browser, typically in seconds and often in less than a second. Once the data has been loaded, multiple searches can be run almost instantly.

The Lab Data Miner is currently under active development. A quasi-stable version can normally be found at http://166.78.137.98:8888/ldm.html. This version is periodically updated as new development releases are made.

Substructure Specifications

The primary use case for the tool is searching for tried and tested examples of a given structural component, like a tetra-loop or 4-way 0-0-0-0 multi-loop. The dataset that is searched is the set of all previously synthesized lab results.

The substructure search field specifies a substructure pattern to look for in the lab results. Dots and round brackets have their usual meanings. In addition, a vertical bar '|' can be used to indicate a balanced pruned structure. For example "(.(|).)" looks for 1-1 loops with just the closing pairs and unpaired bases, while "(((.(((|))))))" looks for 1-0 bulges with attached 3-stacks. Likewise "((((|))((|))((|))))" looks for a 0-0-0-0 (4-way) multiloop with at least attached 2-stacks. For those with non-US-ASCII keyboards, an exclamation mark '!' may be substituted for the vertical bar '|'.
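One way to read the '|' notation is as a jump across a balanced region: after matching the text before the bar, matching resumes at the partner of the last open bracket. The following is a minimal Python sketch of that reading, not the tool's actual implementation. The function names are hypothetical, and the sketch assumes every bar sits between a '(' and a ')' (so it does not handle the leading-bar hook anchor).

```python
def pair_table(structure):
    """Map each paired index to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def find_matches(structure, pattern):
    """Return a list of hits; each hit is a list of (start, end) spans,
    one span per bar-separated segment of the pattern."""
    segments = pattern.replace('!', '|').split('|')
    # Assumption: each '|' directly follows a '(' (no leading bar).
    if any(not seg.endswith('(') for seg in segments[:-1]):
        raise ValueError("each '|' must directly follow a '('")
    pairs = pair_table(structure)
    hits = []
    for start in range(len(structure)):
        pos, spans = start, []
        for k, seg in enumerate(segments):
            if structure[pos:pos + len(seg)] != seg:
                break
            spans.append((pos, pos + len(seg)))
            if k < len(segments) - 1:
                # Skip the pruned (balanced) region: resume at the
                # partner of the last '(' matched by this segment.
                pos = pairs[pos + len(seg) - 1]
        else:
            hits.append(spans)
    return hits
```

With this sketch, find_matches("((.((....)).))", "(.(|).)") reports a single 1-1 loop hit covering indices 1-3 and 10-12 (zero-based, end-exclusive spans).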

Adding a leading '|' to the search anchors it to the "hook" area, so that "|(((((((....)))))))." will look for something that looks like a barcode in the hook area, and "|(|)..(((((((....)))))))." will match cases where a lab structure is separated from a barcode by exactly two unpaired bases.

Lab-Based Restrictions

Two other search fields can be used to refine the results at the lab level: lab id and omit structs.

Using lab id you can specify particular labs whose results are to be explicitly included in or excluded from consideration. If you specify a lab number or a blank-separated list of lab numbers, the results are restricted to those labs. If you specify one or more lab numbers that are each prefixed by a minus sign (e.g. -17320), those labs are excluded from consideration.
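The include/exclude convention for the lab id field can be sketched as follows. This is an illustrative Python sketch only; the function names are hypothetical, and how the tool treats a field that mixes plain and minus-prefixed numbers is an assumption here (excluded labs are dropped first, then any include list is applied).

```python
def parse_lab_ids(text):
    """Split a lab id field into (include, exclude) sets of lab numbers."""
    include, exclude = set(), set()
    for token in text.split():
        if token.startswith('-'):
            exclude.add(int(token[1:]))   # e.g. "-17320" excludes lab 17320
        else:
            include.add(int(token))
    return include, exclude


def lab_allowed(lab_id, include, exclude):
    """True if the lab passes the include/exclude restrictions."""
    if lab_id in exclude:
        return False
    # An empty include set means "no restriction".
    return not include or lab_id in include
```

For example, parse_lab_ids("-17320") yields an empty include set and an exclude set containing 17320, so every lab except 17320 is considered.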

The omit structs field can be used to specify a blank-separated list of zero or more substructures that must not be found in the secondary structure of the labs being considered. For example if for some reason you wanted to exclude labs that included triloops or 1-1 loops, you could specify "(...) (.(|).)" in the omit structs field. If either of these structures was found in the lab, it would not be used.

You can also specify whether the search will include early lab round results and/or results from the newer lab process that uses barcodes. These are selected using the Early Labs and Cloud Labs options.

If you are simply interested in finding which (if any) labs have a given substructure pattern, you can specify the just labs option and it will list the locations of the structure matches found within the secondary structure of any synthesized labs (with links to the lab results pages).

Design/Results-Based Restrictions

The filterBy field is a supplement to substructure with aligned values for sequence matching/filtering. For example with a substructure of '((....))', a filterBy of 'NNGNRANN' would select only GNRA tetraloops with attached 2-stacks and 'NSUNCGSN' would select only UNCG tetraloops with strong (S={C,G}) closing pairs. If the substructure field has a '|', the filterBy should have a '|' at the same relative offset.

In addition to the regular nucleotide characters, A, C, G, U, the standard RNA grouping characters can be used for the filterBy field: B={C,G,U}, D={A,G,U}, H={A,C,U}, V={A,C,G}, K={G,U}, M={A,C}, R={A,G}, S={C,G}, W={A,U}, Y={C,U}, N={A,C,G,U}.
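The position-by-position matching implied by filterBy can be sketched in Python as below. The grouping codes are taken from the list above; the function name and the convention that the matched bases are passed in as one string with the pruned region already removed are assumptions made for illustration.

```python
# Grouping characters as listed above, plus the plain nucleotides.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'U': 'U',
    'B': 'CGU', 'D': 'AGU', 'H': 'ACU', 'V': 'ACG',
    'K': 'GU', 'M': 'AC', 'R': 'AG', 'S': 'CG',
    'W': 'AU', 'Y': 'CU', 'N': 'ACGU',
}


def matches_filter(bases, filter_by):
    """True if every base is allowed by the code at the same position.

    `bases` is assumed to hold only the bases aligned with the
    substructure pattern, so any '|' in the filter is dropped.
    """
    codes = filter_by.replace('|', '').upper()
    if len(codes) != len(bases):
        return False
    return all(b in IUPAC.get(c, '') for b, c in zip(bases.upper(), codes))
```

Under these assumptions, matches_filter('GCGAAAGC', 'NNGNRANN') is true (a GNRA tetraloop), while matches_filter('GAUUCGUC', 'NSUNCGSN') is false because the A-U closing pair is not strong.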


Extra memory rules for pyrimidines and purines:

Pyrimidine bases: CUT - pygmy short = cut short - short bases

Purine bases: AG - a giant = long bases

The groupBy field is a supplement to substructure for the grouping of results. For example with a 1-1 loop selecting substructure of '((.((|)).))', a groupBy of 'NXNXN|NXNXN' would create 'XXXX' groups based on the closing pairs, where the Xs are ungrouped bases. Likewise, with a substructure of '((....))' and filterBy of 'NNGNRANN', a groupBy of 'NNXXXXNN' would categorize by the hairpin configurations, clustering together alternative stack configurations, while a groupBy of 'XXNNNNXX' would categorize by stack design. The order of the bases (or subgroups) in the group names matches the order found in the groupBy pattern and sequences.
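Reading the examples above, the 'X' positions appear to be the ones that contribute to a group's name, while 'N' positions are ignored. A small Python sketch under that assumption follows; the function names and the input convention (bases aligned with the substructure, pruned region removed) are hypothetical.

```python
from collections import defaultdict


def group_key(bases, group_by):
    """Build a group name from the bases at the 'X' positions,
    in the order they appear in the groupBy pattern."""
    mask = group_by.replace('|', '')
    return ''.join(b for b, m in zip(bases, mask) if m == 'X')


def group_results(fragments, group_by):
    """Cluster matched fragments by their group name."""
    groups = defaultdict(list)
    for bases in fragments:
        groups[group_key(bases, group_by)].append(bases)
    return dict(groups)
```

For instance, with the fragment 'GCGAAAGC', a groupBy of 'NNXXXXNN' gives the loop-based name 'GAAA', and 'XXNNNNXX' gives the stack-based name 'GCGC'.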

You can also restrict the search to lab results that match certain criteria. For example, to consider only results that score within a given range, you can specify optional min score and max score values (defaults 0 and 100 respectively).

You can also restrict the search to include only lab results where some particular subsequence is found, such as 'GGG' for sequences with a run of three or more Gs somewhere (anywhere) in the design (not just in the targeted substructure). The omit sequences field does the opposite, excluding from consideration designs that have the listed sequence(s) anywhere within the design.
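The score and sequence restrictions amount to a simple per-design predicate. The sketch below is illustrative Python only: the record layout (a dict with 'sequence' and 'score' keys), the parameter names, and the inclusive score bounds are assumptions, not the tool's actual data format.

```python
def passes_filters(design, min_score=0, max_score=100,
                   contains=None, omit=()):
    """True if a design record passes the score and sequence restrictions.

    `design` is a hypothetical record such as
    {'sequence': 'GGGAAACUUCGGUUCCC', 'score': 87}.
    """
    if not (min_score <= design['score'] <= max_score):
        return False
    seq = design['sequence']
    # Required subsequence may appear anywhere in the full design.
    if contains and contains not in seq:
        return False
    # Any listed omit sequence anywhere in the design rejects it.
    return not any(s in seq for s in omit)
```

A call such as passes_filters(record, min_score=80, contains='GGG', omit=['CCCC']) would then keep only high-scoring designs that contain a GGG run and no CCCC run.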

Output Controls

For output, the default is a tabular summary, but by specifying a mode of csv you can get an Excel-compatible listing of results. Specifying json as the output mode will dump the results in a format that is easy to import into JavaScript and other JSON-compatible tools. Both the csv and json output include more data than is shown in the standard tabular display.

Various types of coloration of results are also currently being experimented with:

  • base: show ACGU in standard base colors
  • CBNF: Continuous Background, Neutral Foreground
  • CBTF: Continuous Background, Target Foreground
  • CFNB: Continuous Foreground, Neutral Background
  • CFTB: Continuous Foreground, Target Background

See also

CSE using LabDataMiner for Late Bulge

CSE for Reversed G-C Multibranch Loop Closing Base Pairs 3

Help page for LabDataMiner