RNA graph
From Purdue Genomics Database Facility
| RNA Home | RNA Structure Match | RNA_Structure_Datasets | RNA_Papers |RNA_misc_(Internal_discussion)|
Please help by researching items labeled "needs reader". leave a note that you are doing so using the four tildes '~~~~'
Contents |
CVS
CVS repository:/cbio/server/lib/CVS_Repository
- CVS Web browser(must register with plantsp)
- more information on CVS, and our installation is on the group wiki
Nomenclature
The idea of this project is to use graph theoretical approaches to find the greates common folded structure in an ad hoc set of RNA sequences. In outline
- Each atomic stem (folded unit) is a node of graph
- Edges represent relations between atomic units, which can be
- nested or internal, stem b is in the loop of stem a (edge:i)
- serial, stem a and stem b are a non-overlapping sequence (edge:s or none)
- overlapping, e.g. pseudoknotted, half of stem b is in the loop of stem a (or vice versa) (edge:o)
- incompatible (because they use the same sequence), stem a and stem b would use the same sequence ( edge:x)
There should be some examples here, but the drawings are hard to make.
Requirements
- The structures to be considered must include pseudoknots
- The approach must work in the presence of sequences that, in fact, do not have a common structure from the biological point of view. This is very important because a biologically selected set of sequences will rarely be clean - contamination with "junk" sequences is to be expoected
- Do we need approximate subgraph matching? Or because we can enumerate all graphs up to a certain size, is exact matching sufficient (at least for initial screening).
- Do we need induced graph matching?
Known Problems and Approaches
- graph isomorphism - see the wikipedia introductory article
- subgraph isomorphism - For graphs G1 and G2 does G2 contain a subgraph isomorphous to G1?
- Induced subgraph isomorphism problem - for gaaphs G1=(V1, E1) and a graph G2=(V2, E2) of lesser or equal order to G1.
, can one map the vertices of G2 to vertices of G1 such that for all vertices x, y in V2, edge (x, y) is in E2 if and only if the edge (f(x), f(y)) is in E1. This is different from the in that the inducted subgraph in G1 cannot have edges which aren't also in G2. In subgraph isomorphism, these "extra" edges may be present. (excerpted from Wikipedia) - Maximum common subgraph (MCS) isomorphism problem - For graphs G1 and G2, what is the largest subgraph of G1 that is isomorpus to a subgraph of G2? One possible solution for this problem is to build product graph, in which the largest clique (see below) represents a solution for the MCS problem. MCS algorithms have a long tradition in cheminformatics and pharmacophore mapping, but do not seem to be the best solution for finding the greates common structure in a contaminated set of RNA sequences. See also the folowing Wikipedia articles:
- Clique problem - A clique in a graph is a set of pairwise adjacent vertices, or in other words, an induced subgraph, which is a complete graph. The clique problem is the problem of determining whether a graph contains a clique of at least a given size k. Once we have located k or more vertices which form a clique. The corresponding optimization problem, the maximum clique problem, is to find the largest clique in a graph. (excerpted from Wikipedia)
- mining structured data - the graphs link on this page contains an extensive list of papers, algorithms and papers (needs reader) Prasad
References for general approaches
- SAGA: a subgraph matching tool for biological graphs. Yuanyuan Tian, Richard C. McEachin, Carlos Santos, David J. States and Jignesh M. Patel. Bioinformatics 23:232–239, 2007. doi:10.1093/bioinformatics/btl571 [[1]]
SAGA provides a tool for matching a query (<100 nodes) vs a database of preindexed graphs. Graph distance and indexing functions are very relevant but probably cannot be used without modification.Gribskov 10:24, 2 June 2007 (EDT)
- He,H. and Singh,A.K. (2006) Closure-tree: an index structure for graph queries. In Proceedings ICDE 2006, pp. 38–49. (needs a reader)
- Shasha,D. et al. (2002) Algorithmics and applications of tree and graph searching. In Proceedings of PODS 2002, pp. 39–52. (needs a reader) Aditi 18:10, 3 June 2007 (EDT)
- Yan,X. et al. (2004) Graph indexing: a frequent structure-based approach. In Proceedings of SIGMOD 2004, pp. 335–346.(needs a reader) Reazur 3:10, 4 June 2007 (EDT)
- Yan,X. et al. (2005) Substructure similarity search in graph databases. In Proceedings of SIGMOD 2005, pp. 766–777.(needs a reader) Aditi 17:43, 8 June 2007 (EDT)
- Yan,X. et al. (2006) Searching substructures with superimposed distance. In Proceedings of ICDE 2006, pp. 88–99.(needs a reader)
Alternative approach to graph matching problem
These two papers describe tractable way of mining frequent graphs. The idea of generating DFS codes for enumerating structures might be helpful for us.
- gSpan: Graph-Based Substructure Pattern Mining, by X. Yan and J. Han. Proc. 2002 of Int. Conf. on Data Mining (ICDM'02).
- Technical Report for gSpan
- CloseGraph: Mining Closed Frequent Graph Patterns, by X. Yan and J. Han. Proc. 2003 of Int. Conf. Knowledge Discovery and Data Mining (SIGKDD'03)
- A maximum common substructure-based algorithm for searching and predicting drug-like compounds, by Yiqun Cao, Tao Jiang and Thomas Girke Bioinformatics Advance Access published on July 1, 2008, DOI 10.1093/bioinformatics/btn186.Bioinformatics 24: i366-i374.
A working version of gSpan is available here
Related Software
Packages we might use
- Graphviz - an open source graph visualization software with several graph layout programs.
- nauty - (no automorphisms, yes?) is a set of procedures for determining the automorphism group of a vertex-coloured graph.
- Vienna RNA package - The Vienna RNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.
- MFold - Michael Zuker's minimum free energy RNA folding program (and server)
