| GRIL algorithms |
OverviewGRIL locates the breakpoints of inversions and rearrangements in DNA sequences. The breakpoints located by GRIL represent the boundaries of collinear regions of sequence called Locally Collinear Blocks (LCBs). Each locally collinear block may contain non-homologous sequence regions but may not contain significant rearrangements of homologous sequence. In GRIL, significance of a particular homologous region is determined by user-specified parameters such as the length of the surrounding collinear region. GRIL uses three basic steps to locate inversions and rearrangements. First, a string matching algorithm locates regions of sequence identity. Second, some of the matching regions found in the first step are filtered out based on user specified criteria. Finally, the remaining regions of sequence identity are organized into Locally Collinear Blocks (LCBs). Step 1: Locating Regions of Sequence IdentityThe first step of GRIL involves locating regions of sequence identity on which it bases its inference of collinear regions. The particular string matching algorithm implemented in GRIL locates maximal unique matches (MUMs) present in each sequence under study. A maximal unique match is an identical region in some set of sequences that could not possibly be extended to include a larger region without including a mismatching character. A MUM is unique because the complete character sequence in the region it covers is not repeated anywhere else in the DNA sequence. In other words, the DNA sequence in the region covered by a MUM is present exactly once in each genome. Formally we define a MUM as the tuple To locate MUMs in several sequences simultaneously, GRIL implements an extension of traditional seed-and-extend style string matching to multiple sequences. First, a data structure called a Sorted Mer List (SML) is constructed for each sequence. The SMLs are then used to identify candidate MUM seed matches. If a candidate MUM seed is not part of an existing MUM, the seed is extended and added to a list of known MUMs. Step 2: Filtering MatchesOnce a set of MUMs has been identified, GRIL removes some MUMs from the set based on user-specified filtering criteria. There are four criteria that GRIL can use to filter MUMs at this stage:
GRIL applies the filters in the following order: minimum length threshold, maximum generalized offset threshold, minimum identity threshold, minimum range threshold. Step 3: Determining Locally Collinear Blocks (LCBs) In this final step, GRIL determines Locally Collinear Blocks based on the set of MUMs remaining after filtering has been applied. The output of this step is a set of LCBs. An LCB can be defined formally as a collinear region was defined in the previous section, specifically GRIL uses a standard breakpoint determination algorithm to partition the set of filtered MUMs into a set of LCBs. First, GRIL orders the MUMs on The recorded set of breakpoints defines a partitioning of the set of filtered MUMs into a set of LCBs. Memory RequirementsThe sorted mer list construction phase is the most memory intensive component of the GRIL algorithm. To construct an SML, GRIL requires approximately 16 bytes of memory per base of the input sequence. Since each SML is written to disk after construction GRIL's memory requirements are proportional to the length of the longest sequence. For example, detecting rearrangements in 7 bacterial genomes, the largest of which is 5.5MB, would require approximately 88MB of memory. |
|
| Last Updated ( Wednesday, 25 October 2006 ) |