Newsflash

Mauve version 2.0 (Progressive Mauve) has been released.
Read more...
 
Home arrow Software arrow GRIL - Rearrangement
GRIL algorithms PDF Print E-mail


Overview

GRIL locates the breakpoints of inversions and rearrangements in DNA sequences. The breakpoints located by GRIL represent the boundaries of collinear regions of sequence called Locally Collinear Blocks (LCBs). Each locally collinear block may contain non-homologous sequence regions but may not contain significant rearrangements of homologous sequence. In GRIL, significance of a particular homologous region is determined by user-specified parameters such as the length of the surrounding collinear region.

GRIL uses three basic steps to locate inversions and rearrangements. First, a string matching algorithm locates regions of sequence identity. Second, some of the matching regions found in the first step are filtered out based on user specified criteria. Finally, the remaining regions of sequence identity are organized into Locally Collinear Blocks (LCBs).


Step 1: Locating Regions of Sequence Identity

The first step of GRIL involves locating regions of sequence identity on which it bases its inference of collinear regions. The particular string matching algorithm implemented in GRIL locates maximal unique matches (MUMs) present in each sequence under study. A maximal unique match is an identical region in some set of sequences that could not possibly be extended to include a larger region without including a mismatching character. A MUM is unique because the complete character sequence in the region it covers is not repeated anywhere else in the DNA sequence. In other words, the DNA sequence in the region covered by a MUM is present exactly once in each genome.

Formally we define a MUM as the tuple $\langle L, S_1,...S_G\rangle$ where $L$ is the length of the MUM, $G$ is the number of sequences input to GRIL, and $S_j$ is the left-end position of the MUM in sequence $j$. The match finding step of GRIL outputs a set of MUMs $\mathbf{M} = \{M_1...M_N\}$. The $i^{th}$ MUM in $\mathbf{M}$ is referred to as $M_i$. To refer to the length of $M_i$ we use the notation $M_i.L$ and similarly, we refer to the left end of $M_i$ in sequence $j$ using the notation $M_i.S_j$. Finally, if MUM $M_i$ includes a region in the reverse complement orientation in sequence $j$, we define the sign of $M_i.S_j$ to be negative. For a further description of MUMs, see Alignment of Whole Genomes (Delcher et. al 1999).

To locate MUMs in several sequences simultaneously, GRIL implements an extension of traditional seed-and-extend style string matching to multiple sequences. First, a data structure called a Sorted Mer List (SML) is constructed for each sequence. The SMLs are then used to identify candidate MUM seed matches. If a candidate MUM seed is not part of an existing MUM, the seed is extended and added to a list of known MUMs.


Step 2: Filtering Matches

Once a set of MUMs has been identified, GRIL removes some MUMs from the set based on user-specified filtering criteria. There are four criteria that GRIL can use to filter MUMs at this stage:

  • A minimum length threshold for MUMs
  • A maximum generalized offset distance threshold for MUMs adjacent in the first sequence under study. The generalized offset of MUM $M_i$ is computed as

    \begin{displaymath} \sum_{j=1}^{G}\vert M_i.S_j - M_i.S_1\vert \end{displaymath}

    where $G$ and $S$ are defined as above. This filter allows removal of individual matches that would potentially suggest a rearrangement to a distant part of the sequence.

  • A minimum range threshold. The range of each collinear region is calculated based on the MUMs found in Step 1. Let a collinear set of MUMs be defined as $cs \subseteq \mathbf{M}$. The range of a collinear region for sequence $j$ can then be defined as  

    \begin{displaymath} Range( cs, j ) = \max_{M_a \in cs} \vert M_a.S_j\vert + M_a.L - \min_{M_b \in cs} \vert M_b.S_j\vert \end{displaymath}
     
  • A minimum identity threshold. The percent identity is calculated over each collinear region based on the MUMs found in Step 1. If the percent identity of a collinear region does not meet this threshold it is removed. Legal values for the identity are a number between 0 and 1. For example, 0.1 would indicate a minumum of 10% identity. If we define a collinear set of MUMs $cs$ as above, the identity of $cs$ can be defined formally as  
    \begin{displaymath} Identity( cs ) = \frac{\sum_{M_i \in cs}M_i.L}{\max_j Range(cs, j)} \end{displaymath}

    where $Range( cs, j )$ denotes the range of collinear set $cs$ in sequence $j$ as defined above.

GRIL applies the filters in the following order: minimum length threshold, maximum generalized offset threshold, minimum identity threshold, minimum range threshold.


Step 3: Determining Locally Collinear Blocks (LCBs)

In this final step, GRIL determines Locally Collinear Blocks based on the set of MUMs remaining after filtering has been applied. The output of this step is a set of LCBs. An LCB can be defined formally as a collinear region was defined in the previous section, specifically $lcb \subseteq \mathbf{M}$ where $M_i$ is the $i^{th}$ MUM in the LCB. The MUMs that constitute an LCB must satisfy a total ordering property such that $M_i.S_j \leq M_{i+1}.S_j$ holds for all i, $1 \leq i \leq \vert lcb\vert$ and all $j$, $1 \leq j \leq G$.

GRIL uses a standard breakpoint determination algorithm to partition the set of filtered MUMs into a set of LCBs. First, GRIL orders the MUMs on $\vert M_i.S_0\vert$. Next, a monotonically increasing label between 1 and $\vert lcb\vert$ is assigned to each MUM corresponding to the index of the MUM in the ordering on $\vert M_i.S_0\vert$. We will refer to the label of the $i^{th}$ MUM as $M_i.label$. Note that $M_i.label \in \mathrm{I\!N}$. Next, the set of MUMs is repeatedly reordered based on $\vert M_i.S_j\vert$ for $j = 2 ... G$. After each reordering, the set of MUMs are examined for breakpoints. A breakpoint exists between $M_i$ and $M_{i+1}$ if $M_i.label + 1 \neq M_{i+1}.label$ and both $M_i$ and $M_{i+1}$ are in the forward orientation, or if $M_i.label - 1 \neq M_{i+1}.label$ and both $M_i$ and $M_{i+1}$ are in the reverse complement orientation. A breakpoint also exists if $M_i$ is in a different orientation than $M_{i+1}$ in sequence $j$, e.g. the sign of $M_i.S_j$ is different than the sign of $M_{i+1}.S_j$.

The recorded set of breakpoints defines a partitioning of the set of filtered MUMs into a set of LCBs.


Memory Requirements

The sorted mer list construction phase is the most memory intensive component of the GRIL algorithm. To construct an SML, GRIL requires approximately 16 bytes of memory per base of the input sequence. Since each SML is written to disk after construction GRIL's memory requirements are proportional to the length of the longest sequence. For example, detecting rearrangements in 7 bacterial genomes, the largest of which is 5.5MB, would require approximately 88MB of memory.

Last Updated ( Wednesday, 25 October 2006 )
 
© 2014 Genome Evolution Laboratory