What is gril?
==============

gril is a tool to detect the locations of genomic rearrangements in a set of sequences.

Using gril
==============

gril is a command line tool that takes sequence files as input and prints the coordinates of genomic rearrangements as output. Each sequence file input to gril must have a .fas, .gbk, or .seq file name extension that matches its file format: .fas for the FastA file format, .gbk for the GenBank flat file format, and .seq for the DNAstar sequence format.

A description of the command line options to gril can be generated by executing it with no options:

    C:\Development> gril
    gril [-i input file] [-l] [-m mer size] [-s min match size] [-f max offset diff] [-d min identity] [-r min range] [-o output file] [seq1 filename] [sml1 filename] ... [seqN filename] [smlN filename]
    Options:
      -m, --seed-size=  Initial seed match size, default is 31
      -o, --output=  Output file name (Outputs to screen if not specified)
      -l, --locate-LCBs  Locate locally collinear blocks (LCBs)
      -s, --min-match-size=  Minimum length of matches in b.p. to use for LCB determination (Implies -l)
      -f, --max-offset-diff=  Maximum permissible difference in generalized offset between adjacent matches (Implies -l)
      -d, --min-identity=  Minimum LCB identity, a real value between 0 and 1, default is 0 (Implies -l)
      -r, --min-range=  Minimum LCB range in base pairs (Implies -l)
      -i, --match-input=  Use specified match file instead of searching for matches

An Example
==============

As an example, let's detect rearrangements in the genome sequences of E. coli K-12 MG1655, Salmonella Typhimurium LT2, and Shigella flexneri 2a.
The three genome sequences are in the following FastA format files: ecoli_k12.fas, typhimurium.fas, and shigella_flexnerii_2a.fas. The command line syntax is:

    > gril -m 23 -r 15000 ecoli_k12.fas ecoli_k12.sml typhimurium.fas typhimurium.sml shigella_flexnerii_2a.fas shigella_flexnerii_2a.sml

The above command line tells gril to find LCBs using a MUM seed size of 23 and a minimum rearrangement size of 15000 b.p. in all sequences. The ecoli_k12.sml, typhimurium.sml, and shigella_flexnerii_2a.sml files are sorted mer list files. gril creates these files if they do not already exist.

The resulting output is:

    Sequence loaded successfully.
    ecoli_k12.fas 4639221 base pairs.
    Creating sorted mer list
    Create time was: 6 seconds.
    Sequence loaded successfully.
    typhimurium.fas 4857432 base pairs.
    Creating sorted mer list
    Create time was: 6 seconds.
    Sequence loaded successfully.
    shigella_flexnerii_2a.fas 4607203 base pairs.
    Creating sorted mer list
    Create time was: 6 seconds.
    0%..1%..2%..3%..4%..5%..6%..7%..8%..9%..10%..11%..12%..13%..14%..15%..16%..17%..18%..19%..20%..21%..22%..23%..24%..25%..26%..27%..28%..29%..30%..31%..32%..33%..34%..35%..36%..37%..38%..39%..40%..41%..42%..43%..44%..45%..46%..47%..48%..49%..50%..51%..52%..53%..54%..55%..56%..57%..58%..59%..60%..61%..62%..63%..64%..65%..66%..67%..68%..69%..70%..71%..72%..73%..74%..75%..76%..77%..78%..79%..80%..81%..82%..83%..84%..85%..86%..87%..88%..89%..90%..91%..92%..93%..94%..95%..96%..97%..98%..99%..
    Starting with 43234 MUMs
    There are 9065 3-way MUMs longer than 0 b.p.
    Filtering leaves 6544 MUMs
    And 15 breakpoints
    SequenceCount 3
    Sequence0File ecoli_k12.fas
    Sequence1File typhimurium.fas
    Sequence2File shigella_flexnerii_2a.fas
    0 Start: 172 172 172 End: 638566 670847 545190
    1 Start: 656605 693005 -693141 End: 784885 823801 -572512
    2 Start: 830703 886194 774307 End: 1210698 1325533 1197455
    3 Start: 1313968 -1825650 1312266 End: 1384860 -1777606 1381717
    4 Start: 1435927 -1745007 -1860293 End: 1500900 -1697788 -1817081
    5 Start: 1661308 -1574114 1637913 End: 1771318 -1441807 1756162
    6 Start: 1793284 -1419362 -1551748 End: 1840865 -1379510 -1499912
    7 Start: 1899038 1927489 -1449286 End: 1920168 1950502 -1427391
    8 Start: 1921143 1968068 1886991 End: 3833944 3948671 3806499
    9 Start: 3868357 4029505 -3882027 End: 3905996 4061515 -3839521
    10 Start: 3909489 4069057 3919131 End: 4229053 4416460 4255223
    11 Start: 4243073 4449594 -4344198 End: 4343810 4545917 -4262325
    12 Start: 4360129 4566579 4466171 End: 4424105 4631923 4530805
    13 Start: 4437555 4643782 -4448021 End: 4476293 4713174 -4408741
    14 Start: 4595869 4800849 4564597 End: 4615673 4822212 4584754

Format of the output
====================

gril reports the start and end points of each collinear region shared by all sequences. The first (leftmost) column of output is a numerical identifier for each locally collinear block. The following three numerical columns specify the starting coordinates of the LCB in each genome, and the final three numerical columns specify the ending coordinates of the LCB in each genome. Within each group, the columns follow the order in which the sequences were given as input. Thus in the example above, the row:

    0 Start: 172 172 172 End: 638566 670847 545190

indicates that a collinear region exists among the three sequences that begins at base pair 172 in all three, and ends at 638566 in ecoli_k12, at 670847 in typhimurium, and at 545190 in shigella_flexnerii_2a.
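
The row layout described above can be read programmatically. The following is a minimal Python sketch and is not part of gril. It assumes each LCB is printed on a single line of the form "id Start: s1 ... sN End: e1 ... eN", as in the example row; the names LCB and parse_lcb_rows are hypothetical and chosen only for illustration.

```python
import re
from typing import List, NamedTuple


class LCB(NamedTuple):
    """One locally collinear block; coordinates are in sequence input order."""
    lcb_id: int
    starts: List[int]
    ends: List[int]


# Assumed row layout (taken from the example row in this document):
#   <id> Start: <s1> ... <sN> End: <e1> ... <eN>
ROW = re.compile(r"^\s*(\d+)\s+Start:\s+(.*?)\s+End:\s+(.*?)\s*$")


def parse_lcb_rows(text: str) -> List[LCB]:
    """Collect LCB rows from gril output, skipping all other lines."""
    lcbs = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # progress, header, and status lines are ignored
        starts = [int(tok) for tok in match.group(2).split()]
        ends = [int(tok) for tok in match.group(3).split()]
        if len(starts) != len(ends):
            raise ValueError("unequal start/end column counts: " + line)
        lcbs.append(LCB(int(match.group(1)), starts, ends))
    return lcbs


if __name__ == "__main__":
    sample = "SequenceCount 3\n0 Start: 172 172 172 End: 638566 670847 545190\n"
    print(parse_lcb_rows(sample)[0].ends)  # [638566, 670847, 545190]
```

Lines that do not match the assumed layout are skipped rather than reported, so the status messages that gril prints before the rows need no special handling.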
The full output of the example run describes the positions of 15 collinear regions, each longer than 15,000 b.p., that are shared by the three genomes. Negative sequence coordinates indicate that the reverse complement strand of that sequence is homologous to the forward strand of the first (reference) sequence.
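
As an illustration of the sign convention, the sketch below turns one sequence's signed start and end coordinates into a strand label and an unsigned interval. It is not part of gril and rests on two assumptions that the text above does not confirm: that the start and end of an LCB in a given sequence always share a sign, and that the absolute value of a negative coordinate is the position on the forward strand. The function name describe_interval is hypothetical.

```python
from typing import Tuple


def describe_interval(start: int, end: int) -> Tuple[str, int, int]:
    """Return (strand, left, right) for one sequence's signed LCB coordinates.

    Assumption: a negative pair marks the reverse complement strand, and the
    absolute values are forward-strand positions.
    """
    if (start < 0) != (end < 0):
        raise ValueError("start and end are assumed to share a sign")
    strand = "-" if start < 0 else "+"
    a, b = abs(start), abs(end)
    # Order the endpoints so that left <= right regardless of strand.
    return strand, min(a, b), max(a, b)


if __name__ == "__main__":
    # typhimurium coordinates of LCB 3 in the example output
    print(describe_interval(-1825650, -1777606))  # ('-', 1777606, 1825650)
```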