EvoPrinterHD
Introduction

EvoPrinter is a comparative genomics tool for discovering conserved DNA sequences that are shared among three or more orthologous DNAs (Odenwald et al., 2005). Only a single curated DNA sequence is required to initiate the rapid comparative analysis. Generated from multiple pairwise BLAT alignments (Kent, 2002), an EvoPrint presents an ordered, uninterrupted representation of evolutionarily resilient sequences within the user's DNA of interest. By superimposing the different species' evolutionary histories, the combined 'in silico' mutagenic force reveals DNA sequences that are essential for gene expression and function.

EvoPrinterHD is a second-generation comparative tool that automatically superimposes higher-resolution alignments to give an enhanced view of sequence conservation between evolutionarily distant species (Yavatkar et al., 2008). Currently available for 5 nematode, 3 mosquito, 12 Drosophila, 26 vertebrate and 59 bacteria genomes, EvoPrinterHD employs a modified BLAT algorithm (enhanced-BLAT) to facilitate the discovery of short conserved sequence blocks. Enhanced-BLAT represents three superimposed BLAT alignments of the same genomic sequence that were generated using different search and alignment parameters. When alignments between evolutionarily distant genomes are compared, enhanced-BLAT detects up to 75% more conserved bases than are identified by the original BLAT alignments. The new program also queries 3 different enhanced-BLAT alignments from each genome to identify sequence rearrangements and/or duplications, in addition to detecting sequencing gaps. EvoPrinterHD currently holds over 170 billion bp of indexed genomes in memory and allows a subset of genomes to be selected for analysis. An EvoDifferences profile is also generated to portray conserved sequences that are uniquely lost in any one of the orthologs. Finally, EvoPrinterHD incorporates options that allow for (1) superimposition of multiple enhanced BLATs to highlight all conserved sequences when orthologous DNAs contain rearrangements; (2) re-initiation of the analysis using a different genome's aligning region as the reference DNA to detect species-specific changes in less-conserved regions; and (3) rapid extraction and curation of conserved cis-regulatory sequences.

EvoPrinterHD procedures

Indexing of Genomes: In addition to the original non-overlapping 11-mer genomic index of BLAT, EvoPrinterHD indexes each genome into a second set of non-overlapping 11-mers, offset by four base pairs from the initial indexing, and into a third set of non-overlapping 9-mers. The resulting staggered indexing increases the likelihood that homologous regions missed by any one of the individual indexes will be identified. The use of multiple genome indices, together with optimization of the alignment phase parameters (see below), is the basis of the enhanced detection of conserved sequences between evolutionarily distant orthologous DNAs. EvoPrinterHD currently holds in memory three independent indices of each of 5 nematode (Caenorhabditis elegans, C. brenneri, C. briggsae, C. remanei and Pristionchus pacificus), 3 mosquito (Aedes aegypti, Anopheles gambiae and Culex pipiens), 12 Drosophila (D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ananassae, D. persimilis, D. pseudoobscura, D. willistoni, D. virilis, D. mojavensis and D. grimshawi) and 26 vertebrate (human, chimpanzee, orangutan, rhesus, marmoset, mouse, rat, guinea pig, dog, cat, horse, cow, hedgehog, elephant, armadillo, platypus, opossum, chicken, lizard, X. tropicalis, Fugu, Tetraodon, Medaka, stickleback, zebrafish and lamprey) genomes, as well as 17 Staphylococcus genomes and 22 enteric bacteria genomes, representing ~170 billion bp in total memory.
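
The staggered indexing described above can be illustrated with a small sketch. This is not the EvoPrinterHD or BLAT implementation; the function names are hypothetical, and only the tile sizes (11, 11 and 9) and the 4 bp offset come from the text.

```python
from collections import defaultdict

def index_nonoverlapping(genome: str, k: int, offset: int = 0) -> dict:
    """Map each non-overlapping k-mer, tiled from `offset`, to its start positions."""
    index = defaultdict(list)
    for start in range(offset, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return dict(index)

def build_staggered_indices(genome: str) -> list:
    """Build the three indices named in the text (hypothetical helper)."""
    return [
        index_nonoverlapping(genome, k=11, offset=0),  # original BLAT-style tiling
        index_nonoverlapping(genome, k=11, offset=4),  # same tile size, shifted 4 bp
        index_nonoverlapping(genome, k=9, offset=0),   # shorter tiles
    ]

# Toy sequence, for illustration only.
genome = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCAATTGG"
indices = build_staggered_indices(genome)
```

Because the three tilings start at different positions or use different tile lengths, a short conserved block that straddles a tile boundary in one index can still fall entirely inside a tile of another.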

Genome sequence files and their assembly dates: The following genome sequence files were curated from the Genome Bioinformatics Group of the University of California, Santa Cruz: Human, February 2009 (hg19); Chimpanzee, October 2010 (panTro3); Orangutan, July 2007 (ponAbe2); Rhesus, January 2006 (rheMac2); Marmoset, March 2009 (calJac3); Guinea Pig, February 2008 (cavPor3); Rat, November 2004 (rn4); Mouse, July 2007 (mm9); Cat, December 2008 (felCat4); Dog, May 2005 (canFam2); Horse, September 2007 (equCab2); Cow, October 2007 (bosTau4); Opossum, January 2006 (monDom4); Platypus, March 2007 (ornAna1); Chicken, May 2006 (galGal3); Lizard, May 2010 (anoCar2); Xenopus tropicalis, August 2005 (xenTro2); Zebrafish, July 2010 (danRer7); Tetraodon, March 2007 (tetNig2); Fugu, October 2004 (fr2); Stickleback, February 2006 (gasAcu1); Medaka, October 2005 (oryLat2); Lamprey, September 2010 (petMar2); D. melanogaster, April 2006 (dm3); D. simulans, April 2005 (droSim1); D. sechellia, October 2005 (droSec1); D. yakuba, November 2005 (droYak2); D. erecta, August 2005 (droEre1); D. ananassae, August 2005 (droAna2); D. pseudoobscura, November 2005 (dp3); D. persimilis, October 2005 (droPer1); D. virilis, August 2005 (droVir2); D. mojavensis, August 2005 (droMoj2); D. grimshawi, August 2005 (droGri1); C. elegans, January 2007 (ce4); C. brenneri, January 2007 (caePb1); C. briggsae, January 2007 (cb3); C. remanei, March 2006 (caeRem2); and P. pacificus, February 2007 (priPac1). The genome sequence files for Elephant, June 2005; Hedgehog, June 2006; and Armadillo, June 2005 were downloaded from the Broad Institute. The mosquito genome sequence files for Aedes aegypti, Anopheles gambiae and Culex pipiens were curated from the VectorBase database.
The following bacteria genome sequence files were curated from the BacMap database of the University of Alberta: Staphylococcus aureus COL; Staphylococcus aureus MRSA252; Staphylococcus aureus MSSA476; Staphylococcus aureus Mu50; Staphylococcus aureus MW2; Staphylococcus aureus N315; Staphylococcus aureus subsp. aureus NCTC 8325; Staphylococcus aureus RF122; Staphylococcus aureus subsp. aureus USA300; Staphylococcus epidermidis ATCC 12228; Staphylococcus epidermidis RP62; Staphylococcus haemolyticus JCSC1435; Escherichia coli 536; Escherichia coli APEC O1; Escherichia coli CFT073; Escherichia coli O157:H7 EDL933; Escherichia coli K12 MG1655; Escherichia coli W3110; Escherichia coli O157:H7 Sakai; Klebsiella pneumoniae MGH 78578; Salmonella enterica Choleraesuis SC-B67; Salmonella enterica Paratyphi A ATCC 9150; Salmonella typhimurium LT2; Salmonella enterica CT18; Salmonella enterica Ty2; Shigella boydii Sb227; Shigella dysenteriae Sd197; Shigella flexneri 2a 2457T; and Shigella flexneri 301. The genome sequence files for Staphylococcus aureus subsp. aureus JH1, Staphylococcus aureus subsp. aureus JH9, Staphylococcus aureus Mu3, and Staphylococcus aureus subsp. aureus str. Newman were curated from the European Bioinformatics Institute of the European Molecular Biology Laboratory. The genome sequence file for Escherichia coli UTI89 was taken from the Enteropathogen Resource Integration Center, and genome sequence data for Salmonella bongori were downloaded from the Sanger Institute Sequencing Centre.

Search and Alignment Parameters: The alignment sensitivity of EvoPrinterHD for the discovery of short blocks of conserved sequence homology between evolutionarily distant orthologs was increased by optimizing the Genomic Finding (gf) client program parameters of the original BLAT algorithm. The search and alignment parameters were adjusted by: (1) increasing the stringency factor for low homology alignments from 0.0005 to 0.001; (2) reducing the initial expansion gap between adjacent hits from four to three; (3) reducing the additional expansion gap penalty from three to one; (4) increasing the maximum allowable gaps and inserts from 12 to 16; and (5) changing the allowable codon gap parameter from two to three to optimize for codon polymorphisms in open reading frames.

eBLAT Alignments: As an output of the client program, EvoPrinterHD generates a superimposed composite of the three BLAT alignments obtained from the differently indexed copies of each genome. The algorithm does this by first creating an array of nucleotide strings, one for each input reference DNA BLAT alignment, and then looping through the strings one base at a time, outputting a capital letter when at least one of the readouts has an aligning base at that position, thereby generating a composite readout that displays all conserved bases. The program also generates BLAT readouts of the test genome aligning region, and both are stored in memory for later analysis, for EvoPrint generation and for exchange of the input reference DNA, which is accomplished by selecting one of the aligning region sequences as the new reference sequence to reinitiate the analysis. The algorithm also generates eBLATs for the second and third highest scoring aligning regions for each of the selected genomes.
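
The superimposition step can be sketched as follows. The sketch assumes that each BLAT readout is a string of the same length as the reference DNA, with uppercase letters marking aligned bases and lowercase letters marking unaligned ones; the function name and the toy readouts are hypothetical.

```python
def superimpose_any(readouts: list) -> str:
    """Uppercase a position if at least one readout has it uppercase (aligned)."""
    if len({len(r) for r in readouts}) != 1:
        raise ValueError("all readouts must cover the same reference DNA")
    return "".join(
        column[0].upper() if any(base.isupper() for base in column) else column[0].lower()
        for column in zip(*readouts)
    )

# Toy readouts of one reference DNA against the three indices of a test genome.
blat_11mer = "ACGTacgtacGT"
blat_11mer_offset = "acGTACgtacgt"
blat_9mer = "acgtacgtACgt"

eblat = superimpose_any([blat_11mer, blat_11mer_offset, blat_9mer])
print(eblat)  # ACGTACgtACGT
```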

The nematode and Drosophila EvoPrinterHD algorithms automatically generate 45 and 108 pairwise BLAT alignments and then assemble 15 and 36 eBLAT readouts, respectively (3 eBLATs per genome), while the vertebrate EvoPrinterHD generates up to 225 pairwise BLAT alignments, assembling 75 eBLAT readouts, and the Staphylococcus program generates up to 144 pairwise BLAT alignments, assembling 48 eBLAT alignments. To reduce alignment times, EvoPrinterHD currently employs two Dell PowerEdge 6950 series servers running Red Hat Enterprise Linux 5 (RHEL5), one with 128 GB RAM and the other with 64 GB RAM, each with 2.8 GHz dual quad-core processors. The servers operate in parallel over the Network File System to simultaneously query multiple indexed genomes.
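
The alignment counts above are consistent with nine pairwise BLATs per genome (three indices for each of the three highest scoring aligning regions), superimposed into three eBLATs. The short check below makes that arithmetic explicit. The genome counts of 25 (vertebrate) and 16 (Staphylococcus) are back-calculated from the stated totals rather than given in the text, and the function name is hypothetical.

```python
INDICES_PER_GENOME = 3  # two 11-mer indices and one 9-mer index
REGIONS_PER_GENOME = 3  # first, second and third highest scoring aligning regions

def alignment_counts(n_genomes: int) -> tuple:
    """Return (pairwise BLATs, eBLATs) for a given number of queried genomes."""
    eblats = n_genomes * REGIONS_PER_GENOME
    blats = eblats * INDICES_PER_GENOME
    return blats, eblats

for label, n in [("nematode", 5), ("Drosophila", 12),
                 ("vertebrate", 25), ("Staphylococcus", 16)]:
    print(label, alignment_counts(n))
```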

To assess the efficacy of eBLAT alignments in comparison to the original BLAT, we determined the percent increase in pairwise alignment scores (the total number of aligning bases in the input DNA) of eBLAT relative to BLAT, using 10 different intergenic regions from the Drosophila melanogaster genome. eBLAT exhibited only a modest increase in the identification of shared sequences between closely related species; however, it identified significantly more conserved sequences when the D. melanogaster genomic fragments were aligned to the more evolutionarily distant orthologs. The increase in identified shared sequences ranged from 7.5% for D. simulans (evolutionary divergence time from D. melanogaster is ~2 My) to 74.8% for D. grimshawi (separated from D. melanogaster for ~40 My). The same enhanced discovery of sequence conservation was also observed when evolutionarily distant vertebrate or nematode species were compared. For example, eBLAT alignments between human and Xenopus or C. elegans and C. briggsae orthologous DNAs identified 76% and 85% more shared sequences, respectively, than the original BLAT alignments.
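
As a worked illustration of the comparison, the percent increase for one pairwise alignment can be computed from the two alignment scores. The scores below are invented for illustration and are not data from the study; how the ten intergenic regions were combined is not specified here, so the sketch handles a single region only.

```python
def percent_increase(blat_score: int, eblat_score: int) -> float:
    """Percent increase in aligning bases of eBLAT over the original BLAT."""
    if blat_score <= 0:
        raise ValueError("BLAT score must be positive")
    return 100.0 * (eblat_score - blat_score) / blat_score

# Hypothetical scores: 1000 aligning bases with BLAT, 1075 with eBLAT.
print(percent_increase(1000, 1075))  # 7.5
```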

Repetitive Element Finder: A prominent feature of most, if not all, metazoan genomes is that they harbor diverse populations of repetitive elements that range in copy number from single duplications to thousands of transposable elements dispersed throughout the genome. Given that many of these repeats contain highly conserved sequences that may interfere with alignments between evolutionarily distant orthologs, it is important to identify the repetitive sequence(s) within the DNA of interest before any comparative analysis is considered. To accomplish this, the EvoPrinterHD repeat finder algorithm superimposes the first, second and third highest scoring eBLAT alignments of the input DNA to its resident genome and then color-codes the readout to identify single or multiple repeat sequences within the DNA of interest. Sequences that have one additional copy in the reference genome are shown as blue uppercase bases, while those that are present three or more times are highlighted as red bases. The algorithm also reveals whether one of the multiple repeat sequences is more homologous to the repeat present in the input DNA by highlighting single repeat sequences that flank the core multi-repeat element.
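
A minimal sketch of the color-coding logic follows. It assumes that the three self-alignments are strings over the input DNA with uppercase marking aligned bases, and that the highest scoring alignment is the input DNA matching its own locus, so that coverage by two alignments means one additional copy. The label names and function name are hypothetical.

```python
def classify_repeats(self_alignments: list) -> list:
    """Label each base by how many of the three self-alignments cover it."""
    if len(self_alignments) != 3:
        raise ValueError("expected the three highest scoring self-alignments")
    labels = {0: "unaligned", 1: "unique", 2: "blue", 3: "red"}
    return [
        labels[sum(base.isupper() for base in column)]
        for column in zip(*self_alignments)
    ]

first = "ACGTACGT"   # input DNA aligned to its own locus
second = "acGTACgt"  # second highest scoring region
third = "acgtACgt"   # third highest scoring region

print(classify_repeats([first, second, third]))
```

In this toy case the first two and last two bases are labeled unique, the third and fourth bases blue, and the fifth and sixth bases red.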

eBLAT Alignment Scorecard: For the inter-genomic comparative analysis, the program first displays the results of the different species alignments in a tabular form referred to here as the alignment scorecard. The alignment score for each eBLAT represents the total number of aligning bases in the input reference DNA; the start and end of the aligning bases within the input DNA are also noted. This information is also shown for the second and third highest scoring eBLAT alignments. Links to the alignments are provided from this page, allowing the user to view the reference DNA eBLATs and selected species BLATs. Input DNA - species alignments are arrayed in descending order of the highest eBLAT score so that the user can optimize the choice of species for the generation of an EvoPrint. By default, the algorithm automatically selects the highest alignment scores of the top scoring genomes for the initial EvoPrint analysis. From the selected genomes, the algorithm then deselects those in which sequencing gaps have been detected in the highest scoring alignments. The number of sequencing gaps detected in the aligning regions and the total number of 'Ns' are also noted on the scorecard. It is recommended that the lower scoring alignments and those containing sequencing gaps be included one at a time to extend the EvoPrint analysis to additional species. The scorecard also highlights the presence of potential rearrangements and/or duplications of multi-species conserved sequences (MCSs) in the aligning regions. This is determined, as described in detail below, from aligning sequences in the second and third alignments that are either present or absent in the first alignment.
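
The ranking and default selection can be sketched as below. The record fields, the function name and the `top_n` cutoff are assumptions made for illustration; the text does not state how many top scoring genomes are selected by default.

```python
from dataclasses import dataclass

@dataclass
class ScorecardRow:
    species: str
    score: int            # aligning bases in the input reference DNA
    start: int            # first aligning base within the input DNA
    end: int              # last aligning base within the input DNA
    sequencing_gaps: int  # gaps detected in the highest scoring alignment

def default_selection(rows: list, top_n: int) -> list:
    """Rank by score (descending), take the top_n, then drop rows with gaps."""
    ranked = sorted(rows, key=lambda row: row.score, reverse=True)
    return [row.species for row in ranked[:top_n] if row.sequencing_gaps == 0]

# Hypothetical scorecard rows.
rows = [
    ScorecardRow("species_a", 950, 1, 1000, 0),
    ScorecardRow("species_b", 400, 20, 800, 0),
    ScorecardRow("species_c", 700, 5, 990, 2),
    ScorecardRow("species_d", 650, 10, 900, 0),
]
print(default_selection(rows, top_n=3))  # ['species_a', 'species_d']
```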

Generating an EvoPrint and EvoDifferences Profile: Based on the data provided in the scorecard, different combinations of input DNA eBLAT alignments can be chosen to generate an EvoPrint. The EvoPrinter algorithm creates an array of nucleotide strings from each of the selected eBLAT alignments and then looks for conservation of sequence by looping through each of the strings one base at a time, outputting an uppercase base for only those input reference DNA nucleotides that are aligned in all of the different species eBLATs included in the analysis. Those DNA bases within the input DNA that are not shared with all species are represented as lowercase nucleotides. The 'Select/Deselect' option for each genome's eBLAT alignments allows for rapid changes in the repertoire of species alignments used to generate an EvoPrint. Progressive evolutionary changes can be quickly assessed and the discovery of conserved sequences within rearranged DNA is also rapid. For the 25 vertebrate species, genomes can be added or removed from the initial analysis simply by returning to the selection page and adding or deselecting different genomes. Because EvoPrinterHD holds the previous alignments in memory, the time required to add additional genomes to the comparative analysis is significantly reduced.
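
The EvoPrint step differs from eBLAT assembly only in requiring that a base align in every selected species rather than in at least one alignment. A minimal sketch, assuming the same uppercase/lowercase string representation of eBLAT readouts (the function name and toy data are hypothetical):

```python
def evoprint(reference: str, eblats: list) -> str:
    """Uppercase only those reference bases aligned in every selected eBLAT."""
    if not eblats or any(len(e) != len(reference) for e in eblats):
        raise ValueError("need at least one eBLAT, each as long as the reference")
    return "".join(
        ref.upper() if all(base.isupper() for base in column) else ref.lower()
        for ref, column in zip(reference, zip(*eblats))
    )

reference = "ACGTACGT"
selected = ["ACGTacGT", "ACgtACGT", "ACGTACGt"]  # one eBLAT per selected species
print(evoprint(reference, selected))  # ACgtacGt
```

Deselecting a species corresponds to dropping its string from the list and recomputing, which is cheap once the alignments are held in memory.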

An additional readout, the EvoDifferences profile, is also displayed along with the EvoPrint; it highlights the unique differences (conserved sequence losses) that each species contributes to the comparative analysis. Genomic rearrangements and/or deletions that span conserved sequences and are unique to one of the genomes included in the analysis are identified by clusters of bases in a single color. The EvoDifferences profile can also be considered a 'relaxed EvoPrint', since bases identified by a given color are present in all species except the one depicted by that color. The apparent absence of a conserved sequence or base pair from a single species could have several explanations: (1) the difference represents a true evolutionary change; (2) the difference reflects a sequencing error; and/or (3) the sequence is present but was not identified by the eBLAT alignment.
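
The 'relaxed EvoPrint' idea can be sketched by recording, for each base, the single species (if any) whose eBLAT lacks it. The species labels and function name are hypothetical, and the species name stands in for the color used in the actual readout.

```python
from typing import Optional

def evodifferences(eblats: dict) -> list:
    """For each position, name the one species missing the base, else None."""
    species = list(eblats)
    profile: list = []
    for column in zip(*eblats.values()):
        missing = [s for s, base in zip(species, column) if not base.isupper()]
        only_one: Optional[str] = missing[0] if len(missing) == 1 else None
        profile.append(only_one)
    return profile

eblats_by_species = {
    "sp1": "ACGTacGT",
    "sp2": "ACgtACGT",
    "sp3": "ACGTACGt",
}
print(evodifferences(eblats_by_species))
```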

Generating an EvoUnique Print: For bacteria, a third readout, the EvoUnique Print, is displayed along with the EvoPrint and EvoDifferences profile; it highlights sequences that are unique to a species or shared by only one or two other species. Lowercase gray bases are common to three or more of the test species' aligning regions. The EvoUnique Print thus reveals sequences that are present in the reference and in at most two of the analyzed species but absent from the other species used in the analysis. Such sequences could indicate loss of a gene from most species of a lineage or horizontal gene transfer into a few members of a lineage.
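
A sketch of the EvoUnique rule, under the assumption that a reference base is highlighted (uppercase here) when it aligns in at most two test species and shown in lowercase when it aligns in three or more. The function name, the `max_shared` parameter and the toy data are hypothetical.

```python
def evounique(reference: str, eblats: list, max_shared: int = 2) -> str:
    """Uppercase reference bases that align in at most `max_shared` test species."""
    if any(len(e) != len(reference) for e in eblats):
        raise ValueError("each eBLAT must be as long as the reference")
    return "".join(
        ref.upper() if sum(base.isupper() for base in column) <= max_shared else ref.lower()
        for ref, column in zip(reference, zip(*eblats))
    )

ref_dna = "ACGTAC"
test_eblats = ["ACGTac", "ACGtac", "ACgtac", "Acgtac"]  # four test species
print(evounique(ref_dna, test_eblats))  # acGTAC
```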

Identification of Rearranged and/or Duplicated Conserved Sequences: Once eBLAT alignments are completed, the intra-genomic comparative algorithm automatically searches the 3 highest scoring eBLAT alignments of each species for rearranged and/or duplicated MCSs by comparing the 2nd and 3rd highest scoring aligning regions with the highest scoring eBLAT. The algorithm first identifies rearranged conserved sequences by determining the number of aligning bases that are not identified in the highest scoring alignment but are unique to the 2nd or 3rd scoring alignment. The number of unique aligning bases is noted as the alignment R value on the scorecard for each of the 2nd and 3rd alignments. Next, the number of common aligning bases that are shared between the 1st and 2nd alignments, between the 1st and 3rd alignments, or among all 3 alignments is determined and displayed as the alignment D value. By comparing the alignment scores and the R/D values of the 2nd and 3rd alignments, the extent of rearrangement(s) and/or duplication(s) in the test species can be determined. For example, if an alignment's R value matches its alignment score and its D value is zero, then that alignment (the 2nd, the 3rd or both) identifies rearranged sequences. Species-specific changes in the arrangement and/or repetitive nature of conserved sequences can also be resolved by viewing the color-coded composite eBLAT (ceBLAT) for each species. To include rearranged MCSs in the EvoPrint, ceBLATs from all or a selected subset of reference DNA - species alignments can also be used.
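
The R and D values can be sketched for one lower scoring alignment at a time. This simplified version compares the 2nd or 3rd alignment only against the 1st and does not compute the three-way overlap mentioned above; the function name and toy alignments are hypothetical.

```python
def r_and_d(first: str, other: str) -> tuple:
    """Return (R, D) for a lower scoring alignment relative to the highest one.

    R counts bases aligned in `other` but not in `first` (rearranged).
    D counts bases aligned in both (duplicated).
    """
    if len(first) != len(other):
        raise ValueError("alignments must cover the same reference DNA")
    r = sum(o.isupper() and not f.isupper() for f, o in zip(first, other))
    d = sum(o.isupper() and f.isupper() for f, o in zip(first, other))
    return r, d

first_aln = "ACGTacgtacgt"
second_aln = "acgtACGTacgt"  # aligns bases the first alignment missed
third_aln = "ACgtacgtacgt"   # re-aligns bases already in the first alignment

print(r_and_d(first_aln, second_aln))  # (4, 0)
print(r_and_d(first_aln, third_aln))   # (0, 2)
```

In the toy data, the second alignment has a score of 4, R = 4 and D = 0, which matches the rearrangement pattern described in the example above, while the third alignment has R = 0 and D = 2, which points to a duplication.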

Exchanging Input Reference DNA: EvoPrinterHD allows for the rapid exchange of the input reference DNA; it draws from memory the genomic sequence of the highest aligning region of any species identified in the initial analysis. Once a change in reference DNA is requested, the alignment process is automatically reinitiated using the highest scoring aligning region of the selected genome as the new input reference DNA.

Extraction and Curation of Selected Conserved Sequences: To facilitate the comparative analysis of different cis-regulatory MCSs, EvoPrinterHD allows for the rapid curation of conserved sequences by enabling the user to automatically extract and collate these sequences in both forward and reverse-complemented orientations. The 'extract conserved sequence' option provides for the automatic extraction, naming and consecutive numbering of conserved sequence blocks of 6 bp or longer from selected regions of an EvoPrint or EvoDifferences profile. In addition to showing the EvoPrinted genomic region that contains the conserved sequences, the curated sequence list provides links to cis-Decoder algorithms (Brody et al., 2007; Brody et al., 2011) that enable the comparative analysis of individual MCSs and allow for the generation of enhancer identity tag-libraries.
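
The extraction step can be sketched with a regular expression over an EvoPrint string, in which uppercase runs are the conserved blocks. The "CSB-n" naming scheme, the function names and the toy EvoPrint are assumptions made for illustration and do not reflect the names that EvoPrinterHD actually assigns.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_conserved_blocks(evoprint_text: str, min_len: int = 6,
                             prefix: str = "CSB") -> list:
    """Return (name, forward, reverse complement) for each uppercase run >= min_len."""
    pattern = re.compile(r"[ACGT]{%d,}" % min_len)
    blocks = []
    for number, match in enumerate(pattern.finditer(evoprint_text), start=1):
        seq = match.group()
        blocks.append((f"{prefix}-{number}", seq, reverse_complement(seq)))
    return blocks

toy_evoprint = "acgtACGTTGcaTTTtgGATTACAgg"
for block in extract_conserved_blocks(toy_evoprint):
    print(block)
```

The 3 bp run "TTT" in the toy EvoPrint is skipped because it falls below the 6 bp minimum.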

References

Odenwald WF, Rasband W, Kuzin A and Brody T. (2005). EvoPrinter, a multigenomic comparative tool for rapid identification of functionally important DNA. Proc. Natl. Acad. Sci. 102: 14700-5.

Kent WJ. (2002). BLAT-- the BLAST-like alignment tool. Genome Res. 12: 656-64.

Yavatkar AS, Lin Y, Ross J, Fann Y, Brody T and Odenwald WF. (2008). Rapid detection and curation of conserved DNA via enhanced-BLAT and EvoPrinterHD analysis. BMC Genomics.

Brody T, Rasband W, Baler K, Kuzin A, Kundu M, Odenwald WF. (2007). cis-Decoder discovers constellations of conserved DNA sequences shared among tissue-specific enhancers. Genome Biol. 8(5): R7.

Brody T, Yavatkar AS, Kuzin A, Kundu M, Tyson LJ, Ross J, Lin TY, Lee CH, Awasaki T, Lee T and Odenwald WF. (2011). Use of a Drosophila genome-wide conserved sequence database to identify functionally related cis-regulatory enhancers. Developmental Dynamics. DOI: 10.1002/dvdy.22728.

