Abstract
This paper presents the benefits of using XML, XSLT, and SVG technologies combined with JavaScript scripting in the redevelopment of a genome visualization tool, with the aim of providing users with a simpler interface, maximized interactivity, and improved efficiency of genetic data analysis.
GenomePixelizer is the Genome Visualization Tool that I co-developed in 2002. It was written using the TCL/TK toolkit and was designed to help with the visualization of the relationships between duplicated genes in genome(s) and to study the relationships between members of gene clusters [1].
GenomePixelizer proved itself useful [2, 3, 4] in the detection of duplication events in genomes, tracking the "footprints" of evolution, as well as displaying the genetic maps and other aspects of comparative genetics [1].
GenomePixelizer is not an intuitive tool to use. It provides a lot of functionality; however, it requires special data pre-processing: it takes in 3 input files (a file containing setup information, a file containing pre-processed genome information and a distance matrix file) and has a complicated user interface. GenomePixelizer requires the download and setup of the TCL/TK package. Large datasets need to be subdivided into smaller datasets and re-run through GenomePixelizer in order to see more detail.
The featured tool, GenomePixelizer SVG-fied, is lightweight, dynamic and interactive. It takes in one XML file containing the setup information, genome information and distance matrix, and uses XML Style Sheets and SVG to plot genes over chromosomes and to identify duplicated genes. Furthermore, users can click on a specific gene and land on the NCBI entry for that gene. Since the graphics are scalable, users can also zoom in on the region of interest. In the near future, users will be able to rearrange chromosomes by dragging them and to move clusters of genes around for further analysis.
Finding homology is important in order to trace the evolution of living organisms. Homology stands for similarity in structure due to common ancestry. In genetics, sequence alignment is used to align two or more DNA (or protein) sequences that are suspected to be homologous and to find the regions of conservation. Any difference in the produced alignment is due to mutation during evolution (insertion or deletion of nucleotides from the sequence).
DNA (or protein) sequences can also be similar by chance. Nevertheless, when a sequence of unknown structure and function is aligned or searched against a sequence of known structure and function and produces a high-quality match, the unknown sequence is assumed to share the structure and function of the known one.
"Percent identity" is the degree of similarity of two or more sequences. If sequences have high percent identity it might be because they are homologous.
Figure 1. A sequence alignment, produced by ClustalW, of two human zinc finger proteins, identified on the left by GenBank accession number.[8]
The aligned region of sequences AAB24881 and AAB24882 has 83.7% identity, which could indicate that the two sequences are homologous. The large gap in the alignment (positions 1-20 in AAB24882) indicates that a large insertion/deletion event occurred at some point during evolution.
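As a rough illustration of how a percent identity figure can be derived from a finished alignment, the sketch below counts the columns in which both sequences carry the same residue. The function name `percent_identity` and the decision to leave gapped columns out of the denominator are assumptions made for this example; alignment tools differ on that convention, so the numbers they report may not match.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the aligned columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip insertion/deletion columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: four of the five ungapped columns match.
identity = percent_identity("ACGT-A", "ACCTTA")
```

Here `identity` is 80.0, because the single gapped column is ignored and one of the remaining five columns is a mismatch.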
Pairwise alignment is a comparison of two sequences. There are two types of alignment: local and global. Local alignment is used for finding repeating regions within the same sequence or regions of similarity within dissimilar sequences. The purpose of global alignment is to produce the best match over the entire length of two relatively similar sequences. Dynamic programming techniques are used for pairwise alignment.
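To make the dynamic programming idea concrete, here is a minimal sketch of global alignment scoring in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `global_alignment_score` are assumptions chosen for illustration; they are not taken from any of the tools discussed in this paper.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch style).

    Only the score is returned; two rows of the DP table are kept at a time.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


score = global_alignment_score("ACGT", "AGT")
```

For `"ACGT"` against `"AGT"` the best global alignment has three matches and one gap, so `score` is 2. A local alignment would differ mainly in clamping each cell at zero and taking the maximum over the whole table.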
Multiple alignment is used to align more than two sequences that are hypothesized to be evolutionarily related. Some of the goals of a multiple alignment are to determine phylogenetic relationships and trace evolution, to determine conserved regions, and to determine the overall structure of a protein. ClustalW [9] is one of the popular tools used for multiple sequence alignment. ClustalW operates in three steps: it performs pairwise alignments, from which an identity matrix is created; it creates a guide tree (phylogenetic tree); and it builds a progressive alignment.
A quick Google search for genome visualization tools using the keywords "Genome Visualization" produced the following hits: the Argo Genome Browser, the Broad Institute's tool for "visualizing and manually annotating whole genomes" [5]; Circos, a tool that visualizes "intra- and inter-chromosomal relationships within one or more genomes, or between any two or more sets of objects with a corresponding distance scale" [6]; Alfresco, a visualization tool that "allows effective comparative genome sequence analysis" [7]; and GenomePixelizer, a tool that visualizes the relationships between duplicated genes in genome(s) [1].
The first three tools mentioned, Argo, Circos and Alfresco, come as standalone programs or Web Start applications and offer a variety of functions for effective genome analysis. GenomePixelizer comes only as a standalone program. It provides a lot of functionality and allows for effective interactivity; however, it has a quite complicated interface. It also requires special data pre-processing: it takes in 3 input files (a file containing setup information, a file containing pre-processed genome information and a distance matrix file) and requires the download and setup of the TCL/TK package. Large datasets need to be subdivided into smaller datasets and re-run through GenomePixelizer in order to see more detail.
While all four of the above-mentioned tools have their benefits and strengths, they all have static interfaces: all interactivity is implemented by means of links and pop-ups. The user is not able to drag clusters of genes away from chromosomes or quickly rearrange them by clicking and dragging.
The proposed tool, GenomePixelizer SVG-fied, is lightweight, dynamic and interactive. It takes in one XML file containing the setup information, genome information and distance matrix, and uses XML Style Sheets and SVG to plot genes over chromosomes and to identify duplicated genes. Furthermore, users can click on a specific gene and land on the NCBI entry for that gene. Since the graphics are scalable, users can also zoom in on the region of interest. In the future, this tool will allow for "dragging": users will be able to drag clusters of genes out in order to take a close-up look.
The prototype tool, GenomePixelizer, was designed to help in “visualizing the relationships between duplicated genes in genome(s) and to follow relationships between members of gene clusters” [10].
GenomePixelizer is a visualization tool that “generates custom images of genomes out of the given set of genes. Each element on the picture has a physical address defined by coordinates (pixels), hence the name ‘GenomePixelizer’” [10].
GenomePixelizer was specifically developed for the analysis of the “evolution of NBS-LRR encoding genes in Arabidopsis relative to other genome duplication events” [10].
The following paragraph lists the features and highlights of GenomePixelizer as described on the GenomePixelizer website (http://atgc.org/GenomePixelizer/):
Written in Tcl/Tk and works on any computer platform (Unix/Linux, Windows, Mac) that supports the Tcl/Tk toolkit.
GenomePixelizer does not need to be compiled; it works like Perl or Python scripts, using the Tcl/Tk language interpreter which can be downloaded for free at www.scriptics.com or tcl.activestate.com.
GenomePixelizer allows the display of desired features through the whole genome simultaneously. Generated images should fit into the user's computer monitor without scrolling. For larger genomes, it is possible to generate bigger images with a built-in scroll bar.
A simple and flexible input file may be set up, edited and modified using any spreadsheet editor (e.g. MS Excel or StarOffice). The researcher can easily manipulate the set of genes of interest, add new sets, change or remove old ones and re-run the program on the fly.
Zoom in functionality, cluster viewing, minimal modification in the input file and some simple re-calculations allow the viewing of regions of high gene density in greater detail.
Regions with high gene density can be drawn using automatic or manual correction. Manual correction may produce nicer images; however, with a large set of genes it takes time.
GenomePixelizer allows the viewing of relationships between different sets of genes based on a distance matrix file.
The source of sequences is not restricted to a single organism and it is possible to view relationships between different genomes.
GenomePixelizer can be used to generate images of genetic maps with a given set of genetic markers. Instead of megabases, the size of chromosomes should be indicated in centiMorgans.
Generated images can be captured by any screenshot program and incorporated into Web pages. You can also save the generated image as a PostScript file.
GenomePixelizer can generate HTML ImageMap tags. This feature can be used to create "clickable" images for Web pages or online presentations.
The source code is freely available and minimal code modifications can add new features to the program. [10]
Figure 3. Output produced by GenomePixelizer – Arabidopsis thaliana, segmental duplications of chr IV and chr V[9]
GenomePixelizer SVG-fied is written using XML, XSLT, and SVG technologies combined with JavaScript scripting. All of these technologies are browser-interpreted and do not require the download and installation of a language environment. Some browsers may require the download of an SVG plug-in.
Like GenomePixelizer, GenomePixelizer SVG-fied provides a zoomed-out view of the whole genome.
A single XML input file is required and can be easily manipulated.
Zoom-in functionality for GenomePixelizer SVG-fied is built into the browser and is activated by pressing the “Ctrl” and “+” keys simultaneously [See Figure 7]. No extra coding is required. In GenomePixelizer, zooming in is a semi-automated process in which the user has to specify the coordinates of the region to zoom in to.
GenomePixelizer SVG-fied does not allow for manual correction.
The XML input file contains the distance matrix information for viewing the relationships between different genes.
Like GenomePixelizer, GenomePixelizer SVG-fied allows viewing relationships between different genomes.
The genetic map, image export and HTML ImageMap features of GenomePixelizer do not exist in GenomePixelizer SVG-fied.
The source code is publicly available and can be easily modified to allow for new features.
The input to the program is a single XML file: Input.xml. The data is represented there in two parts:
Information about chromosomes: the chromosome id and size, and information about each gene located on this chromosome (gene name, location, Watson/Crick orientation and the color assigned to it).
<chromosome id="1" size="20">
  <gene color="orange">
    <gne_name>Gene_K</gne_name>
    <gne_location>6.2</gne_location>
    <gne_orientation>C</gne_orientation>
  </gene>
  <gene color="orange">
    <gne_name>Gene_W</gne_name>
    <gne_location>6.4</gne_location>
    <gne_orientation>C</gne_orientation>
  </gene>
  ...
</chromosome>
...
Distance matrix: containing the distance information between two genes.
<matrix>
  <row><gene_a>Gene_A</gene_a><gene_b>Gene_E</gene_b><dist>0.9857</dist></row>
  <row><gene_a>Gene_A</gene_a><gene_b>Gene_U</gene_b><dist>0.9286</dist></row>
  <row><gene_a>Gene_A</gene_a><gene_b>Gene_Y</gene_b><dist>0.8429</dist></row>
  ...
</matrix>
Currently, this input data is populated manually.
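The two-part layout of the input can be read with very little code. The sketch below is written in Python rather than the XSLT and JavaScript the tool itself uses, and is meant only to show how the chromosome records and the distance matrix sit side by side in one document. The root element name `genome`, the function name `parse_input` and the placement of the `color` attribute on the `gene` element are assumptions; the remaining element and attribute names follow the fragments shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of Input.xml; the <genome> root is an assumption.
SAMPLE = """<genome>
  <chromosome id="1" size="20">
    <gene color="orange">
      <gne_name>Gene_K</gne_name>
      <gne_location>6.2</gne_location>
      <gne_orientation>C</gne_orientation>
    </gene>
  </chromosome>
  <matrix>
    <row><gene_a>Gene_A</gene_a><gene_b>Gene_E</gene_b><dist>0.9857</dist></row>
  </matrix>
</genome>"""


def parse_input(xml_text):
    """Return (chromosomes, pairs) extracted from the XML text."""
    root = ET.fromstring(xml_text)
    chromosomes = []
    for chrom in root.iter("chromosome"):
        genes = [
            {
                "name": gene.findtext("gne_name"),
                "location": float(gene.findtext("gne_location")),
                "orientation": gene.findtext("gne_orientation"),
                "color": gene.get("color"),
            }
            for gene in chrom.iter("gene")
        ]
        chromosomes.append(
            {"id": chrom.get("id"), "size": float(chrom.get("size")), "genes": genes}
        )
    # Each matrix row is one gene pair with its distance ("similarity") value.
    pairs = [
        (row.findtext("gene_a"), row.findtext("gene_b"), float(row.findtext("dist")))
        for row in root.iter("row")
    ]
    return chromosomes, pairs


chromosomes, pairs = parse_input(SAMPLE)
```

Because both parts live under one root, a single parse yields everything the drawing step needs, which is the practical advantage of the single-file design.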
The single XML file that GenomePixelizer SVG-fied uses as input replaces the 3 input files that the original TCL/TK-based GenomePixelizer used: the Setup File, providing information about the widget's window size, number of chromosomes, size of chromosomes, cutoff values, etc.; the Input File, containing the chromosome number, gene name, the gene's location on the chromosome, orientation, and color; and the Distance Matrix File, containing pairs of genes and their distance ("similarity").
The resulting visual is an SVG graph that plots chromosomes, places genes over the chromosomes according to their specified locations, and draws lines connecting genes with a high “similarity” value.
The chromosomes are drawn according to their sizes in Mb (megabases). One grid interval represents 1 Mb. In the picture above, there are 3 chromosomes of sizes 20, 12 and 16 Mb. Genes are placed inside chromosomes according to their location. Genes’ opacity indicates Watson/Crick orientation: genes with Watson (forward) orientation are represented with solid colors, and genes with reverse orientation are represented with colors that are 40% opaque. Genes’ "similarity" is represented by means of lines and arcs: straight lines if genes are "similar" to genes on different chromosomes, and arcs if genes are "similar" to genes on the same chromosome. The similarity cutoff value (percent identity) is provided by the user.
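The two drawing rules described above, opacity for orientation and line versus arc for similarity links, can be sketched as follows. This is an illustration in Python, not the tool's XSLT templates. The function names, the orientation code `"W"` for Watson, the `bulge` parameter and the use of a quadratic Bezier curve for the arc are all assumptions made for the example.

```python
def gene_opacity(orientation: str) -> float:
    """Solid for Watson (forward) genes, 40% opaque for reverse-orientation genes.

    Assumption: "W" marks Watson; any other code is treated as reverse.
    """
    return 1.0 if orientation == "W" else 0.4


def link_svg(chrom_a, point_a, chrom_b, point_b, bulge=30.0):
    """Return an SVG element connecting two genes.

    Different chromosomes: a straight <line>.
    Same chromosome: an arc, drawn here as a quadratic Bezier <path>
    whose control point is lifted by the hypothetical `bulge` amount.
    """
    (x1, y1), (x2, y2) = point_a, point_b
    if chrom_a != chrom_b:
        return f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>'
    control_x = (x1 + x2) / 2
    control_y = min(y1, y2) - bulge
    return (f'<path d="M {x1} {y1} Q {control_x} {control_y} {x2} {y2}" '
            f'fill="none" stroke="black"/>')


between = link_svg("1", (10, 50), "2", (40, 150))  # straight line
within = link_svg("1", (10, 50), "1", (40, 50))    # arc
```

The arc is needed for same-chromosome pairs because a straight line between them would lie on top of the chromosome itself and be hard to see.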
Users can zoom into the area of interest by pressing the “Ctrl” and “+” keys simultaneously.
Once the user enters a percent identity and clicks the “Retrieve” button, the SVG representation of the XML data is displayed and a new browser window pops up displaying the identity matrix [See Figure 8]. Identity values greater than the user-specified percent identity are shown in black, and identity values less than the percent identity are grayed out.
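The black and gray rule for the identity matrix amounts to a single comparison per gene pair. The sketch below assumes that the cutoff and the matrix values are on the same scale (fractions here, as in the sample matrix rows) and that a value exactly equal to the cutoff is grayed out; the text does not specify either point, and the function name `classify_identities` is hypothetical.

```python
def classify_identities(pairs, cutoff):
    """Attach a display color to each (gene_a, gene_b, value) triple.

    Values strictly greater than the cutoff are shown in black;
    everything else is grayed out (equality handling is an assumption).
    """
    return [
        (gene_a, gene_b, value, "black" if value > cutoff else "gray")
        for gene_a, gene_b, value in pairs
    ]


rows = [
    ("Gene_A", "Gene_E", 0.9857),
    ("Gene_A", "Gene_U", 0.9286),
    ("Gene_A", "Gene_Y", 0.8429),
]
shown = classify_identities(rows, cutoff=0.9)
```

With a cutoff of 0.9, the first two pairs are marked black and the third is marked gray. The same filter can decide which connecting lines and arcs are drawn on the graph.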
The user can click on any identity value in the matrix; the color of the cell containing that value will turn red, and the line or arc connecting the 2 genes in the graph will be highlighted in red and become bolder [See Figure 9]. Once the user clicks on a line within the chromosome that represents a gene, the gene information is displayed. The user can click on the gene name in the popup and he or she will land on the NCBI entry for this particular gene [See Figure 10].
The graphical portion of GenomePixelizer SVG-fied is written using XPATH, XSLT, and SVG. Interactivity is provided through JavaScript methods. The code is contained in 4 files: parser.xsl*, drawingtools.xsl*, draw_matrix.xsl and loadxmldoc.js**.
parser.xsl - parses out information about chromosome and genes using XPATH queries and sends it to drawingtools.xsl. It also parses out information about distance matrix and sends it to draw_matrix.xsl.
drawingtools.xsl - contains XSL templates and SVG code for drawing the grid, chromosomes, and genes and for displaying synteny between genes.
draw_matrix.xsl - draws the gene distance matrix.
loadxmldoc.js - loads XML document into DOM structure.
* the layout of these files closely follows the dinosaurs’ bar graph example found at http://surguy.net/articles/client-side-svg.xml
** loadxmldoc.js is taken from http://www.w3schools.com/DOM/dom_loadxmldoc.asp
Currently, GenomePixelizer SVG-fied has a significantly different way of presenting visual information than its predecessor, GenomePixelizer [compare the graphical output of Figure 2 and Figure 5]. The author is still exploring efficient ways to represent the information. Representing reverse gene orientation by lowering the opacity of the color is not visually optimal.
Implementing the capability to rearrange objects on the canvas by dragging them would set this tool apart from currently available genome visualization applications.
[1] “GenomePixelizer - a visualization program for comparative genomics within and between species”. Copyright © 2007 February Bioinformatics. 18:335-336.
[2] “The Rice Kinase Database. A Phylogenomic Database for the Rice Kinome”. Copyright © 2007 Plant Physiology. 143(2): 579-586.
[5] Argo Genome Browser. BROAD Institute. http://www.broad.mit.edu/. 7 July 2008. http://www.broad.mit.edu/annotation/argo.
[6] Circos. Genome Sciences Centre. http://mkweb.bcgsc.ca/. May 22 2009. Martin Krzywinski. http://mkweb.bcgsc.ca/circos.
[7] Alfresco. Sanger Institute. http://www.sanger.ac.uk/. 2000. http://www.sanger.ac.uk/Software/Alfresco/.
[8] Homology (biology). Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Main_Page. 2009. http://en.wikipedia.org/wiki/Homology_%28biology%29.