Genome Annotation Transfer Utility (GATU) Documentation



Main Help Page

 

Introduction

Genome Annotation Transfer Utility (GATU) was designed to facilitate quick, efficient annotation of similar genomes using genomes that have already been annotated. For example, whenever a new strain of SARS coronavirus is sequenced, it is possible, using GATU, to automatically annotate the new strain using a previously-annotated strain of SARS CoV. This saves researchers from tedious manual annotation of these sequences.
The program uses the tBLASTn and BLASTn algorithms to map genes from the reference genome (the annotated strain) to the new sequence (the unannotated strain). The goal is to annotate the majority of the new genome’s genes in a single step. ORFs that are present in the target genome but absent from the reference genome are also identified; these ORFs can be analyzed further using BLAST, VGO and BBB, and then either accepted for or rejected from annotation. GATU can handle multiple-exon genes as well as mature peptides. Although it was designed for use with viral genomes, GATU can also be used to help annotate larger genomes (e.g. bacterial genomes).
The output is saved in GenBank, XML, or EMBL file format.

 

 

Overview

A general overview of the features and functions of the Genome Annotation Transfer Utility (GATU)
Genome Annotation Transfer Utility
Components
The Genome Annotation Transfer Utility (GATU) consists of two components:

  • an application server, which runs programs such as BLAST, NEEDLE and ClustalW and returns their results to the requesting application
  • a Java Swing based graphical user interface for the display of the genomes and their annotations

The GATU Application Server
The primary goal of the application server is to give client applications on the network access to standard applications installed on a powerful system or on a cluster of systems. The client (GATU) simply sends a request to run a specific program to the application server; the application server then runs the program with the input contained in the request and passes the program’s output back to the client. This allows client applications to run bioinformatics applications without requiring a local install of these applications. The figure below shows the flow of messages that occurs in order to fulfill a BLAST request. By default, GATU uses the application server running on leto.bioc.uvic.ca (142.104.33.4, TCP Port# 7777).
GATU Graphical User Interface (GUI)
The GUI of GATU consists of five units:

  • menubar providing access to preference settings, file saving formats, GATU statistics and help pages
  • genome selection field which allows you to select and load the two genomes [1] with which you wish to work: one reference genome and one genome you wish to annotate based on the reference genome you select
  • the annotations window, which displays the annotations of both genomes, BLAST results, NEEDLE results and ORFs
  • the genome map window providing a graphical view of the genome to be annotated and/or the reference genome and their annotations
  • the action buttons, which provide direct access to the annotation process (“Annotate”), to the supporting applications Viral Genome Organizer (“VGO”) and Base-By-Base (“BaseByBase”, an alignment editor), and to the function used to save the annotation results (“Save”)

The annotation transfer process is started by choosing two genomes: an annotated genome (the reference) and an un-annotated or partially annotated genome. You will need to click on the “Annotate” button once the genomes are chosen in order to begin annotation of the latter.
For each single-exon gene in the reference genome, GATU runs a tBLASTn search; for each gene with multiple exons, it runs a BLASTn search for each exon. Based on the BLAST alignments and the reference gene, GATU then infers a possible annotation on the un-annotated genome. All of the data used in this process are made available: BLAST alignments, NEEDLE alignments and reference genes are displayed for each suggested annotation.
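The choice of BLAST program per reference gene can be sketched as follows. This is only an illustration of the rule described above, not GATU's actual code; the `ReferenceGene` class and the `plan_blast_searches` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReferenceGene:
    """Hypothetical container for one annotated gene of the reference genome."""
    name: str
    exons: List[Tuple[int, int]]  # one (start, stop) pair per exon


def plan_blast_searches(gene: ReferenceGene) -> List[Tuple[str, str]]:
    """Return (program, query label) pairs for one reference gene.

    Single-exon genes get one tBLASTn search (protein query against the
    nucleotide sequence); multi-exon genes get one BLASTn search per exon.
    """
    if len(gene.exons) == 1:
        return [("tblastn", gene.name)]
    return [
        ("blastn", f"{gene.name}_exon{i}")
        for i in range(1, len(gene.exons) + 1)
    ]
```

For a gene with two exons this yields two BLASTn queries, one per exon, whereas a single-exon gene yields exactly one tBLASTn query.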
The basic annotation transfer algorithm is shown in the figure below.
picture
GATU analyzes all BLAST output for the genome that needs to be annotated. Using the best BLAST hit, GATU applies the start and stop positions of the corresponding reference gene to the un-annotated nucleotide sequence, extending the exon if necessary, until the nucleotide sequence contains a start and stop codon. The exon is not extended in the case of matches with mature peptides from the reference genome or exons from the reference genome that do not contain a start and/or stop codon. The resulting protein sequences are then aligned with NEEDLE. Based on the similarity score of this NEEDLE alignment, the potential annotation is either accepted or not accepted. GATU will also find all possible ORFs in the un-annotated genome and display any ORFs not already found in the previous step as “Unassigned-ORF-#”. Once the automated annotation process is complete, you can review and modify the suggested annotations if necessary. GATU can further facilitate this decision-making process by conducting a tBLASTn or BLASTn search on request; this can be done by right-clicking on the row corresponding to the Unassigned ORF of interest and clicking on the appropriate button in the window that pops up.
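The extension of a hit to a start and a stop codon can be illustrated with a small sketch. It assumes a forward-strand hit given as 0-based, half-open coordinates that are already in frame, and it assumes that the upstream search gives up when it meets a stop codon; neither detail is confirmed by this documentation. All function names are hypothetical. As described above, a caller would skip this step for mature peptides and for exons without a start and/or stop codon.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_start(seq, start):
    """Walk upstream in frame from `start` until a start codon is found.

    Returns the new start, or None if a stop codon or the sequence
    boundary is reached first (assumed behaviour).
    """
    seq = seq.upper()
    pos = start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == START_CODON:
            return pos
        if codon in STOP_CODONS:
            return None
        pos -= 3
    return None


def extend_to_stop(seq, end):
    """Walk downstream in frame from `end` until a stop codon is included.

    Returns the new end (exclusive), or None if the sequence runs out.
    """
    seq = seq.upper()
    if seq[end - 3:end] in STOP_CODONS:
        return end  # the hit already ends with a stop codon
    pos = end
    while pos + 3 <= len(seq):
        codon = seq[pos:pos + 3]
        pos += 3
        if codon in STOP_CODONS:
            return pos
    return None


def extend_hit(seq, start, end):
    """Return the extended (start, end) pair, or None if extension fails."""
    new_start = extend_to_start(seq, start)
    new_end = extend_to_stop(seq, end)
    if new_start is None or new_end is None:
        return None
    return new_start, new_end
```

With the sequence `CCATGAAACCCGGGTTTTAGCC` and a hit covering `CCCGGG` (positions 8 to 14), `extend_hit` returns `(2, 20)`, which spans `ATGAAACCCGGGTTTTAG`.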
Requirements
Please see System/Software Requirements for a detailed listing.
[1] If you wish to annotate more than one genome at a time (using one reference genome), please refer to the documentation for MGATU

 

 

Algorithm

Describes the basic algorithm used to annotate a genome
picture

  1. Read all genes/mature peptides of the reference genome and display them in the “Reference Genes” table.
  2. Conduct a tBLASTn/BLASTn alignment for every gene/mature peptide of the reference genome against the genome sequence to be annotated (tBLASTn is used for single exon genes and BLASTn for multiple exon genes).
  3. Use the highest scoring hit and make this a possible new gene/mature peptide. If the reference gene starts with a start codon, extend this hit to a start codon in the sequence to be annotated. If the reference gene ends with a stop codon, extend this hit to a stop codon in the sequence to be annotated. If the reference gene has no internal start/stop codon (atg/tag, tga, taa), verify that the hit has none either; if the hit does contain an internal start/stop codon, use the longest ORF.
  4. Run a NEEDLE alignment for each annotation found and mapped for the newly-annotated genome against the reference gene/mature peptide.
  5. Find all possible ORFs and display ORFs not found in step 3 in the “Unassigned-ORFs” table.
  6. Display all possible new genes/mature peptides found in step 3.
  7. Manual review and/or manual modifications.
  8. Apply annotations to genome and/or save annotations to file (GenBank, EMBL or XML (BSML) file).
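
Step 5 (finding all possible ORFs) can be illustrated with a simple six-frame scan. This is a minimal sketch, not GATU's implementation: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, reports 1-based inclusive coordinates on the forward strand, and uses a hypothetical `min_length` parameter in nucleotides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=30):
    """Scan all six reading frames and return (start, stop, strand) tuples.

    Coordinates are 1-based, inclusive, and always given relative to the
    forward strand, as in the Start/Stop/+/- table columns.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, n - 2, 3):
                codon = s[pos:pos + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = pos
                elif orf_start is not None and codon in STOP_CODONS:
                    end = pos + 3  # exclusive end, stop codon included
                    if end - orf_start >= min_length:
                        if strand == "+":
                            orfs.append((orf_start + 1, end, "+"))
                        else:
                            # map reverse-strand positions back to forward
                            orfs.append((n - end + 1, n - orf_start, "-"))
                    orf_start = None
    return sorted(orfs)
```

ORFs that coincide with annotations from step 3 would then be removed before the remainder is shown in the “Unassigned-ORFs” table.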

 

 

User Interface

 
Userinterface

 

The user interface consists of several components:

  1. The Menu Bar
    • This provides access to features such as saving annotations to a file, setting preferences etc.
  2. Genome Selection
    • To load either a reference genome or a genome to annotate, click on the “Upload Genome File” button. You will be prompted to enter the name and location of a GenBank file (GBxml or plain text format) for the reference genome and of a GenBank or FASTA file for the genome to be annotated.
    • Once both files are read, you may select the “Annotate” button to start the annotation process. A series of BLAST alignments (one for each gene in the reference genome) will be conducted.
  3. The New Annotations Tab
    • In the “Annotations table” all possible gene annotations for the genome to be annotated (excluding unassigned ORFs) are displayed. These genes are either homologs of genes in the reference genome or ORFs of length greater than the ORF length defined in the preferences.
    • The BLAST Alignments section displays the BLAST alignments for the selected gene annotation.
    • The NEEDLE Alignment section displays the NEEDLE alignment for the selected gene annotation.
  4. The ORFs & Reference Genes Tab
    • The Unassigned ORFs table displays the ORFs of the genome to be annotated (excluding the ORFs that have already been listed in the “Annotations table”).
    • The Reference Genes Table displays the genes of the reference genome.
  5. The Genome Map
    • This is a graphical display of the reference genome and its genes/mature peptides (top genome) and the genome to be annotated with the potential gene annotations (bottom genome).
  6. The GATU Buttons
    • These buttons provide you with access to the main functions of the application, such as “Annotate”, “VGO” (graphical view of the annotated genome), “BaseByBase” (an alignment editor) and “Save”.

 

 

Unassigned ORFs Table

Describes the display/view of the unassigned ORFs found in the genome to be annotated
UnassignedOrfs

 

Lists the unannotated ORFs found in the genome to be annotated. Clicking on any row in the table results in the display of the DNA and corresponding protein sequences of the selected ORF in the narrow window directly below the “Unassigned ORFs” window. Clicking on a column header will re-sort the list of unassigned ORFs based on the column header you select. The columns “ORF Name” and “Gene Type” are editable.
To run a BLAST search (either BLASTp or tBLASTn) on any ORF in the table, select the ORF and right-click anywhere in the selected row. A BLAST dialog box will pop up. You can now select the database to BLAST against as well as the output format. The protein sequence will be displayed in the sequence window of the BLAST dialog box.
The columns in the table are as follows:
ORF Name : the name/number of the unassigned ORF
Start : start position of the ORF
Stop : stop position of the ORF
+/- : strand location of ORF (“+” (5′-3′) or “-” (3′-5′))
Size : the size of the ORF (listed as number of basepairs)
GeneType : gene/fragment/mature peptide
Score : BLAST score from either the automatic BLAST run or the most recent manual BLAST run
% Identity : BLAST identities from either the automatic BLAST run or the most recent manual BLAST run
Accept : a box provided to allow you to select or de-select ORFs to either add or exclude them from the annotation. To include an ORF in the annotations, click on the corresponding box – a checkmark should appear in the box.
By default, GATU uses the application server running on leto.bioc.uvic.ca (142.104.33.4, TCP Port# 7777) to run the requested BLAST searches.

 

 

Annotations Table

Describes the table showing the annotations

Lists the gene annotations found for the genome to be annotated. Clicking on any row in the table results in the display of the DNA and corresponding protein sequences of the selected annotation in the narrow window directly below the “Annotations” window. Clicking on a column header will re-sort the list of annotations based on the column header you select. All columns are editable except for “Size”, “P.Size”, “Score” and “% Similarity”.
The columns in the table are as follows:
Gene Name : the number of the gene to be annotated
Product : the product property from the CDS tag in the GenBank file
Exon# : the exon number (i.e. 1st, 2nd, etc.)
Start : start position of the gene
Stop : stop position of the gene
+/- : strand location of gene (“+” (5′-3′) or “-” (3′-5′))
Size : size of the gene (listed as number of basepairs)
P.Size : size of parent gene (gene in reference genome) in basepairs
GeneType : gene/fragment/mature peptide
Score : BLAST score from either the automatic BLAST run or the most recent manual BLAST run
% Similarity : BLAST similarity from either the automatic BLAST run or the most recent manual BLAST run
Accept : a box provided to allow you to select or de-select ORFs to either add or exclude them from the annotation. To include an ORF in the annotations, click on the corresponding box – a checkmark should appear in the box.
Selecting a row in this table will highlight the corresponding gene in the “Genome Map” window.
Clicking “BLAST Alignment(s)” will display the BLAST alignments for the selected gene’s protein as shown in the window below. In the alignment output, the top sequence (query) is the protein sequence that corresponds to the gene in the reference genome. The bottom sequence (subject) is that of the genome to be annotated.
BLAST Alignment
Clicking “NEEDLE Alignment” will display the NEEDLE alignment for the selected gene’s protein and its homolog as shown in the window below. In the alignment output, the top sequence is the protein sequence that corresponds to the gene in the reference genome. The bottom sequence is that of the genome to be annotated.
NEEDLE Alignment

 

 

Reference Genes Table

Describes the table showing the reference genes
picture
Lists the genes of the reference genome. Clicking on any row in the table results in the display of the DNA and corresponding protein sequences of the reference gene in the narrow window directly below the “Reference Genes” window; the gene will also be highlighted in the “Genome Map” window. Clicking on a column header will re-sort the list of reference genes based on the column header you select. None of these columns are editable.
To run a BLAST search (either BLASTp or tBLASTn) on any reference gene in the table, select the gene and right-click anywhere in the selected row. A BLAST dialog box will pop up. You can now select the database to BLAST against as well as the output format. The protein sequence will be displayed in the sequence window of the BLAST dialog box.
The columns in the table are:
Gene Name: the number of the gene in the reference genome
Start: start position of the gene
Stop: stop position of the gene
+/-: strand location of the gene (“+” (5′-3′) or “-” (3′-5′))
Size: the size of the gene (listed as number of basepairs)
GeneType: gene/fragment/mature peptide
Family/Product: gene function (the product property from the CDS tag in the GenBank file)
By default GATU uses the application server running on leto.bioc.uvic.ca (142.104.33.4, TCP Port# 7777) to run the requested blasts.

 

 

The Genome Map

Describes the graphical view of the genomes and their annotations
picture
The uppermost genome map is that belonging to the reference genome and its annotations as displayed in the Reference Genes Table. The bottom genome map shows the genome to be annotated along with all accepted annotations from either the “Annotations table” or the “Unassigned-ORFs Table”.
Clicking on any gene shown in the map will display the name of the gene in the text box directly below the “Genome Map” window. The slider directly across from the text box provides a zoom function; you can increase the size of the image to a maximum of 20 times the original size. The reset button will reset the image to its original size and deselect all genes. The jump button will skip ahead in the genome map to the view of the genes selected in the “Annotations table”.
For a description of the preferences used for the “Genome Map”, refer to the “GATU Genome Map Parameters” section of Preferences.
If you would like to view all ORFs listed in the “Unassigned ORFs table”, you may do so in the preference settings; the unassigned ORFs will then be displayed as yellow bars in the genome map as shown below.
picture

 

 

 

GATU Main Menu Bar

Features offered in the GATU main menu bar
picture

 

The File Menu

picture
picture
The Preferences Menu allows you to define parameters for BLAST and NEEDLE and other annotation features within GATU; you can also set display preferences here.

  • By selecting “BLAST/NEEDLE preferences (manual process only)”, you can select the BLAST parameters (e.g. expect value, matrix, word size) for BLASTp, BLASTn, tBLASTn, psiBLAST and BLASTx. If you select the arrow at the top of this “Application Preferences” window, you will be able to select “NEEDLE” from the drop-down menu; it is here that you can set the preferences for NEEDLE alignments (e.g. matrix, alignment format, gap penalty value).
  • To set the annotation preferences, select “Annotation Preferences (automatic process)”. You will be able to set preferences for the reference genes, unassigned ORFs, GATU genome map and GATU BLAST. These preferences include setting the gene location (top/bottom strand), setting the maximum % overlap for an ORF vs. an annotation, setting a minimum ORF length for possible annotations without a homolog, etc.
  • In the “Display Preferences” window, you can set the window size and location of the menu bar.

See GATU Preferences for a more in-depth description of the preference settings.

  • “Save as EMBL…” saves the annotations you selected for the genome to be annotated as an EMBL file
  • “Save as XML (BSML)…” saves the annotations you selected for the genome to be annotated as an XML file
  • “Save as GenBank…” saves the annotations you selected for the genome to be annotated as a GenBank file
  • “Close” exits the Genome Annotator window

 

The Edit Menu

picture

  • “Unselect” deselects highlighted rows in any table and deselects selections in the genome maps.
  • “New Annotation” appends a new row to the “Annotations” table. This allows for the manual definition of an annotation not found by the methods used in this application.
  • “Find…” prompts you for search text and searches a BLAST alignment for it. Before selecting this feature, you must select a row in the “Annotations” window.
  • “Find next” repeats the search described in “Find…” above using the last search text entered.

The Statistics Menu

  • Provides statistics on the genomes and annotation transfers
  • output:
    stats
 

The Help Menu

picture

  • “About” summarizes the purpose of the Genome Annotator
  • “Overview” displays an overview of the basic algorithm implemented
  • “User Interface” introduces the components of the main window
  • “Annotation Table” describes the view of the annotations main window
  • “Unassigned-Orfs Table” describes the view of the unassigned ORFs main window
  • “Reference Genes Table” describes the view of the reference genes main window
  • “Genome Map” provides an overview of the features of the genome map
  • “Buttons” explains the functions of each of the buttons
  • “Menu Options” describes the GATU main menu and will direct you to this page
  • “Preferences” details the preference settings within GATU
  • “Tutorial” provides a step-by-step tutorial for the use of GATU
  • “References” lists references used

 

 
 

The GATU Buttons

The functions of the genome annotator

 

The Main Buttons of GATU

picture
“Annotate” will run the GATU BLAST to generate the list of genes to be annotated using the reference genome.
“VGO” starts the Viral Genome Organizer program with the selected genes from the table.
“BaseByBase” opens the Base-By-Base program and conducts an alignment using the alignment program you select in the window that subsequently pops up. The alignment is based on the first gene you select and the corresponding gene of the reference genome.
“Save” exports the accepted annotations to a file. The file format depends on the selection you make: either GenBank, EMBL or XML (BSML).
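To make the saved output more concrete, the sketch below formats one accepted annotation as a CDS entry in the standard GenBank feature table layout (feature key in column 6, location and qualifiers from column 22). It is an illustration only; the exact set of qualifiers GATU writes is not documented here, and the function names are hypothetical.

```python
def genbank_location(exons, strand):
    """Build a GenBank location string from 1-based inclusive exon spans.

    Single exons become "start..stop"; several exons are wrapped in
    join(...); annotations on the "-" strand are wrapped in complement(...).
    """
    spans = [f"{start}..{stop}" for start, stop in sorted(exons)]
    location = spans[0] if len(spans) == 1 else "join(" + ",".join(spans) + ")"
    if strand == "-":
        return f"complement({location})"
    return location


def genbank_cds_feature(exons, strand, product):
    """Return the feature-table lines for one CDS with a /product qualifier."""
    key_line = " " * 5 + "CDS".ljust(16) + genbank_location(exons, strand)
    qualifier_line = " " * 21 + f'/product="{product}"'
    return key_line + "\n" + qualifier_line
```

For example, `genbank_location([(100, 200), (10, 50)], "-")` gives `complement(join(10..50,100..200))`.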
 

The Main Tabs of GATU

 
New Annotations Tab
Within this tab you have the option to view all annotations found for the genome to be annotated and access the supporting BLAST and NEEDLE results
“Annotations” displays the list of annotations found for the genome to be annotated including the corresponding start and stop positions, strand locations, gene sizes, gene names, BLAST scores etc. in the Annotations table
“Blast Alignment(s)” displays the BLAST result for the selected gene in the “Annotations” table
“Needle Alignment” displays the NEEDLE alignment for the selected gene in the “Annotations” table and its homolog
 
ORFs & Reference Genes Tab
By selecting this tab, you may view the unassigned ORFs of the genome to be annotated and a list of all genes of the reference genome. By right-clicking on a reference gene or unassigned ORF, you can manually conduct a BLAST search for that gene or ORF of interest.
“Unassigned-ORFs” displays, in the Unassigned ORFs table, all open reading frames from the six-frame translation of the genome to be annotated that are not listed in the “Annotations” table (depending on the overlap settings you have defined in GATU preferences)
“Reference Genes” displays all genes and mature peptides of the reference genome in the Reference Genes table

 

GATU Preferences

Preference settings

GATU Annotation Preferences
Defines three groups of preferences:

  1. GATU Parameters
  2. GATU Genome Map Parameters
  3. GATU BLAST Parameters

The GATU Parameters

  • define the NEEDLE % similarity threshold required for an alignment to be accepted
  • select the strand location of genes, independent of the strand on which the reference gene was found
  • define the minimum ORF nucleotide length for display in the “Unassigned ORFs” table
  • define the maximum overlap an ORF can have with an existing annotation to be considered as unique and therefore displayed in the “Unassigned ORFs” table
  • define the BLAST database (“BLAST DB Selection“) to use to conduct a BLAST alignment for unassigned ORFs
  • define if BLAST alignments for unassigned ORFs are to be run at the same time as the annotation process (“automatic“) or if the user will run BLAST alignments for unassigned ORFs as needed (“manual“)
  • set output format as HTML or Text
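
The interplay of the minimum ORF length and the maximum overlap setting can be sketched as a simple filter. This is an assumption-laden illustration: how GATU measures overlap is not stated here, so the sketch takes it as the percentage of the ORF's own length covered by an annotation, and the default values are placeholders rather than GATU's defaults.

```python
def overlap_fraction(orf, annotation):
    """Fraction of the ORF covered by the annotation.

    Both arguments are (start, stop) pairs, 1-based and inclusive.
    """
    orf_start, orf_stop = orf
    ann_start, ann_stop = annotation
    shared = min(orf_stop, ann_stop) - max(orf_start, ann_start) + 1
    return max(shared, 0) / (orf_stop - orf_start + 1)


def unassigned_orfs(orfs, annotations, min_length=150, max_overlap_pct=25.0):
    """Keep ORFs that are long enough and sufficiently distinct.

    An ORF is dropped if it is shorter than `min_length` nucleotides or if
    any existing annotation covers more than `max_overlap_pct` percent of it.
    """
    kept = []
    for orf in orfs:
        length = orf[1] - orf[0] + 1
        if length < min_length:
            continue
        worst = max(
            (overlap_fraction(orf, ann) for ann in annotations),
            default=0.0,
        )
        if worst * 100 <= max_overlap_pct:
            kept.append(orf)
    return kept
```

Raising the maximum overlap lets more ORFs through to the “Unassigned ORFs” table; raising the minimum length removes short ORFs from it.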

The GATU Genome Map Parameters

  • select to include or exclude ORFs in the annotation genome map (displayed as yellow bars)
  • choose whether the plot of the reference genome map should be shown or hidden
  • set the maximum number of overlapping genes/annotations that a genome may have and that will be displayed in the Genome Map window; if this number is too small, genes/annotations are drawn on top of each other.
  • increase/decrease the height of the area in which the genome map is shown by modifying the number for “Height of genome map display?”. Increasing the height will make the arrows for genes/annotations thicker, whereas decreasing the height will make the arrows for genes/annotations thinner.

The GATU BLAST Parameters
These BLAST parameters apply only to BLAST runs used to find and apply potential annotations to the genome to be annotated.

  • low complexity is always turned off
  • define the word sizes for BLASTp/tBLASTn and BLASTn
  • enter the number of descriptions to consider
  • enter the number of alignments to consider
  • select the drop-off value for BLAST alignments
  • define the maximum expect value for a BLAST hit to be considered

 

BLAST Preferences

GATU BLAST Application Preferences
This section makes it possible to define the parameters applied when running BLAST. Parameters that may be changed include word size, expect value, matrix, gap cost, drop-off value, maximum number of descriptions etc. If you would like to use your own BLAST executable and a matrix that is not listed, select “Run job locally on your machine” and enter the paths in the corresponding fields below.

 

NEEDLE Preferences

GATU NEEDLE Application Preferences
Here you may define the parameters applied to NEEDLE. You may change the matrix used, the penalty scores for Gaps (Open/Extend) and alignment format. If you would like to use your own NEEDLE executable and a matrix that is not listed, select “Run job locally on your machine” and enter the paths in the corresponding fields below.

 

Display Preferences

GATU Display Preferences
You may use these preferences to define the location and size of the GATU main window as well as the start position of the genome map (“divider”).

 

 

Blast Dialog

Blasting an ORF sequence or a reference gene sequence.
picture
This window will pop up when you right click on either an unassigned ORF or a reference gene name. The dialog window shows the protein sequence of the ORF/reference gene you selected and allows you to choose a database to BLAST against. You can also select the format of the BLAST output.
Once you have chosen your preferences, click on either “Run BLASTp” or “Run tBLASTn” to start the BLAST run. The output will then be displayed in a separate window.
By default GATU uses the application server running on leto.bioc.uvic.ca (142.104.33.4, TCP Port# 7777) to run the requested blasts.

 

 

System/Software Requirements

Lists the requirements for the workbench applications.
General System Requirements

Software Notes

  • Macs

Requires OS X; Java and Java Web Start are included in this OS.
To compile our Java code you’ll need JDK1.5.0 (or higher). On OS X 10.4.n, link /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK to /System/Library/Frameworks/JavaVM.framework/Versions/1.5.0
Check with java -version to ensure you are using 1.5.n.

  • LINUX

Java Web Start assumes that you have a Netscape browser installed on Linux, and that it can be started using the command “netscape”. If this is not the case, then you need to change the start command which is defined in the preferences. Start the javaws application, select File|Preferences, and enter your browser start command – e.g. /usr/bin/mozilla

  • WINDOWS

Java Web Start assumes that Internet Explorer is the default browser.
Network Requirements
All our applications use the application server on leto.bioc.uvic.ca (IP 142.104.33.4, TCP Port# 7798) to execute non-Java programs like clustalw, blast, needle, muscle, etc.
All of our ReHAB databases which are located on Ignis.bioc.uvic.ca (142.104.33.37) are accessed through our ReHAB server which is located on demeter.bioc.uvic.ca (IP 142.104.33.27, TCP Port# 14103).
All of our JIPS databases which are located on demeter.bioc.uvic.ca (142.104.33.110) are accessed through our JIPS server which is located on demeter.bioc.uvic.ca (IP 142.104.33.110, TCP Port# 14104).

Server Name                   IP Address      TCP Port#
Application Server            142.104.33.4    7798
Application Server Brunetti   142.104.33.4    7799
Application Server Dixon      142.104.33.4    7802
Application Server Cup        142.104.33.4    7803
ReHAB Server                  142.104.33.27   14103
JIPS Server                   142.104.33.27   14104

To access our databases which reside on 4virology.net (142.104.33.133) we use the following ports:

Virus Family       TCP Port#
Adenoviridae       4440
Arenaviridae       4500
Asfarviridae       4440
Baculoviridae      4440
Bunyaviridae       4500
Coronaviridae      4440
Filoviridae        4500
Flaviviridae       4500
Herpesviridae      4440
Iridoviridae       4440
Paramyxoviridae    4500
Poxviridae         4500
Togaviridae        4500

Iridoviridae Brunetti     TCP Port#
Brunetti                  4505

Asfarviridae Dixon        TCP Port#
Dixon                     4506

Phycodnaviridae Cup       TCP Port#
Cup                       4507

Please ensure that your firewall allows access to these ports on the systems listed above.
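A quick way to check whether a firewall blocks these ports is to attempt a plain TCP connection to each one. The sketch below uses only Python's standard `socket` module; the endpoint list is copied from the tables above and is not exhaustive, and the function names are hypothetical. A successful connection shows only that the port is reachable, not that the service behind it is working.

```python
import socket

# (host, port) pairs taken from the tables above; extend as needed.
ENDPOINTS = [
    ("142.104.33.4", 7798),     # Application Server
    ("142.104.33.27", 14103),   # ReHAB Server
    ("142.104.33.27", 14104),   # JIPS Server
    ("142.104.33.133", 4440),   # virus family databases
    ("142.104.33.133", 4500),   # virus family databases
]


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Return a dict mapping each (host, port) pair to its reachability."""
    return {
        (host, port): port_is_reachable(host, port, timeout)
        for host, port in endpoints
    }
```

Calling `check_endpoints(ENDPOINTS)` from the machine that will run GATU returns one True/False entry per port, which can be passed on to a network administrator if any entry is False.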

 

 

Genome Selection

Selecting the reference genome and the genome to be annotated
GATU can be started either as a stand-alone application or as a tool from within VOCs Database Administration (VOCs DB Admin). The only difference is that the stand-alone application reads all input data from files (GenBank or FASTA), whereas the tool within VOCs DB Admin reads the input data from a database.

 

Genome Selection in the Stand-alone Version

picture
To load either a reference genome or a genome to annotate, click on the appropriate “Upload Genome File” button. If you select the button corresponding to the reference genome, you will be prompted to enter the name and location of the GenBank file you wish to use (GBxml or plain text format). Once the file is read, the genes defined in the GenBank file are shown in the Reference Genes Table.
If you select the button corresponding to the genome to be annotated, you will be prompted for the GenBank (GBxml or plain text format) or FASTA file for the genome to be annotated. Once both files are read, the annotator can run a series of BLAST alignments (one for each gene in the reference genome).
To start the annotation process click the “Annotate” button.
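For readers preparing input for the genome to be annotated, the sketch below shows what a FASTA file contains by parsing one into a dictionary. It is not GATU's file reader; `parse_fasta` is a hypothetical helper that takes the text of a file, uses the first word of each header line as the record name, and joins the sequence lines that follow.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary.

    Header lines start with ">"; the first whitespace-separated word of the
    header is used as the record name. Sequence lines are concatenated and
    upper-cased. Lines before the first header are ignored.
    """
    parts = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {record: "".join(chunks).upper() for record, chunks in parts.items()}
```

A file holding a single unannotated genome therefore yields one entry whose value is the full nucleotide sequence.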

 

Genome Selection Using the VOCs DB Admin Version

picture
To load either a reference genome or a genome to annotate, select a genome from the drop-down menu. Once a selection for both genomes has been made, the annotator can run a series of BLAST alignments (one for each gene in the reference genome).
To start the annotation process click the “Annotate” button.