GATU how-to documentation


Introduces the user interface of the Genome Annotation Transfer Utility
How to select the reference genome and the genome to be annotated in:
         The Stand-alone Version
         The VOCs DB Admin Version
Describes the table showing the annotations
Describes the display/view of the unassigned ORFs found in the genome to be annotated
Describes the table showing the reference genes
Describes the graphical view of the genomes and their annotations
The functions of the genome annotator
Describes the basic algorithm used to annotate a genome
Describes how to submit an annotated genome to GenBank

User Interface

Introduces the user interface of the Genome Annotation Transfer Utility.
picture

Menu Bar

The File Menu

picture
picture

  • “Preferences” allows you to define parameters for BLAST, NEEDLE and other annotation features within GATU; you can also set display preferences here.
    • By selecting “BLAST/NEEDLE preferences (manual process only)”, you can set the BLAST parameters (e.g. expect value, scoring matrix, word size) for BLASTp, BLASTn, tBLASTn, psiBLAST and BLASTx. Selecting the arrow at the top of this “Application Preferences” window opens a drop-down menu from which you can choose “NEEDLE”; here you can set the preferences for NEEDLE alignments (e.g. matrix used, alignment format, gap penalty value).
    • To set the annotation preferences, select “Annotation Preferences (automatic process)”. You will be able to set preferences for the reference genes, unassigned ORFs, GATU genome map and GATU BLAST. These preferences include setting the gene location (top/bottom strand), setting the maximum % overlap for an ORF vs. an annotation, setting a minimum ORF length for possible annotations without a homolog, etc.
    • In the “Display Preferences” window, you can set the window size and the location of the menu bar.
  • “Save as EMBL…” saves the annotations you selected for the genome to be annotated as an EMBL file
  • “Save as XML (BSML)…” saves the annotations you selected for the genome to be annotated as an XML file
  • “Save as GenBank…” saves the annotations you selected for the genome to be annotated as a GenBank file
  • “Close” exits the Genome Annotator window
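
The annotation preferences above mention a maximum % overlap for an ORF vs. an annotation and a minimum ORF length. The following Python sketch shows one way such thresholds could be interpreted. It is not GATU's implementation: the function names, the definition of overlap as a percentage of the ORF's own length, and the default values (25% and 150 bp) are all assumptions made for illustration.

```python
def percent_overlap(orf, annotation):
    """Percentage of the ORF's length that is covered by the annotation.

    Both arguments are (start, stop) pairs in 1-based inclusive
    coordinates; the order of start and stop does not matter.
    """
    o_lo, o_hi = sorted(orf)
    a_lo, a_hi = sorted(annotation)
    shared = min(o_hi, a_hi) - max(o_lo, a_lo) + 1
    if shared <= 0:
        return 0.0
    return 100.0 * shared / (o_hi - o_lo + 1)


def keep_orf(orf, annotations, max_overlap_pct=25.0, min_length=150):
    """Hypothetical filter: keep an ORF only if it is long enough and
    overlaps no existing annotation by more than the allowed percentage."""
    lo, hi = sorted(orf)
    if hi - lo + 1 < min_length:
        return False
    return all(percent_overlap(orf, a) <= max_overlap_pct for a in annotations)
```

With these assumed defaults, a 300 bp ORF that shares 50 bp with an annotation (about 17% of its length) would be kept, while a 100 bp ORF would be rejected for being too short.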
The Edit Menu

picture

  • “Unselect” deselects highlighted rows in any table and clears selections in the genome maps.
  • “New Annotation” appends a new row to the “Annotations” table. This allows for the manual definition of an annotation not found by the methods used in this application.
  • “Find…” prompts you for search text and searches a BLAST alignment for it. Before selecting this feature you must select a row from the “Annotations” window.
  • “Find next” repeats the search described in “Find…” above using the last search text entered
The Help Menu

picture

  • “About” summarizes the purpose of the Genome Annotator
  • “Overview” displays an overview of the basic algorithm implemented
  • “User Interface” introduces the components of the main window
  • “Annotation Table” describes the view of the annotations main window
  • “Unassigned-Orfs Table” describes the view of the unassigned ORFs main window
  • “Reference Genes Table” describes the view of the reference genes main window
  • “Genome Map” provides an overview of the features of the genome map
  • “Buttons” explains the functions of each of the buttons
  • “Menu Options” describes the GATU main menu and will direct you to this page
  • “Preferences” details the preference settings within GATU
  • “Tutorial” provides a step-by-step tutorial for the use of GATU
  • “References” lists references used

Genome Selection

GATU can be started either as a stand-alone application or as a tool from within VOCs Database Administration (VOCs DB Admin). The only difference is that the stand-alone application reads all input data from files (GenBank or FASTA), whereas the tool within VOCs DB Admin reads the input data from a database.

Genome Selection in the Stand-alone Version

picture
To load either a reference genome or a genome to annotate, click on the appropriate “Upload Genome File” button. If you select the button corresponding to the reference genome, you will be prompted to enter the name and location of the GenBank file you wish to use (GBxml or plain text format). Important: make sure the selected file name does not contain any spaces. Once the file is read, the genes defined in the GenBank file are shown in the Reference Genes Table.
If you select the button corresponding to the genome to be annotated, you will be prompted for the GenBank (GBxml or plain text format) or FASTA file for the genome to be annotated. Once both files are read, the annotator can run a series of BLAST alignments (one for each gene in the reference genome).
To start the annotation process click the “Annotate” button.
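
Before uploading, it can help to check a file against the requirements described above. The sketch below is not part of GATU; `check_genome_file` is a hypothetical helper that rejects file names containing spaces and guesses the format from the first non-blank line. It relies on two format facts (FASTA records begin with ">", GenBank flat files begin with "LOCUS") and on the assumption that any file starting with "<" is GBxml.

```python
import os


def check_genome_file(path):
    """Return 'fasta', 'genbank' or 'gbxml' for a sequence file.

    Raises ValueError if the file name contains spaces or the content
    is not recognized. Only the first non-blank line is inspected.
    """
    name = os.path.basename(path)
    if " " in name:
        raise ValueError(f"file name contains spaces: {name!r}")
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # skip leading blank lines
            if text.startswith(">"):
                return "fasta"
            if text.startswith("LOCUS"):
                return "genbank"
            if text.startswith("<"):
                return "gbxml"  # assumption: XML input is GBxml
            break
    raise ValueError(f"unrecognized sequence file: {name!r}")
```

A reference genome should come back as 'genbank' or 'gbxml'; a genome to be annotated may also be 'fasta'.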

Genome Selection Using the VOCs DB Admin Version

picture
To load either a reference genome or a genome to annotate, select a genome from the drop-down menu. Once a selection has been made for both genomes, the annotator can run a series of BLAST alignments (one for each gene in the reference genome).
To start the annotation process click the “Annotate” button.

The New Annotations Tab

In the “Annotations table” all possible gene annotations for the genome to be annotated (excluding unassigned ORFs) are displayed. These genes are either homologs of genes in the reference genome or ORFs of length greater than the ORF length defined in the preferences.

Annotations


Lists the gene annotations found for the genome to be annotated. Clicking on any row in the table displays the DNA and corresponding protein sequences of the selected annotation in the narrow window directly below the “Annotations” window, and highlights the corresponding gene in the “Genome Map” window. Clicking on a column header re-sorts the list of annotations by that column. All columns are editable except for “Size”, “P.Size”, “Score” and “% Similarity”.
The columns in the table are as follows:
Gene Name : the number of the gene to be annotated
Product : the product property from the CDS tag in the GenBank file
Exon# : the exon number (i.e. 1st, 2nd, etc. exon)
Start : start position of the gene
Stop : stop position of the gene
+/- : strand location of gene (“+” (5′-3′) or “-” (3′-5′))
Size : size of the gene (listed as number of basepairs)
P.Size : size of parent gene (gene in reference genome) in basepairs
GeneType : gene/fragment/mature peptide
Score : BLAST score from either the automatic BLAST run or the most recent manual BLAST run
% Similarity : BLAST similarity from either the automatic BLAST run or the most recent manual BLAST run
Accept : a box provided to allow you to select or de-select ORFs to either add or exclude them from the annotation. To include an ORF in the annotations, click on the corresponding box – a checkmark should appear in the box.

BLAST Alignments

Clicking “BLAST Alignment(s)” will display the BLAST alignments for the selected gene’s protein as shown in the window below. In the alignment output, the top sequence (query) is the protein sequence of the corresponding gene in the reference genome. The bottom sequence (subject) is that of the genome to be annotated.
BLAST Alignment

NEEDLE Alignment

Clicking “NEEDLE Alignment” will display the NEEDLE alignment for the selected gene’s protein and its homolog as shown in the window below. In the alignment output, the top sequence is the protein sequence of the corresponding gene in the reference genome. The bottom sequence is that of the genome to be annotated.

NEEDLE Alignment

Unassigned ORFs table

The Unassigned ORFs table displays the ORFs of the genome to be annotated (excluding the ORFs that have already been listed in the “Annotations table”).

Lists the unannotated ORFs found in the genome to be annotated. Clicking on any row in the table results in the display of the DNA and corresponding protein sequences of the selected ORF in the narrow window directly below the “Unassigned ORFs” window. Clicking on a column header will re-sort the list of unassigned ORFs based on the column header you select. The columns “ORF Name” and “Gene Type” are editable.
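
To illustrate what finding the ORFs of a genome involves, here is a minimal six-frame ORF scan in Python. It is a sketch under stated assumptions, not GATU's code: an ORF is taken to run from the first ATG after a stop codon to the next in-frame stop codon (stop included), coordinates are reported 1-based on the top strand with start less than stop, and the default minimum length of 150 bp is arbitrary. The names `find_orfs` and `reverse_complement` are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(sequence, min_length=150):
    """Return sorted (start, stop, strand) tuples for ATG...stop ORFs
    found in all three frames of both strands."""
    seq = sequence.upper()
    n = len(seq)
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # index of the first ATG since the last stop
            for i in range(frame, n - 2, 3):
                codon = strand_seq[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i
                elif orf_start is not None and codon in STOP_CODONS:
                    end = i + 3  # exclusive end; stop codon is included
                    if end - orf_start >= min_length:
                        if strand == "+":
                            orfs.append((orf_start + 1, end, "+"))
                        else:
                            # map back to top-strand coordinates
                            orfs.append((n - end + 1, n - orf_start, "-"))
                    orf_start = None
    return sorted(orfs)
```

In GATU's terms, the ORFs returned by such a scan that do not coincide with an entry in the “Annotations table” would be the ones listed as unassigned.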

To run a BLAST search (either BLASTp or tBLASTn) on any ORF in the table, select the ORF and right-click anywhere in the selected row. A BLAST dialog box will pop up, in which you can select the database to BLAST against as well as the output format. The protein sequence is displayed in the sequence window of the BLAST dialog box.

The columns in the table are as follows:

ORF Name : the name/number of the unassigned ORF
Start : start position of the ORF
Stop : stop position of the ORF
+/- : strand location of ORF (“+” (5′-3′) or “-” (3′-5′))
Size : the size of the ORF (listed as number of basepairs)
GeneType : gene/fragment/mature peptide
Score : BLAST score from either the automatic BLAST run or the most recent manual BLAST run
% Identity : BLAST identities from either the automatic BLAST run or the most recent manual BLAST run
Accept : a box provided to allow you to select or de-select ORFs to either add or exclude them from the annotation. To include an ORF in the annotations, click on the corresponding box – a checkmark should appear in the box.

By default, GATU uses the application server running on leto.bioc.uvic.ca (142.104.33.4, TCP Port# 7777) to run the requested BLAST searches.
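
If BLAST searches fail, one basic check is whether the server's TCP port can be reached from your machine. The Python sketch below only attempts to open a TCP connection; it does not speak whatever protocol GATU uses with the application server, and the helper name `can_connect` is an assumption made for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and DNS failures
        return False

# Example call with the host and port given in this documentation:
# can_connect("leto.bioc.uvic.ca", 7777)
```

A result of False may point to a firewall, a network problem or a server that is no longer available, but it cannot tell you which.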

Reference Genes Table

The Reference Genes Table displays the genes of the reference genome.
picture
Lists the genes of the reference genome. Clicking on any row in the table displays the DNA and corresponding protein sequences of the reference gene in the narrow window directly below the “Reference Genes” window; the gene will also be highlighted in the “Genome Map” window. Clicking on a column header re-sorts the list of reference genes by that column. None of these columns are editable.

To run a BLAST search (either BLASTp or tBLASTn) on any reference gene in the table, select the gene and right-click anywhere in the selected row. A BLAST dialog box will pop up, in which you can select the database to BLAST against as well as the output format. The protein sequence is displayed in the sequence window of the BLAST dialog box.

The columns in the table are:

Gene Name: the number of the gene in the reference genome
Start: start position of the gene
Stop: stop position of the gene
+/-: strand location of the gene (“+” (5′-3′) or “-” (3′-5′))
Size: the size of the gene (listed as number of basepairs)
GeneType: gene/fragment/mature peptide
Family/Product: gene function (the product property from the CDS tag in the GenBank file)

By default, GATU uses the application server running on leto.bioc.uvic.ca (142.104.33.4, TCP Port# 7777) to run the requested BLAST searches.

The Genome Map

This is a graphical display of the reference genome and its genes/mature peptides (top genome) and the genome to be annotated with the potential gene annotations (bottom genome).

picture
The uppermost genome map is that belonging to the reference genome and its annotations as displayed in the Reference Genes Table. The bottom genome map shows the genome to be annotated along with all accepted annotations from either the “Annotations table” or the “Unassigned-ORFs Table”.
Clicking on any gene shown in the map will display the name of the gene in the text box directly below the “Genome Map” window. The slider directly across from the text box provides a zoom function; you can increase the size of the image to a maximum of 20 times the original size. The reset button will reset the image to its original size and deselect all genes. The jump button will skip ahead in the genome map to the view of the genes selected in the “Annotations table”.

If you would like to view all ORFs listed in the “Unassigned ORFs table”, you may do so in the preference settings; the unassigned ORFs will then be displayed as yellow bars in the genome map as shown below.

picture

The GATU Buttons

These buttons provide access to the main functions of the application, such as “Annotate”, “VGO” (graphical view of the annotated genome), “BaseByBase” (an alignment editor) and “Save”.

The Main Buttons of GATU

picture

  • “Annotate” will run the GATU BLAST to generate the list of genes to be annotated using the reference genome.
  • “VGO” starts the Viral Genome Organizer program with the selected genes from the table.
  • “BaseByBase” opens the Base-By-Base program and conducts an alignment using the alignment program you select in the window that pops up. The alignment is based on the first gene you select and the corresponding gene of the reference genome.
  • “Save” exports the accepted annotations to a file. The file format depends on the selection you make: either GenBank, EMBL or XML (BSML).

Algorithm

Describes the basic algorithm used to annotate a genome
picture

  1. Read all genes/mature peptides of the reference genome and display them in the “Reference Genes” table.
  2. Conduct a tBLASTn/BLASTn alignment for every gene/mature peptide of the reference genome against the genome sequence to be annotated (tBLASTn is used for single exon genes and BLASTn for multiple exon genes).
  3. Use the highest scoring hit and make this a possible new gene/mature peptide. If the reference gene starts with a start codon, extend this hit to a start codon in the sequence to be annotated. If the reference gene ends with a stop codon, extend this hit to a stop codon in the sequence to be annotated. If the reference gene has no internal start/stop codon (atg/tag, tga, taa), verify that the hit has none either; if the hit does contain an internal start/stop codon, use the longest ORF.
  4. Run a NEEDLE alignment for each annotation found and mapped for the newly-annotated genome against the reference gene/mature peptide.
  5. Find all possible ORFs and display ORFs not found in step 3 in the “Unassigned-ORFs” table.
  6. Display all possible new genes/mature peptides found in step 3.
  7. Manual review and/or manual modifications.
  8. Apply annotations to genome and/or save annotations to file (GenBank, EMBL or XML (BSML) file).
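
The extension described in step 3 can be sketched as follows. This is an illustration only, not GATU's implementation: it handles a top-strand hit whose boundaries are already in frame, uses the nearest in-frame ATG upstream (GATU may choose differently), and leaves a boundary unchanged when no suitable codon is found. The name `extend_hit` and the 0-based, end-exclusive coordinates are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_hit(seq, start, end):
    """Extend the in-frame hit seq[start:end] to a start and stop codon.

    start/end are 0-based with end exclusive, and the hit length must
    be a positive multiple of 3. Returns (new_start, new_end).
    """
    seq = seq.upper()
    if end <= start or (end - start) % 3 != 0:
        raise ValueError("hit must be a positive multiple of 3 long")

    # Walk upstream codon by codon until an ATG is found; give up if a
    # stop codon or the beginning of the sequence is reached first.
    new_start = start
    pos = start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == "ATG":
            new_start = pos
            break
        if codon in STOP_CODONS:
            break
        pos -= 3

    # Walk downstream from the last codon of the hit to the first stop.
    new_end = end
    pos = end - 3
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            new_end = pos + 3
            break
        pos += 3
    return new_start, new_end
```

For a hit on the bottom strand, the same routine could be applied to the reverse complement of the genome, with the coordinates mapped back afterwards.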

 

GenBank Submissions

To submit an annotated genome using SEQUIN, you need a “FASTA file” and a “Feature Table”. To learn how to prepare these files for submission, refer to this guide.
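
To give a rough idea of what a feature table contains, the Python sketch below writes CDS entries in the tab-delimited five-column layout used by NCBI feature tables: a ">Feature" header line, one line with start, stop and feature key, then qualifier lines indented by three tabs. Features on the "-" strand are written with start greater than stop. The function name and the dictionary keys are assumptions, and a real submission normally needs more than this (gene features, further qualifiers, partial-location markers), so treat it as an illustration rather than a submission tool.

```python
def feature_table(seq_id, annotations):
    """Build a minimal five-column feature table with one CDS per entry.

    Each annotation is a dict with the hypothetical keys 'start',
    'stop', 'strand' ('+' or '-') and 'product'.
    """
    lines = [f">Feature {seq_id}"]
    for ann in annotations:
        start, stop = sorted((ann["start"], ann["stop"]))
        if ann["strand"] == "-":
            # minus-strand features are listed from high to low
            start, stop = stop, start
        lines.append(f"{start}\t{stop}\tCDS")
        lines.append(f"\t\t\tproduct\t{ann['product']}")
    return "\n".join(lines) + "\n"
```

The sequence identifier passed as `seq_id` has to match the identifier used in the FASTA file so that the features can be attached to the right sequence.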