VGO how-to documentation


VGO is a Java-based interface for viewing and searching genomes. It can be used to display information about a genome, including its genes, ORFs, and start/stop codons. It can also be used to perform regular expression, fuzzy motif, LCS, and mass list searches.

VGO allows the user to identify related genes in multiple sequences. It also shows how search results compare with the identified genes, ORFs, and start/stop codons.

Open New Genomes in VGO

Analysis in VGO

Viewing Sequence Information in VGO

Searching in VGO

Open New Genomes in VGO

From the VOCs database into VGO
To load sequences from the VOCs database into VGO, go to the sequence listing window. VGO can only load sequences from one database at a time; if the correct database is not currently open, select Choose Database from the File menu.
Database Chooser
Select the database that you want, and click the Choose button.
Once you have the correct database selected, go to the File menu, choose Open, then choose Open from VOCs DB. The following dialog window will appear.
Sequence Chooser
Choose the sequences that you want to load, then click the OK button. To select multiple sequences, hold down the Apple key (Mac) or CTRL key (PC) while clicking.
From a Fasta File
To load sequences from a .fasta file, go to the sequence listing window. Go to the File menu, choose Open, then choose Open from Fasta. Select the file that you want to load, and each sequence will be loaded.
Note: Fasta files do not contain any annotations, only sequence data.
From a GenBank File
To load sequences from a GenBank file, go to the sequence listing window. Go to the File menu, choose Open, then choose Open from GenBank. Select the file that you want to load, and each sequence will be loaded, along with its annotations.
From a Base-by-Base File
To load sequences from a Base-by-Base file, go to the sequence listing window. Go to the File menu, choose Open, then choose Open from BaseByBase. Select the file that you want to load, and each sequence will be loaded, along with any annotations that have previously been added to the file in Base-by-Base.

Viewing Sequence Information in VGO

Sequence Map Legend
To open the sequence map simply double click on the sequence you wish to view in the home window (sequence listing window).  For multiple sequences, select them using the Apple key (Mac) or CTRL Key (PC) and then click the  button.

This is a representation of the data mapping that is available within VGO. Here is an explanation of each of the items indicated above, and how to use them.
  1. Sequence Search Results
     Description: Maps the locations of sequence search hits. There may be as many searches done on the sequence as you please.
     Single Click: Selects the region covered by the hit and displays information about the hit in the status bar.
     Double Click: Opens a sequence viewer with the sequence represented by the clicked on hit displayed.
  2. Imported Analysis
     Description: Shows the location of features returned by some external source and brought as input to VGO.
     Single Click: Selects the region covered by the feature selected.
     Double Click: Opens a sequence viewer with the sequence represented by the feature displayed.
  3. Colorized Gene Analysis
     Description: Displays the genes of this genome colored based on particular properties. See the gene analysis page for more information.
     Single Click: Selects the region, displaying pertinent information on the status bar.
     Double Click: Opens a legend describing the meaning of the colors used.
  4. Open Reading Frames
     Description: Displays the open reading frames in the 3 frames of the sequence.
     Single Click: Selects the region covered by the selected ORF.
     Double Click: Opens a sequence display showing the sequence for the selected ORF.
  5. Start and Stop codons
     Description: Shows the location of all start (green) and stop (red) codons in the sequence.
     Single Click: No Action
     Double Click: No Action
  6. Gene Features
     Description: Shows the location of all genes in this sequence. Genes covering more than one open reading frame will be indicated with a connector. Through view settings you may display the gene label as the gene name, the family number for the gene or nothing at all for compactness.
     Single Click: Selects the region indicated by the feature.
     Double Click: Opens the Gene Data Window for this gene.
View the details of the current sequence

You can view the sequence details from either the sequence listing window (the original window) or the sequence display window.

Sequence Listing Window
From the View menu, choose the Virus Info menu item.
Sequence Display Window
From the View menu, choose the Virus Info menu item.
Viewing Top/Bottom Strand DNA
The Sequence Viewer window displays a numbered, scrollable listing of the entire DNA sequence for the currently selected organism. To get to this window, go to the View menu and choose either Top Strand DNA or Bottom Strand DNA.
This window listens to appropriate selections in the sequence map window. For instance, if you have an organism open in the sequence map window, with the Top Strand DNA window also open, you can see the results of selections made in the sequence map in the DNA window.
This window should be used to view DNA sequences, and relate them to information displayed on the sequence map. To copy sequence data to the clipboard, it is recommended that you use the “Genome Subsequence” facility, from the “View” menu.
View a 6 frame amino acid translation of the selected sequence.
The framed translation facility of VGO displays a 6 frame amino acid translation of a selected portion of sequence.
Framed Translation Comparison
To get a framed translation for one or more sequence segments, select portions of one or more organisms (via the Sequence Map). Then, in the main window, select those organisms you wish to display, then choose “Framed Translation” from the “View” menu.
Note: The framed translator can handle selections of no more than 10kb in length. If a larger selection is attempted, a warning dialog box will be displayed.
The framed translation window displays each amino acid in single letter abbreviated format with stop codons marked with a “*”. The ruler along the top marks each 10 bases in the DNA sequence with a tick, and each 20 bases are marked with their position in the entire sequence for the organism being displayed.
The sequence displayed on the reverse side is the complement of that on the forward strand. Genes occurring on the top are coded from left to right, and on the bottom from right to left.
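As a rough illustration of what a 6 frame translation computes, here is a minimal Python sketch. It is not VGO's own code: the function names, the frame labels (+1 to +3 and -1 to -3), and the choice to mark unrecognized codons with "X" are assumptions made for this example. It uses the standard genetic code and marks stop codons with "*", as the framed translation window does.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the bottom strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels to amino acid strings."""
    dna = dna.upper()
    bottom = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(bottom[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(label, protein)
```

The three "-" frames are translated from the reverse complement, which corresponds to reading bottom-strand genes from right to left on the map.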
View information about a gene
This window allows for closer analysis of a particular gene. To open this window, double click on a gene in the sequence map window.
In this window, you are able to view:
  1. The name of this gene as stored in the database
  2. The multiple alignment of this particular gene’s family through the use of Jalview
  3. The protein sequence for the gene (This information is selectable and may be used to copy into other applications, such as dotlet or other alignment programs)
  4. The DNA sequence for the gene by clicking this button
  5. Details of this gene, such as amino acid frequency, name, etc. by clicking this button
  6. The most recent Tblastn, Psiblast or Blastp reports generated for this gene by clicking one of these buttons
  7. The Blast reports for this gene through MView
  8. The genes contained within this particular gene’s family by clicking this button
Viewing Options for the Sequence Display Window
The following is a list of independent display options for the Sequence Display window.  These options may be turned on or off for each individual sequence or for all sequences.

    Start/stop codons
    ORF with a minimum ORF length option
    Bottom strand
    Gene Labels
        – Gene number
        – Family number
        – Short Gene number
        – GenBank name
    Lane Descriptions
    GFS
    Repeat Regions
    BBB Comments
    BBB Primers

Searching in VGO

Search Options
On the top menu, under Analysis, you will find four search options listed.
Search All Sequences
Searches all the sequences you currently have open, using either a Fuzzy or a Regular Expression search.
Search Selected Sequence
Searches only the selected sequences, using either a Fuzzy or a Regular Expression search.
LCS Search
Searches for the longest common subsequence (or a common substring of a minimum length), with a few options.
DLCS Search
Searches for the shortest common subsequence with a variable number of mismatches.

GFS (Genome-based fingerprint scanning)

This search identifies the genomic origins of sample proteins by mapping their peptide mass fingerprint data directly to raw genomic sequence. GFS was developed by the Giddings lab at the University of North Carolina.
Regular Expression Search

You can do a regular expression search from the sequence display window: select Reg. Expression Search from the Analysis menu. The following dialog will then appear.
Enter your regular expression in the textbox, and click on the OK button to perform the search, or the Cancel button to close the dialog window without performing the search.
Regular expressions allow one to search for precise patterns, which may include optional sections and/or repeated sequences. For detailed help on regular expressions, see the Perl Regular Expression page.

Examples of Regular Expression Searching
Regular Expression    What it matches
ACT                   ACT
[AC]T                 A or C followed by T
AC[^T]ACT             AC followed by anything BUT a T followed by ACT
ACT*                  AC followed by 0 or more T’s
(ACT)*                ACT repeated 0 or more times
(ACT)+                ACT repeated 1 or more times
(ACT)?                ACT repeated 0 or 1 times
(ACT){n}              ACT repeated n times
(ACT){n,}             ACT repeated at least n times
(ACT){n,m}            ACT repeated at least n times but not more than m times
((AC)[TA]){n}         AC followed by T or A, repeated n times
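To experiment with patterns like these outside VGO, the following Python sketch runs a few of them against a short made-up sequence using Python's standard re module. This is only an illustration: VGO's own regular expression engine may differ in details, and the helper name find_hits, the test sequence, and the 1-based inclusive coordinates are assumptions for this example.

```python
import re


def find_hits(pattern, sequence):
    """Return (start, end, text) for each non-overlapping, non-empty match.

    Positions are 1-based and inclusive. Empty matches, which patterns
    such as (ACT)* can produce, are skipped.
    """
    return [
        (m.start() + 1, m.end(), m.group())
        for m in re.finditer(pattern, sequence)
        if m.group()
    ]


if __name__ == "__main__":
    sequence = "GGACTACTTTCTAT"  # made-up example sequence
    for pattern in ["ACT", "[AC]T", "ACT+", "(ACT){2}"]:
        print(pattern, find_hits(pattern, sequence))
```

Matches are reported left to right and do not overlap, so a region already consumed by one hit is not reported again as part of another.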
Fuzzy Motif Search
You can do a fuzzy motif search from the sequence display window: select Fuzzy Motif Search from the Analysis menu. The following dialog will then appear.

Enter your fuzzy motif in the top textbox and enter the number of mismatches to allow in the lower textbox. Click on the OK button to perform the search, or the Cancel button to close the dialog window without performing the search.
The Fuzzy Motif Search allows users to enter an expression pattern (see below for an explanation of the pattern grammar used) as well as a maximum number of mismatches tolerated in a search hit. VGO then searches marked sequences for this motif and displays the list of hits by location along the sequence. In addition to the ambiguities created by mismatches, users may enter IUB ambiguity codes, which are also indicated below.

Examples of Fuzzy Motif Searching
Fuzzy Expression    What it matches
ACT                 an A, C, T pattern
[AC]T               an A or a C followed by a T
{AC}T               Everything but an A or a C followed by a T
ACT{1,3}            An A, then C followed by 1, 2 or 3 T’s

Note: When counting mismatches, [] and {} count as a single match or mismatch. As well, if matching T(2,4) and only 1 T is found, this counts as a single mismatch.
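The mismatch counting rule can be sketched in Python as follows. This is a simplified illustration, not VGO's implementation: it handles only single bases, [..] groups, and {..} exclusion groups, it does not support repetition ranges or ambiguity codes, and the function names and 1-based hit positions are assumptions for this example. Each motif position, whether a single base or a bracketed group, contributes at most one mismatch.

```python
def parse_motif(motif):
    """Convert a motif such as 'A[CG]{T}' into one set of allowed bases per position."""
    positions = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch in "[{":
            close = motif.index("]" if ch == "[" else "}", i)
            group = set(motif[i + 1:close].upper())
            # [..] lists the allowed bases; {..} lists the excluded bases.
            positions.append(group if ch == "[" else set("ACGT") - group)
            i = close + 1
        else:
            positions.append({ch.upper()})
            i += 1
    return positions


def fuzzy_search(motif, sequence, max_mismatches):
    """Return (start, matched_text, mismatches) for every window within tolerance."""
    positions = parse_motif(motif)
    sequence = sequence.upper()
    width = len(positions)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            base not in allowed for base, allowed in zip(window, positions)
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


if __name__ == "__main__":
    print(fuzzy_search("ACT", "GACTAGT", 1))
    print(fuzzy_search("{AC}T", "GTAT", 0))
```

Unlike the regular expression search, every window is tested independently, so overlapping hits are all reported.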

Table of IUPAC Ambiguity Codes
IUPAC-IUB/GCG Code    Meaning                   Complement
A                     A                         T
C                     C                         G
G                     G                         C
T/U                   T                         A
M                     A or C                    K
R                     A or G                    Y
W                     A or T                    W
S                     C or G                    S
Y                     C or T                    R
K                     G or T                    M
V                     A or C or G               B
H                     A or C or T               D
D                     A or G or T               H
B                     C or G or T               V
X/N                   G or A or T or C          X
.                     Not G or A or T or C      .
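The table translates directly into lookup dictionaries. The Python sketch below is an illustration only: the names are hypothetical, and mapping N to N (rather than to X) is an assumption of this example, since the table lists X and N on a single row with complement X.

```python
# Complement of each code, following the table above.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "X": "X", "N": "N",  # assumption: N is kept as N
    ".": ".",
}

# Bases represented by each code ('.' represents none of them).
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT", ".": "",
}


def reverse_complement(sequence):
    """Complement every code, then reverse to read the opposite strand."""
    return "".join(IUPAC_COMPLEMENT[code] for code in reversed(sequence.upper()))


def code_matches(code, base):
    """True if the ambiguity code allows the given concrete base."""
    return base.upper() in IUPAC_BASES[code.upper()]


if __name__ == "__main__":
    print(reverse_complement("ATGCRY"))
    print(code_matches("R", "G"), code_matches("R", "C"))
```

A useful consistency check is that each code's complement represents exactly the complements of the bases the code itself represents, for example M (A or C) and K (T or G).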
GFS (Genome-based fingerprint scanning)
You can do a GFS search from the sequence display window: select GFS from the Analysis menu.
Enter your mass list in the textbox labeled Mass List, or click on the Load Mass List File button and select a file that contains the list of masses. When you are ready, click on the OK button to perform the search, or the Cancel button to close the dialog window without performing the search. You may see another dialog window depending on your preferences where you can enter the parameters for the GFS search.

Analysis in VGO

Gene analysis tools
There are currently 3 properties that can be used to color-code genes on the sequence map in VGO. These properties are: Base Composition, Amino Acid Composition and Family Representation. When this menu item is selected, a dialog appears requesting parameters to display. Once these are set, VGO maps this data on the screen. To see a longer explanation of the parameters chosen and a range of values, double click on one of the resulting gene features.

Base Composition
This colors genes based on the percentage of one or more nucleotide bases that they contain.
Amino Acid Composition
This colors genes based on the percentage of one or more amino acids that they contain.
Family Representation
This maps the virus frequency of each gene in the displayed genome. This is to say, it maps based on the number of viruses represented in the particular gene’s family.
Import Analysis into VGO
VGO allows for the import of analyses done externally.
Currently, this is done through the use of flat text files which contain descriptions of specific regions of interest. This information includes the start position, the end position, the strand and a description of the regions. Any number of different analyses may be contained within a file, and any number of regions may be contained within an analysis.
To open a file, select “Import Analysis” from the “Analysis” menu on the sequence map window.
The file format is as follows:
>Analysis 1 name
start|stop|strand|description|color
start|stop|strand|description|color

>Analysis 2 name
etc.
An Example:
>First VGO Import Analysis Example
100|400|POSITIVE|region 1|2F4F4F
500|600|NEGATIVE|region 2|2E8B57
>Second VGO Import Analysis Example
400|500|POSITIVE|region 3|A52A2A
600|700|NEGATIVE|region 4|A52A2A
This example would produce the following mapping in the VGO sequence map
Note: The analysis will fail to import (without any error messages) if the analysis file is formatted incorrectly. The most common problem is a Start value that is larger than the Stop value. Therefore, for genes on the negative strand, it is necessary to reverse the start and stop positions so that the smaller coordinate comes first.
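Because a badly formatted file fails silently, it can help to check a file before importing it. The following Python sketch is one possible checker, not part of VGO. The start-before-stop rule comes from the note above; the other rules (exactly five fields, strand spelled POSITIVE or NEGATIVE as in the example, color as six hexadecimal digits) are assumptions inferred from the example file, and the function name is hypothetical.

```python
import re

HEX_COLOR = re.compile(r"[0-9A-Fa-f]{6}")   # assumed color format
STRANDS = {"POSITIVE", "NEGATIVE"}          # values used in the example file


def check_analysis_text(text):
    """Parse import-analysis text and return (analyses, problems)."""
    analyses = {}   # analysis name -> list of accepted regions
    problems = []   # (line number, message) for each rejected line
    current = None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            analyses[current] = []
            continue
        fields = [field.strip() for field in line.split("|")]
        if current is None or len(fields) != 5:
            problems.append((number, "expected 5 fields under a >name line"))
            continue
        start, stop, strand, description, color = fields
        if not (start.isdigit() and stop.isdigit()):
            problems.append((number, "start and stop must be whole numbers"))
        elif int(start) > int(stop):
            problems.append((number, "start is larger than stop"))
        elif strand not in STRANDS:
            problems.append((number, "strand must be POSITIVE or NEGATIVE"))
        elif not HEX_COLOR.fullmatch(color):
            problems.append((number, "color must be 6 hex digits"))
        else:
            analyses[current].append(
                (int(start), int(stop), strand, description, color)
            )
    return analyses, problems


if __name__ == "__main__":
    example = """>Hypothetical analysis
100|400|POSITIVE|region 1|2F4F4F
600|500|NEGATIVE|region 2|2E8B57
"""
    for number, message in check_analysis_text(example)[1]:
        print(f"line {number}: {message}")  # line 3: start is larger than stop
```

Reporting line numbers makes it easy to find the offending region in a long file, which VGO itself does not do.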
Display Graph of Nucleotide Base Content
Currently, VGO is able to plot nucleotide composition, sampled at customizable rates, along the entire genome. To get this information, select “Display Graph” from the “Analysis” menu in the sequence map. The currently active panel will then display a graph of the percent composition of the bases chosen.
This graph panel will appear at the bottom of the sequence map and display the following information:

  • A plot line in red. This plots the data along the sequence at the scale currently displayed in the map window
  • A description of the data being plotted
  • The sampling window size, which is customizable through the use of a slider
  • The minimum percent composition along the entire genome (Y-axis minimum)
  • The maximum percent composition along the entire genome (Y-axis maximum)
  • The mean percent composition along the currently viewable portion of sequence. This is displayed in blue.

Sequence Map Display with Graph
Graph calculations are done using a sliding window, sampled at regular fractions of the window size. By default, the fraction of the window size used to sample the data is 1/3. This means that with a window size of 60, data will be sampled every 20 bases. In addition, the value displayed at each sampled position will be the average from that position forward to the end of the window. For performance reasons, only the data currently displayed on the screen will be sampled. A side effect of this is that the shape of the graph may appear to change as you scroll left or right through the map display.
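The sampling scheme can be sketched in Python as below. This is an approximation for illustration, not VGO's code: the function and parameter names are hypothetical, and dropping incomplete windows at the end of the sequence is an assumption. With the default window of 60 and fraction of 1/3, a value is produced every 20 bases, each computed from that position forward over one full window.

```python
def composition_profile(sequence, bases="GC", window=60, fraction=3):
    """Return (position, percent) pairs for the chosen bases.

    A value is sampled every window // fraction bases, and each value is
    the percent composition from that position forward over one window.
    Positions are 1-based; trailing partial windows are skipped.
    """
    sequence = sequence.upper()
    wanted = set(bases.upper())
    step = max(1, window // fraction)
    profile = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        percent = 100.0 * sum(base in wanted for base in chunk) / window
        profile.append((start + 1, percent))
    return profile


if __name__ == "__main__":
    # Made-up sequence: 60 G's followed by 60 A's.
    for position, percent in composition_profile("G" * 60 + "A" * 60):
        print(position, round(percent, 1))
```

Because each value looks forward over a whole window, the plotted line starts to fall before the position where the composition actually changes, which is worth keeping in mind when reading the graph.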