STS User Guide

STS Quick Reference


 

STS Quick Reference — Controls, Buttons, Widgets, etc.

This document concisely describes the various features of the STS (“Suffix Tree Searcher”) program. Please see the User Manual for more comprehensive and detailed information.

MAIN PANES AND AREAS

Input pane

The input pane is used to enter queries, initiate most actions, and provide an overview of the input sequences used to construct the data set.

Top bar

The top bar contains the buttons used to initiate the import FASTA, build tree, import tree, search, and traverse actions. It also contains the Settings button, which opens the settings window.

  • Import FASTA – Allows the user to choose one or more FASTA files to import into STS. Sequences must be imported before other actions such as tree building can be taken. Multiple files can be selected using shift-click or cmd-/ctrl-click, and must reside in the same directory. The user is prompted to choose a name for the suffix tree to represent the data set, which by default is the name of the first file selected. This name will also be used for the name of the on-disk file that will contain all suffix tree data for the dataset.

  • Build tree – Initiates the construction of the suffix tree index for the input data set. This tree is required in order to query the data set. This process may take several minutes for very large inputs. Settings can be adjusted in the Settings window.

  • Import tree – Loads a previously constructed suffix tree. Since building the suffix tree is the most time-consuming part of the process, this option allows the user to skip the building phase the next time the program is opened. It also allows the sequences to be analyzed on a different computer without rebuilding, by transferring the suffix tree data and importing the tree there.

  • Traverse – Initiates a traversal of the suffix tree. Queries and settings are entered in the traversal section of the query input area. Additional settings are available in the Settings window.

Sequence display area

Provides an overview of the input sequences, in the form of a table possessing one row per input sequence. The columns in the table are described below.

  • Number – The number associated with the sequence.
  • SeqName – The name of the sequence as indicated in the FASTA file.
  • FileName – The name of the FASTA file that the sequence was imported from.
  • OrigSize – The number of bytes read from the file for the sequence.
  • DNASize – The number of bytes of the sequence that corresponded to the characters ‘acgt’. All other characters are ignored.
  • Begin – The first sixteen nucleotides of the sequence.
  • End – The last sixteen nucleotides of the sequence.
  • Wildcards – The number of wildcard characters (R, Y, W, S, M, K, H, B, V, D, and N) encountered in the sequence.
  • NChunks – The number of contiguous chunks of N characters encountered in the sequence.
  • GarbageBytes – The number of characters encountered in the sequence that are neither ACGT nor wildcards.
  • TailLength – The number of repeats of a single motif encountered at the end of the sequence. This column is deprecated; it was previously used for development purposes.
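The counting columns above can be illustrated with a short sketch. This is not the STS implementation: the function name `sequence_stats` is hypothetical, counting is assumed to be case-insensitive, and the input is assumed to be the sequence text with the FASTA header and line breaks already removed.

```python
import re

NUCLEOTIDES = set("ACGT")
WILDCARDS = set("RYWSMKHBVDN")  # wildcard codes listed for the Wildcards column


def sequence_stats(raw: str) -> dict:
    """Approximate the counting columns of the sequence table for one sequence.

    Assumption: `raw` holds only sequence characters (no header, no newlines).
    """
    seq = raw.upper()  # assumption: upper and lower case are treated alike
    dna_size = sum(1 for c in seq if c in NUCLEOTIDES)
    wildcards = sum(1 for c in seq if c in WILDCARDS)
    return {
        "OrigSize": len(raw),
        "DNASize": dna_size,
        "Wildcards": wildcards,
        # Each maximal run of consecutive N characters counts as one chunk.
        "NChunks": len(re.findall(r"N+", seq)),
        # Everything that is neither a nucleotide nor a wildcard.
        "GarbageBytes": len(seq) - dna_size - wildcards,
    }


stats = sequence_stats("acgtNNxRNNN-acg")
print(stats)
```

In this example the two runs "NN" and "NNN" give an NChunks value of 2, and the characters "x" and "-" are counted as garbage.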

Query input area

The query input area is used to specify the options for searches and traversals before they are initiated through top bar buttons. As such, the area is divided into Search and Traverse areas, each with their own input boxes.

Search

  • Pattern – Used to specify search patterns, using ACGT to specify nucleotides and * to specify wildcards. A pattern can also be specified by entering three numbers for <sequence, position, length>, e.g. “2, 125000, 25”. Multiple patterns can be entered by separating individual patterns with a semicolon “;”.
  • Mismatches – This box is used to specify the number of mismatches in the search pattern entered into the Pattern box. If left blank, the number of mismatches allowed will be zero. Mismatches currently do not work when specifying multiple patterns at once.
  • Max results – The maximum number of search results to display in non-standard results tables.
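As an illustration of the two pattern forms accepted by the Pattern box, the following sketch splits an input string on semicolons and classifies each part. The function name `parse_pattern_box` and its error handling are hypothetical; how STS itself validates input is not described in this guide.

```python
from typing import List, Tuple, Union

# A pattern is either a nucleotide string or a (sequence, position, length) triple.
Pattern = Union[str, Tuple[int, int, int]]


def parse_pattern_box(text: str) -> List[Pattern]:
    """Split Pattern-box text on ';' and classify each individual pattern."""
    patterns: List[Pattern] = []
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing ';'
        fields = [f.strip() for f in part.split(",")]
        if len(fields) == 3 and all(f.isdigit() for f in fields):
            # Three numbers: <sequence, position, length>
            seq, pos, length = (int(f) for f in fields)
            patterns.append((seq, pos, length))
        elif set(part.upper()) <= set("ACGT*"):
            # Nucleotides, with '*' as the wildcard character
            patterns.append(part.upper())
        else:
            raise ValueError(f"unrecognized pattern: {part!r}")
    return patterns


print(parse_pattern_box("ACG*T; 2, 125000, 25"))
```

Here the first pattern is kept as a nucleotide string and the second is returned as the triple (2, 125000, 25).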

Traverse

  • Threshold length – Common substrings must be at least this length in order to show up in results tables.
  • Display length – Restricts the number of nucleotides that are displayed for each common substring. When copying the sequence from the results table for a pattern search, the entire sequence will still be copied.
  • Max results – The maximum number of search results to display in non-default result sets.
  • Reference genome – Only display common substrings that are present in this sequence. Takes a sequence number, as shown in the Number column of the Sequence display area of the Input pane.

Console

The console area displays output from the backend executables when building the suffix tree or importing sequences, such as the amount of time a process took to complete, and is used to report any errors that arise during a process. Red text will be shown here if an error occurs.

Output pane

The output pane is used to display and organize the output tables produced by searches and traversals.

Top bar

  • Clear All – Remove all results tables from the display area.
  • View Inputs – Open a window that maps sequence numbers to sequence names. This is useful for determining which result corresponds to which sequence(s).
  • Save Results – Save the current results tables to disk, so that they can be re-loaded later.
  • Load Results – Load previously saved results tables.

Results display area

Traversal result tables
Each row in a traversal result table corresponds to an LCS (longest common substring). LCSs are always maximal: they cannot be extended in either direction without violating one or more of their properties. The columns listed below are present in all traversal result tables.

  • Number – The hit number. This number serves to identify the specific LCS in the context of the given result table.
  • Where – A string of at most 32 binary digits (1s and 0s), one per input sequence. A 1 in a given position indicates that the substring is present in the sequence corresponding to that position, starting at one on the left and counting up for each position to the right. For example, a value of 100110 indicates that a substring is present in sequences 1, 4, and 5.
  • K – The number of sequences the substring is present in.
  • N – The number of times the substring occurs across the sequences that the table displays results for.
  • NC – The number of additional substrings that share the same N and length values as the displayed substring, since only one substring is displayed for each length. For example, in a table that displays only one result for each value of N, if two different substrings exist for N=2, only the longer one is displayed. If both are the same length, the table still displays only one, and the NC value is 1 to account for the substring that is not shown.
  • Sequence – One of the input sequences that the substring occurs in.
  • Position – The start position of the substring in the sequence specified in the sequence column.
  • Length – The length of the substring.
  • Pattern – The nucleotide sequence of the substring. The number of nucleotides displayed here is controlled with the “Display length” box in the Traverse section of the Input pane. The user can right-click the pattern to perform a pattern search for the sequence in the whole data set, without the need to copy and paste the pattern into the Input pane.
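The Where column can be decoded mechanically. The sketch below follows the description above (leftmost digit is sequence 1); the function name `decode_where` is hypothetical and is not part of STS.

```python
from typing import List


def decode_where(where: str) -> List[int]:
    """Return the 1-based sequence numbers whose digit in `where` is 1."""
    if len(where) > 32 or set(where) - {"0", "1"}:
        raise ValueError("expected at most 32 binary digits")
    # Position 1 is the leftmost digit; positions count up to the right.
    return [i for i, bit in enumerate(where, start=1) if bit == "1"]


sequences = decode_where("100110")
print(sequences)       # [1, 4, 5]
print(len(sequences))  # 3, the value shown in the K column
```

The number of 1s in the string equals the K value for the same row.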

Search result tables
Although the number of hits displayed is controlled by the user, the table title always shows the total number of hits.

  • Number – The hit number. This number serves to identify the specific result in the context of the given result table.
  • Sequence – The sequence that the hit occurred in.
  • Position – The position of the hit in the sequence.
  • Length – The length of the hit.
  • Right – The number of nucleotides remaining to the right (3′) of the hit in the given input sequence.
  • SeqLength – The length of the sequence that the hit was found in.
  • Count – The total number of hits for the query pattern.
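The Right column is related to Position, Length, and SeqLength by simple arithmetic. The sketch below shows that relationship under a stated assumption: this guide does not say whether positions are reported starting from 0 or from 1, so the hypothetical function `right_remaining` takes the indexing base as a parameter.

```python
def right_remaining(seq_length: int, position: int, length: int,
                    one_based: bool = True) -> int:
    """Nucleotides remaining to the right (3') of a hit.

    Assumption: `one_based` states whether `position` counts from 1 or 0;
    the convention used by STS is not given in this guide.
    """
    start = position - 1 if one_based else position  # 0-based start offset
    right = seq_length - (start + length)
    if right < 0:
        raise ValueError("hit extends past the end of the sequence")
    return right


print(right_remaining(100, 10, 25))                   # 66
print(right_remaining(100, 10, 25, one_based=False))  # 65
```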

Console

The console area displays output from the backend executables during searches or traversals, such as the amount of time a process took to complete, and is used to report any errors that arise during a process. Red text will be shown here if an error occurs.

Settings

The settings window contains tabs for altering the settings of STS:

General settings

Settings pertaining to the general operation of the program.

  • Workspace – Specifies the root directory that all files will be written to. This should be the same as the parent directory of the <suffix_tree_name>.ST file, if loading a previously constructed suffix tree.
  • Clear results – If selected, STS will delete all results tables in the Output pane every time a new search or traversal is run.
  • Advanced – This section is intended primarily for development/debugging purposes. It allows the user to feed command-line arguments directly to the individual executables that implement the back end of STS.

Build settings

Settings for suffix tree building.

  • Partition size (number of nodes) – The number of nodes per suffix tree partition. The average user should not need to adjust this, as the partition size is automatically adjusted for the user’s computer when the suffix tree is built.
  • Suffix buffer size (each) – The buffer size when constructing the suffix tree. The average user should not need to adjust this.
  • Skip sort phase – Skips the suffix sorting phase of construction. This may increase memory load during the subsequent merge phase. When building suffix trees for large data sets, selecting this option may decrease the build time.

Traverse settings

Settings governing the types of result sets displayed from a traversal. Each setting will produce additional table(s).

  • LCS occurrences – Default set. Gathers the longest common substring (LCS) appearing N times in any sequence, for each value of N from 2 up to the number specified. Produces a single result table.
  • LCS inputs – Default set. Gathers the single longest substring common to K sequences, for each value of K from 2 up to the specified number of inputs. Produces a single result table.
  • Number of occurrences – Similar to LCS occurrences, but allows the user to specify the exact values of N, as a comma-separated list, e.g. 2, 3, 5. Produces a result table for each value of N.
  • Number of inputs – Similar to LCS inputs, but allows the user to specify the exact values of K, as a comma-separated list, e.g. 2, 3, 4. Produces a result table for each value of K.
  • Sets – Allows the user to specify subsets of sequences, each of which gets its own result table. Each set is specified with curly braces surrounding a comma/space-separated list of the sequences in the set, e.g. {1, 2, 5, 7}. Multiple sets are specified by placing them next to each other, e.g. {1, 2, 5, 7}{1, 4, 6}. A result table will be produced for each set. The number of substrings displayed can be specified in the “Max results” box of the Traverse section of the Input pane.
  • All singles – Produce a result table for each single sequence.
  • All pairs – Produce a result table for each possible pair of sequences.
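To make the Sets syntax and the All pairs option concrete, the sketch below parses a Sets string into lists of sequence numbers and enumerates every pair for a given number of input sequences. The function names `parse_sets` and `all_pairs` are hypothetical; this only illustrates the notation and is not how STS processes the settings internally.

```python
import re
from itertools import combinations
from typing import List, Tuple


def parse_sets(text: str) -> List[List[int]]:
    """Parse text such as '{1, 2, 5, 7}{1, 4, 6}' into lists of sequence numbers."""
    sets = []
    for body in re.findall(r"\{([^{}]*)\}", text):
        # Members may be separated by commas, spaces, or both.
        members = [int(tok) for tok in re.split(r"[,\s]+", body.strip()) if tok]
        sets.append(sorted(set(members)))
    return sets


def all_pairs(n_sequences: int) -> List[Tuple[int, int]]:
    """Every unordered pair of sequence numbers 1..n_sequences."""
    return list(combinations(range(1, n_sequences + 1), 2))


print(parse_sets("{1, 2, 5, 7}{1, 4, 6}"))  # [[1, 2, 5, 7], [1, 4, 6]]
print(all_pairs(3))                          # [(1, 2), (1, 3), (2, 3)]
```

Since one table is produced per set or pair, All pairs yields n(n-1)/2 tables for n input sequences, which grows quickly for larger data sets.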

Search settings

Settings governing the display of search results.

  • One result set – If selected, forces results for multi-pattern queries to all be displayed in a single table, rather than separate tables for each pattern.