Python/Perl Scripts

Here is a list of Python and Perl scripts. For information on how to run Python scripts, go here.
Note: All Python scripts run in Python 2.

SNIP.py
BSNIP.py
cSNP.py
offset.py
genbankToTbl.py
FindLongestOrf.pl

SNIP.py

Download
This script examines each column of a multiple sequence alignment, determines the consensus nucleotide in that column, and then counts the number of times that a nucleotide other than the consensus appears. That count is compared against a threshold value. If the threshold is set to 1 in the SNIP.py script, then a non-consensus nucleotide that appears only once in a given column is changed to the consensus nucleotide. The threshold can be set to any number by opening the script in a text editor and changing THRESHOLD = n, where n is the maximum number of nucleotides differing from the consensus that will be changed.
For example, if you set n=3, then whenever 3 or fewer nucleotides differ from the consensus in a column of the alignment, they will all be changed to the consensus nucleotide. If 4 nucleotides differed from the consensus in a column in this same example, that column would be left unchanged.
Note: This script now handles columns containing more than two different nucleotides. If it detects, for example, an A, C, and T in a single column, it no longer skips the column and moves on to the next; it changes any nucleotide to the consensus if its count is less than or equal to the threshold.

STRAIN 1    A C G G T G T T A A C C T C G A
STRAIN 2    A C T G T A T T A T C C T C G A
STRAIN 3    A C G G T A T T A A C A T C G A
STRAIN 4    A C C G T A T T A A C C T C G A
STRAIN 5    A C G G T G T T A A C C T C G A

Figure 1: Multiple sequence alignment before SNIP.py program alterations. Threshold value set to 1.

STRAIN 1    A C G G T G T T A A C C T C G A
STRAIN 2    A C G C T A T T A A C C T C G A
STRAIN 3    A C G G T A T T A A C C T C G A
STRAIN 4    A C G G T A T T A A C C T C G A
STRAIN 5    A C G G T G T T A A C C T C G A

Figure 2: The same multiple sequence alignment after modification with the SNIP.py script. The values in blue have been changed to the consensus value. The red ones remain the same due to the set threshold value.
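
The column-by-column logic described above can be sketched in a few lines of Python. This is a minimal illustration written for Python 3, not the code of SNIP.py itself. The function name snip and the tie-breaking rule (the first most common nucleotide is taken as the consensus) are assumptions, and the sketch assumes the threshold is applied to the count of each non-consensus nucleotide separately. The input is the alignment from Figure 1.

```python
from collections import Counter


def snip(sequences, threshold=1):
    """Return a copy of the alignment with rare variants set to consensus.

    Hypothetical helper, not taken from SNIP.py: a nucleotide is replaced
    when it differs from the column consensus and appears in that column
    no more than `threshold` times.
    """
    rows = [list(seq) for seq in sequences]
    for col in range(len(rows[0])):
        # Count the nucleotides in this column before changing anything.
        counts = Counter(row[col] for row in rows)
        # Assumption: ties are broken by whichever nucleotide comes first.
        consensus = counts.most_common(1)[0][0]
        for row in rows:
            base = row[col]
            if base != consensus and counts[base] <= threshold:
                row[col] = consensus
    return ["".join(row) for row in rows]


# The five strains from Figure 1.
figure_1 = [
    "ACGGTGTTAACCTCGA",
    "ACTGTATTATCCTCGA",
    "ACGGTATTAACATCGA",
    "ACCGTATTAACCTCGA",
    "ACGGTGTTAACCTCGA",
]

for number, seq in enumerate(snip(figure_1, threshold=1), start=1):
    print("STRAIN", number, seq)
```

With threshold=1, the single T and single C in column 3 are replaced by G, while the two G's in column 6 are left alone because they appear more than once.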

Running the script entails placing a copy of the script, plus a copy of the input file (in fasta format) into the same directory. On a Mac, if your folder was named ‘genome analysis’ and located on the desktop, the command to navigate to the appropriate folder would look something like this:


cd "/Users/username/Desktop/genome analysis"

From this directory (where you’ve placed your script and input fasta) you would then enter the command:


python snip.py -i input.fasta -o output.fasta [-f] [-t <threshold value>]

The parts of this command are:
“python” – tells the computer that you’re about to run a Python script
“snip.py” – tells the computer the name of the script you want to run
“-i” – tells the computer that the next argument is your input filename
“-o” – tells the computer that the next argument is the output filename that you want for your modified alignment
“-f” – forces the script to return the modified file in fasta format
“-t” – tells the computer you want to specify a threshold value for changing the differing nucleotides to the consensus. By default the threshold is set to 1.
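
To show how the options fit together, here is a hedged sketch of the same command line using Python's standard argparse module. It is written for Python 3 and is not how SNIP.py is actually implemented; the destination names (input, output, force_fasta, threshold) are made up for this example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="snip.py")
    parser.add_argument("-i", dest="input", required=True,
                        help="input alignment in fasta format")
    parser.add_argument("-o", dest="output", required=True,
                        help="filename for the modified alignment")
    parser.add_argument("-f", dest="force_fasta", action="store_true",
                        help="force fasta output")
    parser.add_argument("-t", dest="threshold", type=int, default=1,
                        help="threshold value (default: 1)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-i", "input.fasta", "-o", "output.fasta", "-t", "3"]
)
print(args.input, args.output, args.force_fasta, args.threshold)
```

The square brackets in the usage line mark -f and -t as optional, which is why only -i and -o are required in this sketch.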

BSNIP.py

Download
This is essentially the same program as SNIP.py, with the same commands. Instead of using fasta-formatted files, it modifies .bbb files, thus preserving any comments or annotations made to your alignment. Script usage is as follows:


 python bsnip.py -i inputfilename.bbb -o outputfilename.bbb [-t <threshold value>]

cSNP.py

Download

This program deletes every column of your alignment that is entirely conserved, producing an alignment that consists only of columns where genetic diversity is present: a concatenated single nucleotide polymorphism alignment.

Example:

STRAIN 1    A C G G T G T T A A C C T C G A
STRAIN 2    A C T G T A T T A T C C T C G A
STRAIN 3    A C G G T A T T A A C A T C G A
STRAIN 4    A C C G T A T T A A C C T C G A
STRAIN 5    A C G G T G T T A A C C T C G A

Figure 3: Pre-modified alignment

STRAIN 1    G G A C
STRAIN 2    T A T C
STRAIN 3    G A A A
STRAIN 4    C A A C
STRAIN 5    G G A C

Figure 4: Alignment post-modification with cSNP.py.

The usage of the script is as follows:


 python csnp.py -i inputfilename.fasta -o outputfilename.fasta [-f] [-d]

Arguments for the cSNP.py script:
-i    input fasta file
-o    output file
-f    force fasta file output
-d    count dashes as variation*
* Since the user may or may not want to count gaps as genetic diversity, the program ignores gaps by default and treats them as a consensus position. If you want to include gaps in your analysis as diversity, add -d at the end of the command line, and all positions where a gap appears will be included in the concatenated SNP alignment.
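
The column filtering, including the effect of -d, can be illustrated with a short Python sketch. This is written for Python 3 and is an assumption about the behaviour described above, not the code of cSNP.py; the names csnp and count_gaps are hypothetical. The input is the alignment from Figure 3.

```python
def csnp(sequences, count_gaps=False):
    """Keep only the alignment columns that show variation.

    Hypothetical helper: with count_gaps=False a dash is ignored when
    deciding whether a column is conserved (the default described above);
    with count_gaps=True a dash counts as a difference, like -d.
    """
    kept = []
    for col in range(len(sequences[0])):
        symbols = {seq[col] for seq in sequences}
        if not count_gaps:
            symbols.discard("-")
        # More than one distinct symbol means the column is not conserved.
        if len(symbols) > 1:
            kept.append(col)
    return ["".join(seq[col] for col in kept) for seq in sequences]


# The five strains from Figure 3.
figure_3 = [
    "ACGGTGTTAACCTCGA",
    "ACTGTATTATCCTCGA",
    "ACGGTATTAACATCGA",
    "ACCGTATTAACCTCGA",
    "ACGGTGTTAACCTCGA",
]

for number, seq in enumerate(csnp(figure_3), start=1):
    print("STRAIN", number, seq)

# A gap-only difference is dropped by default and kept with count_gaps=True.
print(csnp(["AC-T", "ACGT", "ACGT"]))
print(csnp(["AC-T", "ACGT", "ACGT"], count_gaps=True))
```

Applied to Figure 3, the sketch keeps columns 3, 6, 10, and 12, which are the four columns shown in Figure 4.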

offset.py

Download 
This script was designed specifically for use with BBB in cases where you’ve truncated the left-hand side of the alignment, for example to do a core analysis.

The script will modify a Genbank file containing gene annotations by offsetting the gene position numbers by a designated number. For example, if you’ve chopped off the first 400nt of bad sequence from your analysis, you would want to tell the program to offset the gene locations in the Genbank file by 400nt so that you can import it into BBB and maintain the correct gene positions for your annotation.

The usage is as follows:

  python offset.py -i input_genbank_filename.gb -n 400 -o output_genbank_filename.gb
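
The core of such an offset is shifting every coordinate in a Genbank feature location. The Python 3 sketch below shows one way this could be done; it is not the code of offset.py, and the function name shift_location is hypothetical. The sketch takes an explicit signed shift (here -400, to account for 400 nt removed from the start); how offset.py interprets the sign of -n is not stated here. It only handles simple local locations and would also alter digits in remote references that contain an accession number.

```python
import re


def shift_location(location, shift):
    """Add `shift` to every coordinate in a Genbank location string.

    Hypothetical helper: operators such as complement() and join(), and
    the partial markers < and >, are left untouched.
    """
    def bump(match):
        return str(int(match.group()) + shift)

    return re.sub(r"\d+", bump, location)


# Gene positions after the first 400 nt have been removed.
print(shift_location("401..1200", -400))
print(shift_location("complement(join(2001..2500,<2600..>2900))", -400))
```

A complete tool would also need to check for features that fall partly or wholly inside the removed region, which this sketch does not attempt.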

genbankToTbl.py

Download
This script was designed to generate a fasta file and a 5-column, tab-delimited ‘feature table’ used for entering annotations into Sequin and tbl2asn.
The script takes in an annotated Genbank file and creates the feature table using the feature locations and qualifiers approved by the International Nucleotide Sequence Database Collaboration. The format for the feature table can be found at <http://www.ncbi.nlm.nih.gov/Sequin/table.html>
The usage of the script is as follows:

  python genbankToTbl.py input_genbank_filename.gb

The script produces the following output:

  seq.fsa, seq.tbl
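
For orientation, here is a hedged Python 3 sketch of how the 5-column feature table layout can be written. It is not the code of genbankToTbl.py and does not parse Genbank files; the function name feature_table, the dictionary layout of each feature, and the sample gene and product names are invented for this example. It assumes the usual layout of the format: a ">Feature" header line, a line with start, stop, and feature key in the first three columns, and qualifier lines with the name and value in columns four and five. Features on the minus strand are written with the start greater than the stop.

```python
def feature_table(seq_id, features):
    """Return feature-table text for a list of simple features.

    Hypothetical input: each feature is a dict with 'key', 'start', 'end',
    'strand' ('+' or '-'), and a list of (qualifier, value) pairs.
    """
    lines = [">Feature " + seq_id]
    for feat in features:
        start, stop = feat["start"], feat["end"]
        if feat.get("strand") == "-":
            # Minus-strand features list the higher coordinate first.
            start, stop = stop, start
        lines.append("{}\t{}\t{}".format(start, stop, feat["key"]))
        for name, value in feat.get("qualifiers", []):
            # Columns 1-3 are empty on qualifier lines.
            lines.append("\t\t\t{}\t{}".format(name, value))
    return "\n".join(lines) + "\n"


# Made-up features for illustration only.
example = [
    {"key": "gene", "start": 1, "end": 300, "strand": "+",
     "qualifiers": [("gene", "abc1")]},
    {"key": "CDS", "start": 400, "end": 900, "strand": "-",
     "qualifiers": [("product", "hypothetical protein")]},
]

print(feature_table("seq1", example), end="")
```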

FindLongestOrf.pl

Download
More information about this script can be found here.