Extract oligos from fasta files

How to use OligoCounter:

OligoCounter is a Java program that analyses chromosomes or DNA sequences in pure, unannotated fasta format for overrepresented oligonucleotides 8-14 bp in length. Results are supplied as two tab-delimited text files containing the statistics and genomic positions of the located oligonucleotides.

Requirements:

This is a computationally demanding program because of the number of comparisons it has to carry out. There are roughly 360 million possible oligos of 8-14 bp (4^8 + 4^9 + ... + 4^14), so about 1.5 GB of RAM is necessary for larger genomes (tested for those up to 10 MB). Sufficient hard disk space is also required: a run across all public NCBI genomes (currently about 700) with relaxed parameters (e.g. chi-squared 100 and at least 20 occurrences per oligo) may need 20 GB for the output alone. If Java is correctly installed, OligoCounter should work on any machine that runs a Java Virtual Machine (Linux, Windows, Mac, etc.).
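The size of the search space quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch (the function name is illustrative and not part of OligoCounter) that sums 4^k over the oligo lengths 8 to 14:

```python
def count_possible_oligos(min_len=8, max_len=14):
    """Count the distinct DNA words (alphabet A, C, G, T) with lengths in the range."""
    return sum(4 ** k for k in range(min_len, max_len + 1))


total = count_possible_oligos()
print(f"{total:,} possible oligos of 8-14 bp")
```

The sum is 357,892,096. Oligos of length 14 alone account for 268,435,456 of these, which is why the longest lengths dominate the memory requirement.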

Troubleshooting:

  • If the disk runs out of space, OligoCounter crashes with a null pointer exception. Make sure at least 1 GB of disk space is free before running the program, as it is needed for the temporary files the program generates.
  • RefSeq accession numbers are parsed from the fasta header and used by the program (for example, in the output file names). The file must therefore contain a RefSeq-style header even if the genome is not from RefSeq. Such a header can be faked by copying a complete RefSeq fasta header and modifying it.

For example:

RefSeq header:

>gi|26986745|ref|NC_002947.3| Pseudomonas putida KT2440, complete genome

Modified header (the NC_XXXXXX accession is the key part, but the | characters must also be left in place):

>gi|0000000|ref|NC_111111.1| My genome description

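Writing such a header by hand is error-prone, so it may help to generate it. The sketch below is a hypothetical helper, not part of OligoCounter: it builds a header in the layout of the two examples above and checks it against a pattern inferred from them. How strictly OligoCounter itself parses the header is not documented here, so the pattern is an assumption.

```python
import re

# Pattern inferred from the example headers above; OligoCounter's real parser may differ.
REFSEQ_HEADER = re.compile(r"^>gi\|(\d+)\|ref\|(NC_\d+)\.(\d+)\| (.+)$")


def make_refseq_style_header(accession, description, gi="0000000"):
    """Build a header of the form >gi|<gi>|ref|<accession>| <description>."""
    header = f">gi|{gi}|ref|{accession}| {description}"
    if REFSEQ_HEADER.match(header) is None:
        raise ValueError(f"not a RefSeq-style header: {header!r}")
    return header


print(make_refseq_style_header("NC_111111.1", "My genome description"))
```

The returned line replaces the first line of the fasta file; the sequence lines below it stay unchanged.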
Performance:

On a Pentium 4 3600 running Linux kernel 2.4, the program needs about 3 minutes per megabase to read in, count, sort and output the common oligonucleotides.
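As a rough planning aid, the per-megabase figure can be turned into a run-time estimate. The sketch below assumes that run time scales linearly with genome length, which the quoted figure suggests but which has not been verified here; the function name is illustrative.

```python
MINUTES_PER_MEGABASE = 3.0  # figure quoted above for a Pentium 4 under Linux 2.4


def estimate_minutes(genome_length_bp, minutes_per_mb=MINUTES_PER_MEGABASE):
    """Linear extrapolation: run time is taken to be proportional to genome length."""
    return genome_length_bp / 1_000_000 * minutes_per_mb


print(estimate_minutes(6_000_000))
```

Under this assumption a 6 megabase genome takes about 18 minutes on comparable hardware; newer machines should be considerably faster.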

1#     Make sure Java 1.5 or above is installed and available on the command line. Type java at the command shell:

java

If a list of options is printed, Java is set up correctly.

2#     Put all input .fna (fasta) files in a clean working directory.

3#     Place OligoCounter.jar in the same directory as the fna files

4#     Open a shell and change to the working directory
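The preparation steps so far (Java on the command line, the jar next to the .fna files, and the free disk space mentioned under Troubleshooting) can be checked in one go before starting a long run. The following Python sketch is a hypothetical helper, not something shipped with OligoCounter; the 1 GB default comes from the troubleshooting note.

```python
import shutil
from pathlib import Path

MIN_FREE_BYTES = 1024 ** 3  # 1 GB, the free space recommended under Troubleshooting


def check_workdir(workdir, min_free_bytes=MIN_FREE_BYTES):
    """Return a list of human-readable problems; an empty list means ready to run."""
    workdir = Path(workdir)
    problems = []
    if shutil.which("java") is None:
        problems.append("java was not found on the PATH")
    if not (workdir / "OligoCounter.jar").is_file():
        problems.append("OligoCounter.jar is missing from the working directory")
    if not any(workdir.glob("*.fna")):
        problems.append("no .fna input files in the working directory")
    if shutil.disk_usage(workdir).free < min_free_bytes:
        problems.append("less than the required free disk space is available")
    return problems


for problem in check_workdir("."):
    print("WARNING:", problem)
```

Run from the working directory, the script prints one warning per unmet requirement and nothing if everything is in place.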

5#     Run OligoCounter

java -jar -Xmx1548m OligoCounter.jar

The -Xmx option sets the maximum amount of heap memory Java may use, in this case 1548 MB.
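When the same run has to be repeated on machines with different amounts of RAM, it can be convenient to build the command from a heap size. The sketch below is illustrative only; it places the JVM option before -jar, which is the conventional order, and uses the document's 1548 MB as the default.

```python
def build_command(max_heap_mb=1548, jar="OligoCounter.jar"):
    """Return the argument list for launching the jar with a given maximum heap."""
    if max_heap_mb <= 0:
        raise ValueError("max_heap_mb must be positive")
    # JVM options such as -Xmx conventionally precede -jar.
    return ["java", f"-Xmx{max_heap_mb}m", "-jar", jar]


print(" ".join(build_command()))
```

The list form can be passed directly to Python's subprocess.run; the heap size should stay below the physical memory of the machine.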

6#     The OligoCounter text menu is displayed:

Enter a number to change an option
Options:
0: Lower oligo frequency threshold: 70
1: Chi squared significance threshold: 3000
2: Second Chi squared significance threshold: 0
3: Heuristic version, saves memory, 1 indicates on, 0 off: 0
4: Heuristic frequently removes oligos with counts below : 2
8: Help
9 to exit this menu and start the program

Enter the relevant number to change an option and press Enter, then enter a new value for that option. A chi-squared value of 500 might be a good place to start for most genomes. See the help for more details on the options:

8

7#     For example, enter 0 to change the Lower oligo frequency threshold

0

Enter a new value for the lower threshold: oligos which occur less frequently than this value in the genome will be ignored (after all oligos in the whole genome have been counted) in subsequent analysis steps:

20
Enter a number to change an option
Options:
0: Lower oligo frequency threshold: 20
1: Chi squared significance threshold: 3000
2: Second Chi squared significance threshold: 0
3: Heuristic version, saves memory, 1 indicates on, 0 off: 0
4: Heuristic frequently removes oligos with counts below : 2
8: Help
9 to exit this menu and start the program

Explanation of all OligoCounter parameters

8#     When all values are set to your satisfaction, enter 9 to start the program:

9
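The menu dialogue in steps 6 to 8 is a fixed sequence of keystrokes: an option number, its new value, and finally 9. The sketch below generates that sequence from a mapping of option numbers to values. It is a hypothetical helper, and it assumes that OligoCounter reads the menu from standard input so that the text could be piped in; that behaviour has not been verified here.

```python
def menu_input(changes):
    """Build the text to type at the menu: option number, new value, ..., then 9."""
    lines = []
    for option, value in changes.items():
        if option not in (0, 1, 2, 3, 4):
            raise ValueError(f"unknown menu option: {option}")
        lines += [str(option), str(value)]
    lines.append("9")  # 9 leaves the menu and starts the run
    return "\n".join(lines) + "\n"


# Lower frequency threshold 20, chi-squared threshold 500, then start.
print(menu_input({0: 20, 1: 500}), end="")
```

If the assumption holds, the returned string could be supplied as the input argument of subprocess.run (with text=True) to start unattended runs.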

9#     OligoCounter now processes every fasta file with a .fna extension in the working directory.

10#     Output files (tab-delimited text) are created in the same directory

File names consist of a prefix followed by a suffix:

resultsStats + suffix
resultsPositions + suffix

The suffix has the form RefSeqGenus_species_chiSq.txt (RefSeq accession, organism name and chi-squared threshold),
e.g. NC_007005Pseudomonas_syrin_3000.txt
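For batch runs over many genomes it can be useful to recover the accession and threshold from an output file name. The sketch below is a hypothetical parser whose pattern is inferred from the single example above, so the real naming rule may be looser than assumed.

```python
import re

# Inferred from the single example above; the real naming rule may be looser.
OUTPUT_NAME = re.compile(
    r"^(resultsStats|resultsPositions)(NC_\d+)(.+)_(\d+)\.txt$"
)


def parse_output_name(filename):
    """Split an output file name into (kind, accession, organism, chi_squared)."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised output file name: {filename!r}")
    kind, accession, organism, chi_sq = match.groups()
    return kind, accession, organism, int(chi_sq)


print(parse_output_name("resultsStatsNC_007005Pseudomonas_syrin_3000.txt"))
```

For the example name this yields the kind resultsStats, the accession NC_007005, the organism Pseudomonas_syrin and the threshold 3000.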

Stats files contain each oligo, the number of times it is present, and its statistics.

Positions files contain each oligo and the position where each instance of that oligo begins.

Fasta output files simply contain the oligos in fasta format. These can be useful for multiple sequence alignments.
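The tab-delimited stats and positions files can be loaded into a script for further filtering. The following minimal Python sketch reads such a file into rows of strings. It deliberately makes no assumption about the order or meaning of the columns, which should be checked against an actual output file; the function name is illustrative.

```python
import csv


def read_tab_delimited(path):
    """Return the non-empty rows of a tab-delimited text file as lists of strings."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter="\t") if row]
```

A call such as read_tab_delimited("resultsStatsNC_007005Pseudomonas_syrin_3000.txt") returns one list per line of the file, with numeric fields still as text until they are converted explicitly.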