OligoCounter parameters

Parameters in the OligoCounter text menu:

Enter a number to change an option
Options:
0: Lower oligo frequency threshold: 70
1: Chi squared significance threshold: 3000
2: Second Chi squared significance threshold: 0
3: Heuristic version, saves memory, 1 indicates on, 0 off: 0
4: Heuristic frequently removes oligos with counts below : 2
6: Create fasta output files - 1 indicates on, 0 off: 0
8: Help
10: Report positions of Ns: 0
9 to exit this menu and start the program

For a quick explanation, see the example at the bottom of this page.

Lower oligo frequency threshold

This is the minimum number of instances an oligo needs to have (i.e. times it must occur in the genome) to be included in further analysis.

This is the first threshold to be applied to the data. Set this threshold very low, perhaps 10 or 20, if you are analysing small genomes (i.e. smaller than 1 Mb).
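The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not OligoCounter's own code: the function names, the oligo length k, the skipping of windows containing N, and the use of a strict "more than" comparison (following the wording of the readout further down this page) are all assumptions.

```python
from collections import Counter

def count_oligos(sequence, k):
    """Count every overlapping oligo of length k (k is an assumed parameter)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        oligo = sequence[i:i + k]
        # Assumption: windows containing an ambiguous base (N) are not counted.
        if "N" not in oligo:
            counts[oligo] += 1
    return counts

def apply_lower_threshold(counts, threshold):
    """Keep only oligos that occur more than `threshold` times (assumed strict)."""
    return {oligo: n for oligo, n in counts.items() if n > threshold}

counts = count_oligos("ACGACGACGT", k=3)
frequent = apply_lower_threshold(counts, threshold=2)
print(frequent)
```

In this toy sequence only ACG occurs more than twice, so it is the only oligo passed on to the chi-squared step.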

Chi squared significance threshold

OligoCounter typically finds millions of oligos in a genome. To decide which of these can be termed overrepresented, we used a statistical approach with chi-squared.

Chi-squared statistics were calculated according to the formula: 

(observed count – expected count) ^ 2 / expected count. 

Expected counts E of an oligonucleotide in a genome were derived from a zero-order Markov model:

E = N * A^a * C^c * G^g * T^t

where 

N is the genome size in nucleotides

A is the proportion of adenine in the genome and 

a is the number of adenines in the oligo (and so on for C, G and T)

The chi-squared statistic is not intended as an indicator of statistical significance, but merely of the level of overrepresentation of each oligo; otherwise Bonferroni corrections for multiple tests would also have been carried out.
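As a minimal sketch of the two formulas above, the Python below computes the expected count E from the base proportions and then the chi-squared value for one oligo. The function names are hypothetical and this is not taken from the OligoCounter source; multiplying the proportion of each base along the oligo is equivalent to A^a * C^c * G^g * T^t.

```python
def base_proportions(sequence):
    """Proportion of each of A, C, G, T among the unambiguous bases."""
    sequence = sequence.upper()
    base_counts = {base: sequence.count(base) for base in "ACGT"}
    total = sum(base_counts.values())
    return {base: n / total for base, n in base_counts.items()}

def expected_count(oligo, genome_size, proportions):
    """Zero-order Markov expectation: E = N * A^a * C^c * G^g * T^t."""
    expected = float(genome_size)
    for base in oligo.upper():
        expected *= proportions[base]
    return expected

def chi_squared(observed, expected):
    """(observed count - expected count) ^ 2 / expected count."""
    return (observed - expected) ** 2 / expected

# Toy numbers: equal base proportions and a genome of 1024 nucleotides.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
e = expected_count("ACGT", 1024, uniform)
print(e, chi_squared(10, e))
```

With equal base proportions, a 4-mer is expected 1024 / 256 = 4 times, so an observed count of 10 gives a chi-squared value of 36 / 4 = 9.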

Second Chi squared significance threshold

As above. This option allows the chi-squared data to be filtered at a second threshold level. Its advantage is that the genome does not have to be read in again: producing two sets of output files after reading the genome in once is much faster than running OligoCounter twice, each time with one chi-squared threshold.
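The idea of reusing one pass over the genome for two thresholds can be sketched as below. This is a hedged illustration with hypothetical function names; in particular, treating a second threshold of 0 as "switched off" is an assumption based on the default shown in the menu, and the strict "above" comparison follows the wording of the readout.

```python
def filter_by_chi_squared(stats, threshold):
    """Keep oligos whose chi-squared value lies above the threshold."""
    return {oligo: chi for oligo, chi in stats.items() if chi > threshold}

def run_thresholds(stats, first_threshold, second_threshold=0):
    """Filter the same in-memory statistics once or twice.

    `stats` maps oligo -> chi-squared value and is computed only once,
    so the genome is never re-read for the second run.
    Assumption: a second threshold of 0 means "do not produce a second set".
    """
    runs = [filter_by_chi_squared(stats, first_threshold)]
    if second_threshold > 0:
        runs.append(filter_by_chi_squared(stats, second_threshold))
    return runs

# Made-up chi-squared values, for illustration only.
stats = {"AAAA": 50.0, "CCCC": 150.0, "GGGG": 3500.0}
for number, kept in enumerate(run_thresholds(stats, 3000, 100), start=1):
    print("run", number, sorted(kept))
```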

Heuristic version, saves memory, 1 indicates on, 0 off

The heuristic version of OligoCounter removes oligos from memory (after every 100 kbp) while scanning through the genome if they are present fewer than x times. x is set in the next option (4).

While it saves memory, this option is mainly useful for finding only very highly overrepresented words, since it makes assumptions about the distribution of oligos that are present fewer than x times per 100 kbp.

Heuristic frequently removes oligos with counts below : 2

See the option above for details: this sets the heuristic value x.
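The pruning behaviour of the heuristic can be illustrated with the following Python sketch. It is an assumption-laden approximation rather than the program's actual implementation: the function name, the oligo length k, and the choice to prune on the running total (rather than on a per-block count) are all hypothetical.

```python
from collections import Counter

def count_oligos_heuristic(sequence, k, min_count=2, block_size=100_000):
    """Count oligos, discarding rare ones after every `block_size` positions.

    min_count plays the role of x (option 4); block_size mirrors the 100 kbp
    interval. Assumption: an oligo is dropped when its running total so far
    is below min_count at the moment of pruning.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        oligo = sequence[i:i + k]
        if "N" not in oligo:
            counts[oligo] += 1
        if (i + 1) % block_size == 0:
            # Collect the rare oligos first, then delete them, so the
            # dictionary is not modified while it is being iterated.
            rare = [o for o, n in counts.items() if n < min_count]
            for o in rare:
                del counts[o]
    return counts

# Tiny block size so that pruning is visible on a short toy sequence.
print(count_oligos_heuristic("ACGACGTTT", k=3, min_count=2, block_size=4))
```

An oligo discarded early starts again from zero if it reappears later, which is why counts for moderately frequent oligos can be underestimated and why the heuristic suits only very highly overrepresented words.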


A parameters example

OligoCounter Readout:

Settings saved, starting OligoCounter 
List of files to be analysed [AE007871_modified.fna]

>gi|16445347|gb|AE007871.1| Agrobacterium tumefaciens str. C58 plasmid Ti, complete genome
The genome file should have been read in and the number encoded intermediate results files called temp_results.txt and resultsHash.txt should have been successfully created in the same directory if no errors occurred in this process 
1198192 distinct oligos were found in the genome
3196 were found in the genome more than 10 times
Total nucleotides counted: A 46548 T 46218 G 61549 C 59792 Total 214107
run 1 chiThreshold 100
Genome GC content is : 56.673
Statistical results are now being sorted 
Statistical results are now being printed to file resultsStats.txt 
Starting the sort mechanism to sort oligos
Number of oligos above the chi-squared threshold: 22
Positional results are now being sorted by chi squared value
OligoCounter completed this genome successfully provided non-empty files were created; 
if files are empty check your input settings first (especially lower threshold) 

Explanation

1198192 distinct oligos were found in the genome
There are over one million hits: these are the unfiltered oligos found in the genome, most of which are present only once or a few times.

3196 were found in the genome more than 10 times

The "lower oligo frequency threshold" (set to 10 in this run) has now been applied and leaves roughly 3000 oligos.

run 1 chiThreshold 100

A chi squared significance threshold of 100 was used 

Number of oligos above the chi-squared threshold: 22

The applied chi-squared threshold leaves 22 oligos. These oligos and their genomic positions are then sorted and printed to file.
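As a quick consistency check on the readout, the total and the GC content can be recomputed from the four nucleotide counts it reports. The short Python snippet below is only a worked calculation using the numbers printed above; the variable names are arbitrary.

```python
# Nucleotide counts copied from the "Total nucleotides counted" line.
nucleotide_counts = {"A": 46548, "T": 46218, "G": 61549, "C": 59792}

total = sum(nucleotide_counts.values())
gc_percent = 100 * (nucleotide_counts["G"] + nucleotide_counts["C"]) / total

print(total, round(gc_percent, 3))  # 214107 56.673
```

Both values agree with the readout (Total 214107, GC content 56.673). The same counts, divided by the total, give the base proportions used for the expected counts.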