Chalmers University of Gothenburg

Supplementary material for the manuscript titled:

Bayesian Classifiers for Detecting HGT using Fixed and Variable Order Markov models of Genomic Signatures

Authors:
Daniel Dalevi, Devdatt Dubhashi and Malte Hermansson


Data sets

The list of the 28 bacteria used when comparing performance, the same as in Sandberg et al. (2001) (personal communication), is found here. All names follow the folder names on the GenBank FTP site for bacteria. The larger set of 157 bacteria, obtained by taking all genomes available at GenBank and removing some of the similar strains, is named in the same way and can be viewed here. Note that the classification does not distinguish between strains of a bacterium. For example, genes from Helicobacter_pylori_26695 are counted as correctly classified if they group with either Helicobacter_pylori_26695 or Helicobacter_pylori_J99.
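The strain-insensitive comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the species is given by the first two underscore-separated tokens of the folder name, which holds for the Helicobacter example but may not hold for every folder name. The function names are hypothetical.

```python
def species_of(folder_name: str) -> str:
    """Collapse a folder name such as 'Helicobacter_pylori_J99' to 'Helicobacter_pylori'.

    Assumption: the first two underscore-separated tokens are genus and species.
    """
    return "_".join(folder_name.split("_")[:2])


def same_group(predicted: str, source: str) -> bool:
    """True if the predicted taxon and the true source differ at most by strain."""
    return species_of(predicted) == species_of(source)


print(same_group("Helicobacter_pylori_J99", "Helicobacter_pylori_26695"))
```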

Reproducing results

To reproduce the results presented in the article, follow the links below.

The data-set of 28 taxa:
  1. Creating profiles (90% randomly selected region), here.
  2. Scoring the test-set (remaining 10%) against the profiles, here.
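The two steps above (creating a profile, then scoring sequences against it) can be illustrated for the fixed order Markov case. The sketch below is an assumption-laden toy, not the distributed classifier: the additive pseudo-count, the uniform prior over taxa and all function names are illustrative choices.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def build_profile(seq, order, pseudo=1.0):
    """Count next-base occurrences for every context of length `order`."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        # Skip positions touching ambiguous bases (e.g. N).
        if nxt in ALPHABET and all(c in ALPHABET for c in ctx):
            counts[ctx][nxt] += 1
    return {"order": order, "pseudo": pseudo, "counts": dict(counts)}


def log_score(seq, profile):
    """Log-likelihood of `seq` under the profile, smoothed with pseudo-counts."""
    order, pseudo, counts = profile["order"], profile["pseudo"], profile["counts"]
    total = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if nxt not in ALPHABET:
            continue
        row = counts.get(ctx, {})  # unseen context falls back to uniform
        num = row.get(nxt, 0) + pseudo
        den = sum(row.values()) + pseudo * len(ALPHABET)
        total += math.log(num / den)
    return total


def classify(seq, profiles):
    """Return the profile name with the highest score (uniform prior assumed)."""
    return max(profiles, key=lambda name: log_score(seq, profiles[name]))


# Toy training sequences, not real genomes.
profiles = {
    "gc_rich": build_profile("GC" * 200, order=2),
    "at_rich": build_profile("AT" * 200, order=2),
}
print(classify("GCGCGCGCGC", profiles))
```

With a uniform prior, picking the largest log-likelihood is the same as picking the largest posterior, which is why no prior term appears in the score.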

The data-set of 157 bacteria:
  1. Creating profiles. Run this bash-script.
  2. Score against profiles. Run this bash-script.
Note that both creating the profiles and scoring can be split over smaller lists, which saves time if a multi-processor machine is available. The full list of all 157 bacteria must still be specified with the "-f_f" used in the "-sap" option; the loop in the bash script, however, can work on a subset of the list.
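One way to prepare such sublists is sketched below. The helper name and the placeholder taxon names are hypothetical; the sketch only splits a list of names and does not call the classifier itself.

```python
def split_into_chunks(names, n_chunks):
    """Split `names` into at most `n_chunks` contiguous, near-equal sublists."""
    n_chunks = max(1, min(n_chunks, len(names)))
    size, extra = divmod(len(names), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional name each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(names[start:end])
        start = end
    return chunks


# Placeholder names standing in for the 157 folder names.
names = [f"taxon_{i}" for i in range(157)]
chunks = split_into_chunks(names, 4)
print([len(c) for c in chunks])
```

Each sublist can then be handed to its own loop on a separate processor, while the full list is still passed to the program as described above.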

Comparing performance

The values obtained when comparing performance on the data-sets of 28 and 157 bacteria (Figure 1) can be viewed here: (a), (b), (c), (d), (e) and (f).

False negatives I

The list of backbone genes. Note, this test was only performed on the smaller dataset of 28 taxa and the six plasmids from the IncP-β group.

False negatives II

The genes in this list, all "positional conserved" between E. coli and Salmonella and longer than 1000 nucleotides, were classified into the groups of 157 bacteria. For further details about the data we refer to Koski et al. (2001). All hits to Escherichia coli, Salmonella typhi and Shigella flexneri were considered correctly classified. Shigella flexneri is considered by many to be the same species as Escherichia coli, e.g. Brenner (1984).
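The acceptance rule above can be written down as a small check. This is a hedged sketch only: the input format (pairs of gene length and best-hit folder name), the folder-name convention and the function names are assumptions, and the example hits are invented for illustration.

```python
# Species whose hits count as correct, as stated in the text above.
ACCEPTED = {"Escherichia_coli", "Salmonella_typhi", "Shigella_flexneri"}
MIN_LENGTH = 1000  # only genes longer than this were classified


def is_correct(best_hit: str) -> bool:
    """Assumes the first two underscore-separated tokens name the species."""
    return "_".join(best_hit.split("_")[:2]) in ACCEPTED


def false_negative_rate(results):
    """Fraction of eligible genes whose best hit falls outside ACCEPTED.

    `results` is a list of (gene_length, best_hit_folder_name) pairs.
    """
    eligible = [hit for length, hit in results if length > MIN_LENGTH]
    if not eligible:
        return 0.0
    wrong = sum(1 for hit in eligible if not is_correct(hit))
    return wrong / len(eligible)


# Invented example hits, not results from the article.
example = [
    (1500, "Escherichia_coli_K12"),
    (1200, "Shigella_flexneri_2a"),
    (2000, "Yersinia_pestis_CO92"),
    (900, "Yersinia_pestis_CO92"),  # too short, ignored
]
print(false_negative_rate(example))
```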

Simulations

Results from simulations of artificial bacterial genomes were left out of the article due to space limitations (Section 3.5). The figure shows a comparison of the three methods on simulated data. The size of each model, in terms of number of parameters, corresponds to that of tetramers. On these simulated data the variable length Markov model (vlm3) captures dependencies in the data more efficiently: it outperforms both the naive (n3) and the fixed order Markov (m3) models. The additional patterns and bacteria used in training are found here.

Source code to produce software binaries

Visit the software site here, or, simply,
  1. Download the tar-archive, here.
  2. gunzip classifier.tar.gz
  3. tar xvf classifier.tar
  4. cd classifier
  5. make
  6. If you encounter problems, feel free to send an email to dalevi@cs.chalmers.se

Speed of program

The following running times were obtained on a Compaq laptop (N800C P4-2 2GHz 40GB 256MB). The data-set was a sequence of 3954 nucleotides. The running time depends strongly on the settings and on the size of the sequences; larger sequences, such as 90% of a bacterial genome, will require more time.
Settings                                                           Time
Creating a fixed profile (order = 1)                               0.031s
Creating a fixed profile (order = 2)                               0.046s
Creating a fixed profile (order = 3)                               0.079s
Creating a fixed profile (order = 4)                               0.193s
Creating a fixed profile (order = 5)                               0.540s
Creating a fixed profile (order = 10)                              3.613s
Creating a PST (number parameters = 12, kmax=10, pseudo-counts)    1.781s
Creating a PST (number parameters = 48, kmax=10, pseudo-counts)    1.896s
Creating a PST (number parameters = 192, kmax=10, pseudo-counts)   1.869s
Creating a PST (number parameters = 768, kmax=10, pseudo-counts)   1.951s
Creating a PST (number parameters = 3072, kmax=10, pseudo-counts)  2.349s
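The PST parameter counts listed above (12, 48, 192, 768, 3072) coincide with the number of free parameters of fixed order Markov models of orders 1 to 5 over a four-letter alphabet: each of the 4^k contexts has 3 free probabilities. The short check below shows the arithmetic; reading the PST sizes as deliberately matched to these orders is an interpretation, not something stated on this page.

```python
def free_parameters(order: int, alphabet_size: int = 4) -> int:
    """Free parameters of a fixed order Markov model.

    Each of the alphabet_size**order contexts has (alphabet_size - 1)
    independent transition probabilities, since each row sums to one.
    """
    return (alphabet_size - 1) * alphabet_size ** order


print([free_parameters(k) for k in range(1, 6)])  # [12, 48, 192, 768, 3072]
```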


Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
Telephone: +46-(0)31 772 1044; Fax: +46-(0)31 165655

Last Modified: 18 January 2006
dalevi@chalmers.se
Chalmers University of Technology Göteborg University