QuickParanoid is similar to MultiParanoid and OrthoMCL in functionality, but is much faster. For example, it takes only 199.56 seconds on an Intel 2.4 GHz machine with 1 gigabyte of memory to process a dataset of 120 species (which contains 14403218 entries in 120 * 119 / 2 = 7140 InParanoid output files with a total size of 365.38 megabytes). In comparison, MultiParanoid on the same machine fails to process a dataset of 20 species (which contains 319368 entries), and OrthoMCL fails to process a dataset of 60 species (which contains 3245394 entries). Its accuracy is also comparable to that of MultiParanoid and OrthoMCL. For example, for a dataset of 3 species, QuickParanoid finds 135 of the 221 clusters in the manually curated data, while MultiParanoid and OrthoMCL find 137 clusters and 98 clusters, respectively.
The following table shows the speed and memory usage of the three programs on eight different datasets. All experiments were performed on an Intel 2.4 GHz machine running Debian Linux 2.6.21-6 with 1 gigabyte of memory. Memory usage for MultiParanoid and OrthoMCL was measured using top. An entry of ∞ marks a run in which the program failed to process the dataset, and - marks a value that is not available. Note that in the experiment with 120 species, QuickParanoid finds a cluster consisting of sequences from all 120 species!
Number of species | Dataset size | Number of entries in the dataset | QuickParanoid [summary] (running time, memory usage, number of clusters found) | MultiParanoid (running time, memory usage, number of clusters found) | OrthoMCL (running time, memory usage, number of clusters found) | Number of clusters found by QuickParanoid and MultiParanoid | Number of clusters found by QuickParanoid and OrthoMCL |
5 | 0.38 Mbytes | 14664 | 0.15 seconds, 600 Kbytes, 2208 clusters | 2.03 seconds, 38 Mbytes, 2293 clusters | 35 seconds, 31.00 Mbytes, 2787 clusters | 2091 clusters | 1372 clusters |
10 | 1.86 Mbytes | 71035 | 0.51 seconds, 1584 Kbytes, 3034 clusters | 48.78 seconds, 140 Mbytes, 3218 clusters | 175 seconds, 61.75 Mbytes, 4466 clusters | 2737 clusters | 1882 clusters |
15 | 4.24 Mbytes | 164173 | 1.09 seconds, 3117 Kbytes, 4242 clusters | 5107.86 seconds, 314 Mbytes, 4515 clusters | 600 seconds, 112.45 Mbytes, 6849 clusters | 3767 clusters | 2751 clusters |
20 | 8.19 Mbytes | 319368 | 2.75 seconds, 4764 Kbytes, 4934 clusters | ∞, -, - | 1150 seconds, 186.50 Mbytes, 8477 clusters | - | 3259 clusters |
40 | 35.14 Mbytes | 1407029 | 13.10 seconds, 28655 Kbytes, 9003 clusters | ∞, -, - | 13513 seconds, 722.75 Mbytes, 18686 clusters | - | 5539 clusters |
60 | 81.48 Mbytes | 3245394 | 40.32 seconds, 94425 Kbytes, 18830 clusters | ∞, -, - | - | - | - |
90 | 225.07 Mbytes | 8865949 | 84.12 seconds, 245024 Kbytes, 27199 clusters | ∞, -, - | - | - | - |
120 | 365.38 Mbytes | 14403218 | 199.56 seconds, 335972 Kbytes, 29379 clusters | ∞, -, - | - | - | - |
Here is the result of testing the accuracy of the three programs using a dataset of three species (human, fly, worm) for which manually curated data are available. Each entry in the table denotes the number of clusters.
Manually curated data (A) | QuickParanoid (B) | MultiParanoid (C) | OrthoMCL (D) | A ∩ B | A ∩ C | A ∩ D | A ∩ B ∩ C | A ∩ B ∩ D |
221 | 5620 | 5722 | 5635 | 135 | 137 | 98 | 135 | 92 |
Among the 5620 clusters found by QuickParanoid, 98.52% (5537 clusters) exactly match clusters found by MultiParanoid. The following graph shows the distribution of clusters found by QuickParanoid against their similarity with those found by MultiParanoid (on a logarithmic scale). A similarity value p means that a fraction p of the sequences in a cluster found by QuickParanoid are included in some similar cluster found by MultiParanoid. p = 1.0 means that exactly the same cluster is found by both QuickParanoid and MultiParanoid.
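As a rough illustration of the similarity value p described above, the following C sketch computes the fraction of sequences in one cluster that also appear in another. It is only one possible reading of the definition, not the code used to produce the graph: the function names and the representation of a cluster as an array of distinct integer sequence ids are assumptions made for this example.

```c
#include <stddef.h>

/* Returns the fraction of the ids in q (a QuickParanoid cluster, nq ids)
   that also occur in m (a MultiParanoid cluster, nm ids).
   Assumption: ids within one cluster are distinct. */
double cluster_similarity(const int *q, size_t nq, const int *m, size_t nm)
{
    size_t shared = 0;

    if (nq == 0)
        return 0.0;
    for (size_t i = 0; i < nq; i++) {
        for (size_t j = 0; j < nm; j++) {
            if (q[i] == m[j]) {
                shared++;
                break;
            }
        }
    }
    return (double)shared / (double)nq;
}

/* Returns 1 if the two clusters contain exactly the same ids, 0 otherwise.
   Every id of q must occur in m, and the sizes must agree. */
int clusters_identical(const int *q, size_t nq, const int *m, size_t nm)
{
    return nq == nm && nq > 0 && cluster_similarity(q, nq, m, nm) == 1.0;
}
```

Under this reading, a fraction of 1.0 alone only shows that the first cluster is contained in the second, so the sketch also compares the sizes before reporting an exact match. To score a cluster against a whole result set, one would take the maximum of cluster_similarity over the candidate clusters.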
    [gla@plquad:40 ] make qa
    g++ -o qa1 qa1.cpp
    gcc -c hashtable.c -o hashtable.o
    g++ -o qa2 hashtable.o qa2.cpp

If successful, you are ready to use QuickParanoid. If not, make sure that both gcc and make are installed on your system, and edit Makefile appropriately (two variables CC and CPP in it).
    [gla@plquad:43 ] qp
    =====================================================
    QuickParanoid
    =====================================================
    Dataset directory [default = "." (current directory)]: fly-human-worm
    Data file prefix [default = "sqltable."]:
    Data file separator [default = "-"]:
    Configuration file [default = "fly-human-worm/config"]:
    Executable file prefix [default = "test"]:
    Generating a header file.....
    Updating Makefile.....
    Generating executable files......
    g++ -o dump dump.cpp
    ./dump fly-human-worm/config
    Reading the config file
    Reading the data files
    ...
    g++ -o gen_header gen_header.cpp
    ./gen_header fly-human-worm/config __ortholog.h
    Reading the config file
    Reading the data files
    Opening fly-human-worm/sqltable.fly2k-human2k
    Opening fly-human-worm/sqltable.fly2k-worm2k
    Opening fly-human-worm/sqltable.human2k-worm2k
    Generating structure definitions
    Generating functions
    Generating species
    Generating sequences
    gcc -c ortholog.c -o ortholog.o
    gcc -c hashtable_itr.c -o hashtable_itr.o
    gcc -o test hashtable.o hashtable_itr.o ortholog.o qp.c -lm
    gcc -o tests qps.c
    Done.
    Run test to perform ortholog clustering.
    Run tests to see the dataset size and the number of entries.
    [gla@plquad:44 ] test > result.txt
    [gla@plquad:45 ] tests
    Dataset size in bytes: 1404502
    Number of entries in the dataset: 44752
Each line in the standard InParanoid output format consists of (1) cluster ID number; (2) InParanoid score; (3) species name; (4) seed score; (5) sequence name, as in the following example:
    3 4425 fly2k 1.000 gi7303993

Optionally each line may contain a bootstrap value in an extra column after the sequence name, as in the following example:

    11 3607 ensMONDO.fa 1.000 ENSMODP00000019772 100%

Bootstrap values are not used by QuickParanoid and are ignored.
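The five-column line format described above can be read with a single sscanf call. The sketch below is an illustration only and is not taken from the QuickParanoid sources: the struct, its field names, the 63-character limit on names, and the function name are all assumptions made for this example.

```c
#include <stdio.h>

/* One parsed line of InParanoid output (illustrative field names). */
typedef struct {
    int    cluster_id;    /* (1) cluster ID number */
    int    score;         /* (2) InParanoid score  */
    char   species[64];   /* (3) species name      */
    double seed;          /* (4) seed score        */
    char   sequence[64];  /* (5) sequence name     */
} inparanoid_entry;

/* Parses one line into *e.
   Returns 1 on success and 0 if fewer than five fields could be read.
   An optional sixth column (the bootstrap value) is never read,
   which matches the statement that bootstrap values are ignored. */
int parse_inparanoid_line(const char *line, inparanoid_entry *e)
{
    int fields = sscanf(line, "%d %d %63s %lf %63s",
                        &e->cluster_id, &e->score,
                        e->species, &e->seed, e->sequence);
    return fields == 5;
}
```

Calling parse_inparanoid_line on either of the two example lines above fills the same five fields; the trailing 100% on the second line is left unread.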
    [gla@plquad:49 ] cat fly-human-worm/config
    fly2k
    human2k
    worm2k

You may list the names of species in any order as long as you provide a dataset of N(N-1)/2 files for N species. The order of species does not affect the result of ortholog clustering. The only exception is that it may change the type of tree conflict (different-by-numbers or different-by-names) for those clusters with both types of tree conflicts, since QuickParanoid reports whichever type of tree conflict is detected first.
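A configuration file of this kind can be read with a short loop. The sketch below assumes that the file contains nothing but species names separated by whitespace (spaces or newlines); that assumption, the function name, and the 63-character limit on a name are made for illustration and do not come from the QuickParanoid sources.

```c
#include <stdio.h>

#define MAX_NAME 64   /* assumed upper bound on a species name, including '\0' */

/* Reads whitespace-separated species names from in into names.
   Returns the number of names read, which is at most max. */
int read_species_names(FILE *in, char names[][MAX_NAME], int max)
{
    int n = 0;

    while (n < max && fscanf(in, "%63s", names[n]) == 1)
        n++;
    return n;
}
```

For the configuration file shown above, the function would return 3, and a dataset for those species would be expected to hold 3 * 2 / 2 = 3 data files.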
QuickParanoid assumes that all data files reside in the dataset directory and that every data file in it has a name concatenating the data file prefix, a species name, the data file separator, and another species name in that order. It also assumes that a configuration file can be located. All executable files begin with the executable file prefix. For example, with the settings

    Dataset directory [default = "." (current directory)]: fly-human-worm
    Data file prefix [default = "sqltable."]:
    Data file separator [default = "-"]:
    Configuration file [default = "fly-human-worm/config"]:
    Executable file prefix [default = "test"]:

the data files reside in directory fly-human-worm, and sqltable.fly2k-human2k is the data file corresponding to species fly2k and human2k; a configuration file fly-human-worm/config is used, and two programs test and tests are generated. For the sake of parsing file names correctly, no species name should contain the data file separator.
For each pair of species, the dataset must have at least one InParanoid output file. For example, if the data file prefix is sqltable. and the data file separator is -, two species A and B must have at least one of the InParanoid output files sqltable.A-B and sqltable.B-A.
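The naming rules above can be checked before running qp. The following C sketch is a hypothetical helper, not part of QuickParanoid: all four function names and the 1024-byte path buffer are assumptions. It counts the expected species pairs, builds a data file name from the prefix, separator, and two species names, rejects species names that contain the separator, and tests whether at least one of the two possible files for a pair exists.

```c
#include <stdio.h>
#include <string.h>

/* Number of species pairs for n species: n * (n - 1) / 2. */
long num_species_pairs(long n)
{
    return n * (n - 1) / 2;
}

/* Returns 1 if name does not contain the data file separator, 0 otherwise. */
int species_name_is_valid(const char *name, const char *sep)
{
    return strstr(name, sep) == NULL;
}

/* Writes "<dir>/<prefix><a><sep><b>" into buf.
   Returns 1 if the whole name fits into buf, 0 otherwise. */
int data_file_name(char *buf, size_t size, const char *dir, const char *prefix,
                   const char *a, const char *sep, const char *b)
{
    int n = snprintf(buf, size, "%s/%s%s%s%s", dir, prefix, a, sep, b);
    return n >= 0 && (size_t)n < size;
}

/* Returns 1 if at least one of the two possible data files for the pair
   (a, b) can be opened for reading, 0 otherwise. */
int pair_has_data_file(const char *dir, const char *prefix, const char *sep,
                       const char *a, const char *b)
{
    char name[1024];
    const char *first[2]  = { a, b };
    const char *second[2] = { b, a };

    for (int i = 0; i < 2; i++) {
        FILE *f;
        if (!data_file_name(name, sizeof name, dir, prefix,
                            first[i], sep, second[i]))
            continue;
        f = fopen(name, "r");
        if (f != NULL) {
            fclose(f);
            return 1;
        }
    }
    return 0;
}
```

For the 120-species dataset, num_species_pairs(120) gives the 7140 pairs mentioned earlier, and calling pair_has_data_file for every pair of names in the configuration file would report any pair with no data file.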
    [gla@plquad:50 ] qa1
    Usage: qa1

The output of qa2 is self-explanatory. [analysis1.txt, analysis2.txt]
Uncompressing these files creates a subdirectory fly-human-worm. Manual.out is the manually curated data from the three species. It contains a total of 221 clusters. You may ignore the fourth column (is_seed_ortholog) to the last column (tree_conflict).
This dataset is provided by the authors of MultiParanoid (Andrey Alexeyenko, Ivica Tamas, Gang Liu, and Erik L.L. Sonnhammer (2006). Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22: e9-e15).
Uncompressing the tar file creates a subdirectory 120species. It contains 8 configuration files config.5, config.10, config.15, config.20, config.30, config.60, config.90, and config.120. The suffix of each configuration file indicates the number of species in it.
This dataset was built from data available at NCBI GenBank (ftp://ftp.ncbi.nlm.nih.gov/genomes) using InParanoid.
If you follow the instructions in the Usage section to specify the dataset directory, the data file prefix, etc., a new header file __ortholog.h is created. The part of __ortholog.h that declares structures for reading InParanoid output files is as follows:
    // sequence ---------------------------------------
    typedef struct{
        int species_id;
        double seed;
        int sequence_id;
    } sequence;

    // cluster ----------------------------------------
    typedef struct{
        int score;
        int num_of_sequences;
        sequence *sequences;
    } cluster;

    // dataFile ----------------------------------------
    // invariant: species_id1 must be prior to species_id2 in the species table.
    typedef struct{
        int species_id1;
        int species_id2;
        int num_of_clusters;
        cluster *clusters;
    } dataFile;

    //-------------------------------------------------------
    // load_dataFile takes a pair of species ids and a dataFile
    // and parses and loads the data file of two species.
    // invariant: species_id1 < species_id2
    void load_dataFile(int species_id1, int species_id2, dataFile* data);

    // free_dataFile takes a pointer of a datafile and frees all memory.
    void free_dataFile(dataFile* data);
    d = (dataFile*)malloc(sizeof(dataFile));
    load_dataFile(s1, s2, d);
    // num of species
    #define NUM_OF_SPECIES 3

    // species id
    // invariant: species id starts at 0.
    #define _fly2k 0
    #define _human2k 1
    #define _worm2k 2

    // species table
    static const int species_table [] = {_fly2k, _human2k, _worm2k};

    // species names
    // usage: species_names [species id]
    static const char* species_names [] = {"fly2k", "human2k", "worm2k"};

    // num of sequence
    #define NUM_OF_SEQUENCES 28494

    // sequence names
    // usage: sequence_names [sequence id]
    char sequence_names[NUM_OF_SEQUENCES][15];
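To show how a loaded data file might be traversed, the sketch below walks the clusters and sequences of a dataFile. So that it compiles on its own, it repeats the three structure declarations shown above; a real program would include __ortholog.h instead and fill the structure with load_dataFile. The two function names and the choice to pass the species name table as a parameter are assumptions made for this example.

```c
#include <stdio.h>

/* Local copies of the declarations from __ortholog.h, repeated here only
   so that this sketch is self-contained. */
typedef struct { int species_id; double seed; int sequence_id; } sequence;
typedef struct { int score; int num_of_sequences; sequence *sequences; } cluster;
typedef struct {
    int species_id1;
    int species_id2;
    int num_of_clusters;
    cluster *clusters;
} dataFile;

/* Counts the sequences in d that belong to the species with the given id. */
int count_sequences_of_species(const dataFile *d, int species_id)
{
    int count = 0;

    for (int i = 0; i < d->num_of_clusters; i++) {
        const cluster *c = &d->clusters[i];
        for (int j = 0; j < c->num_of_sequences; j++) {
            if (c->sequences[j].species_id == species_id)
                count++;
        }
    }
    return count;
}

/* Prints one line per sequence: cluster index, cluster score, species name,
   seed score, and sequence id. names is indexed by species id. */
void print_data_file(const dataFile *d, const char *const names[], FILE *out)
{
    for (int i = 0; i < d->num_of_clusters; i++) {
        const cluster *c = &d->clusters[i];
        for (int j = 0; j < c->num_of_sequences; j++) {
            const sequence *s = &c->sequences[j];
            fprintf(out, "%d %d %s %.3f %d\n",
                    i, c->score, names[s->species_id], s->seed, s->sequence_id);
        }
    }
}
```

With the generated header, print_data_file(d, species_names, stdout) would list every sequence of the data file loaded into d, and free_dataFile(d) would release the memory afterwards.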
Your program should include __ortholog.h. An easy way to do this is to include the following two lines:
    #include "qp.h"
    #include INTERMEDIATE_HEADER_FILE
You can compile your program in the same way that qp.c is compiled. If your program is foo.c, you can compile it as follows:
    gcc hashtable.o hashtable_itr.o ortholog.o foo.c -lm

That's it!
Programming Language Laboratory, Department of Computer Science and Engineering, Pohang University of Science and Technology, Republic of Korea