QuickParanoid is similar to MultiParanoid and OrthoMCL in functionality, but is much faster. For example, it takes only 199.56 seconds on an Intel 2.4 GHz machine with 1 gigabyte of memory to process a dataset of 120 species (which contains 14403218 entries in 120 * 119 / 2 = 7140 InParanoid output files with a total size of 365.38 megabytes). In comparison, MultiParanoid on the same machine fails to process a dataset of 20 species (which contains 319368 entries), and OrthoMCL fails to process a dataset of 60 species (which contains 3245394 entries). Its accuracy is also comparable to that of MultiParanoid and OrthoMCL. For example, for a dataset of 3 species, QuickParanoid finds 135 of the 221 clusters in the manually curated data, while MultiParanoid and OrthoMCL find 137 clusters and 98 clusters, respectively.
Here are the results of testing the speed and memory usage of the three programs on eight different datasets. All experiments were performed on an Intel 2.4 GHz machine running Debian Linux 2.6.21-6 with 1 gigabyte of memory. Memory usage for MultiParanoid and OrthoMCL was measured using top. In the table, ∞ marks a dataset that the program failed to process, and - marks a value that is not available. Note that in the experiment with 120 species, QuickParanoid finds a cluster consisting of sequences from all 120 species!
Number of species | Dataset size | Number of entries in the dataset | QuickParanoid (running time, memory usage, number of clusters found) | MultiParanoid (running time, memory usage, number of clusters found) | OrthoMCL (running time, memory usage, number of clusters found) | Number of clusters found by both QuickParanoid and MultiParanoid | Number of clusters found by both QuickParanoid and OrthoMCL |
5 | 0.38 Mbytes | 14664 | 0.15 seconds, 600 Kbytes, 2208 clusters | 2.03 seconds, 38 Mbytes, 2293 clusters | 35 seconds, 31.00 Mbytes, 2787 clusters | 2091 clusters | 1372 clusters |
10 | 1.86 Mbytes | 71035 | 0.51 seconds, 1584 Kbytes, 3034 clusters | 48.78 seconds, 140 Mbytes, 3218 clusters | 175 seconds, 61.75 Mbytes, 4466 clusters | 2737 clusters | 1882 clusters |
15 | 4.24 Mbytes | 164173 | 1.09 seconds, 3117 Kbytes, 4242 clusters | 5107.86 seconds, 314 Mbytes, 4515 clusters | 600 seconds, 112.45 Mbytes, 6849 clusters | 3767 clusters | 2751 clusters |
20 | 8.19 Mbytes | 319368 | 2.75 seconds, 4764 Kbytes, 4934 clusters | ∞, -, - | 1150 seconds, 186.50 Mbytes, 8477 clusters | - | 3259 clusters |
40 | 35.14 Mbytes | 1407029 | 13.10 seconds, 28655 Kbytes, 9003 clusters | ∞, -, - | 13513 seconds, 722.75 Mbytes, 18686 clusters | - | 5539 clusters |
60 | 81.48 Mbytes | 3245394 | 40.32 seconds, 94425 Kbytes, 18830 clusters | ∞, -, - | - | - | - |
90 | 225.07 Mbytes | 8865949 | 84.12 seconds, 245024 Kbytes, 27199 clusters | ∞, -, - | - | - | - |
120 | 365.38 Mbytes | 14403218 | 199.56 seconds, 335972 Kbytes, 29379 clusters | ∞, -, - | - | - | - |
Here is the result of testing the accuracy of the three programs using a dataset of three species (human, fly, worm) for which manually curated data are available. Each entry in the table denotes the number of clusters.
Manually curated data (A) | QuickParanoid (B) | MultiParanoid (C) | OrthoMCL (D) | A ∩ B | A ∩ C | A ∩ D | A ∩ B ∩ C | A ∩ B ∩ D |
221 | 5620 | 5722 | 5635 | 135 | 137 | 98 | 135 | 92 |
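To put the accuracy table in proportion, the counts A ∩ B, A ∩ C and A ∩ D can be expressed as fractions of the 221 manually curated clusters. The following is a minimal sketch that uses only the numbers from the table; the function name recovered_fraction is hypothetical and not part of QuickParanoid.

```python
# Counts copied from the accuracy table (three-species dataset).
CURATED = 221  # manually curated clusters (A)
shared_with_curated = {
    "QuickParanoid": 135,  # A ∩ B
    "MultiParanoid": 137,  # A ∩ C
    "OrthoMCL": 98,        # A ∩ D
}


def recovered_fraction(found, total=CURATED):
    """Fraction of the curated clusters that a program also finds."""
    return found / total


for program, found in shared_with_curated.items():
    print(f"{program}: {found}/{CURATED} = {recovered_fraction(found):.1%}")
```

This gives roughly 61% for QuickParanoid, 62% for MultiParanoid and 44% for OrthoMCL.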
Among the 5620 clusters found by QuickParanoid, 98.52% (5537 clusters) exactly match those found by MultiParanoid. The following graph shows the distribution of clusters found by QuickParanoid against their similarity with those found by MultiParanoid (on a logarithmic scale). A similarity value p means that a fraction p of the sequences in a cluster found by QuickParanoid are included in some similar cluster found by MultiParanoid. p = 1.0 means that exactly the same cluster is found both by QuickParanoid and by MultiParanoid.
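The similarity value p can be illustrated with a short sketch. The text does not say how the "similar cluster" is chosen, so this example assumes it is the MultiParanoid cluster with the largest overlap; the function name similarity and the toy sequence names are hypothetical. Under this assumption p also reaches 1.0 when a QuickParanoid cluster is a proper subset of a MultiParanoid cluster, so the sketch reports exact matches separately.

```python
def similarity(qp_cluster, mp_clusters):
    """Return (p, exact) for one QuickParanoid cluster.

    p     -- largest fraction of the cluster's sequences contained in any
             single MultiParanoid cluster (choosing the best match is an
             assumption, not something the source specifies)
    exact -- True only when a best-matching cluster has identical members
    """
    qp = set(qp_cluster)
    if not qp:
        raise ValueError("empty cluster")
    best_p, exact = 0.0, False
    for other in mp_clusters:
        other = set(other)
        p = len(qp & other) / len(qp)
        # Prefer a higher overlap; on ties, prefer an identical cluster.
        if p > best_p or (p == best_p and other == qp):
            best_p, exact = p, other == qp
    return best_p, exact


# Toy data: three of the four sequences fall into one MultiParanoid cluster.
print(similarity(["a", "b", "c", "d"], [["a", "b", "c", "x"], ["d"]]))
```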
    [gla@plquad:40 ] make qa
    g++ -o qa1 qa1.cpp
    gcc -c hashtable.c -o hashtable.o
    g++ -o qa2 hashtable.o qa2.cpp

If successful, you are ready to use QuickParanoid. If not, make sure that both gcc and make are installed on your system, and edit Makefile appropriately (two variables CC and CPP in it).
    [gla@plquad:43 ] qp
    =====================================================
    QuickParanoid
    =====================================================
    Dataset directory [default = "." (current directory)]: fly-human-worm
    Data file prefix [default = "sqltable."]:
    Data file separator [default = "-"]:
    Configuration file [default = "fly-human-worm/config"]:
    Executable file prefix [default = "test"]:
    Generating a header file.....
    Updating Makefile.....
    Generating executable files......
    g++ -o dump dump.cpp
    ./dump fly-human-worm/config
    Reading the config file
    Reading the data files
    ...
    g++ -o gen_header gen_header.cpp
    ./gen_header fly-human-worm/config __ortholog.h
    Reading the config file
    Reading the data files
    Opening fly-human-worm/sqltable.fly2k-human2k
    Opening fly-human-worm/sqltable.fly2k-worm2k
    Opening fly-human-worm/sqltable.human2k-worm2k
    Generating structure definitions
    Generating functions
    Generating species
    Generating sequences
    gcc -c ortholog.c -o ortholog.o
    gcc -c hashtable_itr.c -o hashtable_itr.o
    gcc -o test hashtable.o hashtable_itr.o ortholog.o qp.c -lm
    gcc -o tests qps.c
    Done.
    Run test to perform ortholog clustering.
    Run tests to see the dataset size and the number of entries.
    [gla@plquad:44 ] test > result.txt
    [gla@plquad:45 ] tests
    Dataset size in bytes: 1404502
    Number of entries in the dataset: 44752
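The session above opens three pairwise InParanoid files for three species; in general, n species give n * (n - 1) / 2 files. The following is a minimal sketch that lists the file names one would expect in a dataset directory. It assumes that each name is the prefix, the first species, the separator and the second species, with species taken in sorted order. That rule is inferred from the three file names in the transcript and is not confirmed by the source; the function name expected_pair_files is hypothetical.

```python
from itertools import combinations


def expected_pair_files(species, prefix="sqltable.", separator="-"):
    """List one pairwise file name per unordered pair of species.

    The naming rule (prefix + A + separator + B, species sorted) is an
    assumption inferred from the example transcript.
    """
    return [
        f"{prefix}{a}{separator}{b}"
        for a, b in combinations(sorted(species), 2)
    ]


for name in expected_pair_files(["human2k", "fly2k", "worm2k"]):
    print(name)

# 120 species give 120 * 119 / 2 pairwise files.
print(len(expected_pair_files([f"sp{i}" for i in range(120)])))
```

Comparing such a list against the actual directory contents is one way to spot a missing pairwise file before running qp.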
Each line in the standard InParanoid output format consists of (1) cluster ID number; (2) InParanoid score; (3) species name; (4) seed score; (5) sequence name, as in the following example:

    3 4425 fly2k 1.000 gi7303993

Optionally each line may contain a bootstrap value in an extra column after the sequence name, as in the following example:

    11 3607 ensMONDO.fa 1.000 ENSMODP00000019772 100%

Bootstrap values are not used by QuickParanoid and are ignored.
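The five-column format described above can be read with a few lines of Python. This is only an illustrative sketch, not QuickParanoid's own parser: it assumes that the columns are separated by whitespace, that the cluster ID and InParanoid score are integers and that the seed score is a decimal number, as in the two example lines. The names Entry and parse_line are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    cluster_id: int
    score: int         # InParanoid score (assumed to be an integer)
    species: str
    seed_score: float
    sequence: str


def parse_line(line):
    """Parse one InParanoid output line, ignoring any bootstrap column."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    cluster_id, score, species, seed_score, sequence = fields[:5]
    return Entry(int(cluster_id), int(score), species,
                 float(seed_score), sequence)


print(parse_line("3 4425 fly2k 1.000 gi7303993"))
# The trailing bootstrap value ("100%") is dropped.
print(parse_line("11 3607 ensMONDO.fa 1.000 ENSMODP00000019772 100%"))
```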