QuickParanoid - A Tool for Ortholog Clustering


Introduction

QuickParanoid is a suite of programs for automatic ortholog clustering and analysis. It takes as input a collection of files produced by InParanoid and finds ortholog clusters among multiple species. For a given dataset, QuickParanoid first preprocesses each InParanoid output file and then computes ortholog clusters. It also provides two programs, qa1 and qa2, for analyzing the result of ortholog clustering.

QuickParanoid is similar to MultiParanoid and OrthoMCL in functionality, but is much faster. For example, it takes only 199.56 seconds on an Intel 2.4 GHz machine with 1 gigabyte of memory to process a dataset of 120 species (which contains 14403218 entries in 120 * 119 / 2 = 7140 InParanoid output files with a total size of 365.38 megabytes). In comparison, MultiParanoid on the same machine fails to process a dataset of 20 species (which contains 319368 entries), and OrthoMCL fails to process a dataset of 60 species (which contains 3245394 entries). Its accuracy is also comparable to that of MultiParanoid and OrthoMCL. For example, for a dataset of 3 species, QuickParanoid recovers 135 of the 221 clusters in the manually curated data, while MultiParanoid and OrthoMCL recover 137 and 98 clusters, respectively.
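The file count quoted above follows from having one InParanoid output file per unordered pair of species. A minimal Python sketch of that arithmetic, applied to the dataset sizes used in the experiments (the function name pairwise_file_count is illustrative and is not part of QuickParanoid):

```python
from math import comb


def pairwise_file_count(num_species: int) -> int:
    """Number of unordered species pairs, i.e. n * (n - 1) / 2."""
    return comb(num_species, 2)


# Species counts of the eight datasets used in the experiments.
for n in (5, 10, 15, 20, 40, 60, 90, 120):
    print(n, pairwise_file_count(n))
```

Because the number of input files grows quadratically with the number of species, the 120-species dataset already involves 7140 files.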

The results below compare the speed and memory usage of the three programs on eight different datasets. All experiments were performed on an Intel 2.4 GHz machine with 1 gigabyte of memory running Debian Linux 2.6.21-6. Memory usage for MultiParanoid and OrthoMCL was measured using top. Note that in the experiment with 120 species, QuickParanoid finds a cluster consisting of sequences from all 120 species!

Datasets and QuickParanoid results

Number of species    Dataset size     Number of entries    Running time      Memory usage     Clusters found
5                    0.38 Mbytes      14664                0.15 seconds      600 Kbytes       2208
10                   1.86 Mbytes      71035                0.51 seconds      1584 Kbytes      3034
15                   4.24 Mbytes      164173               1.09 seconds      3117 Kbytes      4242
20                   8.19 Mbytes      319368               2.75 seconds      4764 Kbytes      4934
40                   35.14 Mbytes     1407029              13.10 seconds     28655 Kbytes     9003
60                   81.48 Mbytes     3245394              40.32 seconds     94425 Kbytes     18830
90                   225.07 Mbytes    8865949              84.12 seconds     245024 Kbytes    27199
120                  365.38 Mbytes    14403218             199.56 seconds    335972 Kbytes    29379

MultiParanoid results

Number of species    Running time       Memory usage    Clusters found
5                    2.03 seconds       38 Mbytes       2293
10                   48.78 seconds      140 Mbytes      3218
15                   5107.86 seconds    314 Mbytes      4515
20                   ∞                  -               -
40                   ∞                  -               -
60                   ∞                  -               -
90                   ∞                  -               -
120                  ∞                  -               -

OrthoMCL results

Number of species    Running time     Memory usage     Clusters found
5                    35 seconds       31.00 Mbytes     2787
10                   175 seconds      61.75 Mbytes     4466
15                   600 seconds      112.45 Mbytes    6849
20                   1150 seconds     186.50 Mbytes    8477
40                   13513 seconds    722.75 Mbytes    18686
60                   -                -                -
90                   -                -                -
120                  -                -                -

Number of clusters found in common

Number of species    QuickParanoid and MultiParanoid    QuickParanoid and OrthoMCL
5                    2091                               1372
10                   2737                               1882
15                   3767                               2751
20                   -                                  3259
40                   -                                  5539
60                   -                                  -
90                   -                                  -
120                  -                                  -

The accuracy of the three programs was tested on a dataset of three species (human, fly, worm) for which manually curated data are available. Each entry in the table is a number of clusters.

Manually curated data (A)    QuickParanoid (B)    MultiParanoid (C)    OrthoMCL (D)    A ∩ B    A ∩ C    A ∩ D    A ∩ B ∩ C    A ∩ B ∩ D
221                          5620                 5722                 5635            135      137      98       135          92
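One way to read the intersection columns is as counts of clusters that appear with exactly the same members in both results. The Python sketch below assumes that reading; it is not taken from the QuickParanoid source. Each cluster is represented as a set of sequence identifiers, and the function name common_clusters and the toy identifiers are made up for illustration.

```python
def common_clusters(a, b):
    """Return the clusters that occur with identical members in both inputs."""
    # frozenset makes each cluster hashable and independent of member order.
    return {frozenset(c) for c in a} & {frozenset(c) for c in b}


# Toy data (hypothetical identifiers, not real sequences).
curated = [{"h1", "f1", "w1"}, {"h2", "f2"}]
predicted = [{"f1", "h1", "w1"}, {"h2", "f2", "w2"}, {"h3", "w3"}]

shared = common_clusters(curated, predicted)
print(len(shared))
```

In this toy example only the first curated cluster counts as shared: the second predicted cluster contains an extra sequence, so it is not an exact match.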

Among the 5620 clusters found by QuickParanoid, 98.52% (5537 clusters) exactly match clusters found by MultiParanoid. The following graph shows the distribution of clusters found by QuickParanoid against their similarity with those found by MultiParanoid (in logarithmic scale). A similarity value p means that a fraction p of the sequences in a cluster found by QuickParanoid are included in some similar cluster found by MultiParanoid. p = 1.0 means that exactly the same cluster is found by both QuickParanoid and MultiParanoid.
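The similarity value can be sketched in a few lines of Python. This is only an illustration under an assumption: "some similar cluster" is taken to mean the reference cluster with the largest overlap. Under this reading a cluster that is a strict subset of a reference cluster also scores 1.0, so the sketch may differ from the exact measure used for the graph. The function name similarity and the toy identifiers are hypothetical.

```python
def similarity(cluster, reference_clusters):
    """Largest fraction of `cluster` contained in any single reference cluster."""
    cluster = set(cluster)
    if not cluster:
        raise ValueError("cluster must not be empty")
    # Fraction of this cluster's sequences covered by each reference cluster;
    # default=0.0 handles an empty list of reference clusters.
    return max(
        (len(cluster & set(ref)) / len(cluster) for ref in reference_clusters),
        default=0.0,
    )


# Toy data (hypothetical identifiers, not real sequences).
quickparanoid_cluster = {"h1", "f1", "w1", "w2"}
multiparanoid_clusters = [{"h1", "f1", "w1"}, {"w2", "h9"}]

p = similarity(quickparanoid_cluster, multiparanoid_clusters)
print(p)
```

Here three of the four sequences fall into the first reference cluster, so p is 0.75.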


Installation


Quick guide to QuickParanoid


Usage