QuickParanoid - A Tool for Ortholog Clustering


Introduction

QuickParanoid is a suite of programs for automatic ortholog clustering and analysis. It takes as input a collection of files produced by InParanoid and finds ortholog clusters among multiple species. For a given dataset, QuickParanoid first preprocesses each InParanoid output file and then computes ortholog clusters. It also provides two programs, qa1 and qa2, for analyzing the result of ortholog clustering.

QuickParanoid is similar to MultiParanoid and OrthoMCL in functionality, but is much faster. For example, it takes only 199.56 seconds on an Intel 2.4 GHz machine with 1 gigabyte of memory to process a dataset of 120 species (which contains 14403218 entries in 120 * 119 / 2 = 7140 InParanoid output files with a total size of 365.38 megabytes). In comparison, MultiParanoid on the same machine fails to process a dataset of 20 species (which contains 319368 entries), and OrthoMCL fails to process a dataset of 60 species (which contains 3245394 entries). The accuracy is also comparable to that of MultiParanoid and OrthoMCL. For example, for a dataset of 3 species, QuickParanoid recovers 135 of the 221 clusters in the manually curated data, while MultiParanoid and OrthoMCL recover 137 clusters and 98 clusters, respectively.
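
The file count quoted above follows from InParanoid producing one output file per unordered pair of species. A minimal sketch of that arithmetic is given here; the function name num_pairwise_files is illustrative and is not part of QuickParanoid.

```c
#include <assert.h>

/* Number of pairwise InParanoid output files for a dataset of
 * num_species species: one file per unordered pair of distinct
 * species, i.e. n * (n - 1) / 2.
 * (Hypothetical helper, not part of QuickParanoid.) */
long num_pairwise_files(long num_species)
{
    assert(num_species >= 0);
    return num_species * (num_species - 1) / 2;
}
```

For the 120-species dataset, num_pairwise_files(120) gives the 7140 files mentioned in the text.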

The results below compare the speed and memory usage of the three programs on eight different datasets. All experiments were performed on an Intel 2.4 GHz machine with 1 gigabyte of memory running Debian Linux 2.6.21-6. Memory usage for MultiParanoid and OrthoMCL was measured using top. Note that in the experiment with 120 species, QuickParanoid finds a cluster consisting of sequences from all 120 species!

Running time (seconds)

Number of   Dataset size   Number of entries   QuickParanoid   MultiParanoid   OrthoMCL
species     (Mbytes)       in the dataset
5           0.38           14664               0.15            2.03            35
10          1.86           71035               0.51            48.78           175
15          4.24           164173              1.09            5107.86         600
20          8.19           319368              2.75            ∞               1150
40          35.14          1407029             13.10           ∞               13513
60          81.48          3245394             40.32           ∞               -
90          225.07         8865949             84.12           ∞               -
120         365.38         14403218            199.56          ∞               -

Memory usage

Number of   QuickParanoid    MultiParanoid   OrthoMCL
species
5           600 Kbytes       38 Mbytes       31.00 Mbytes
10          1584 Kbytes      140 Mbytes      61.75 Mbytes
15          3117 Kbytes      314 Mbytes      112.45 Mbytes
20          4764 Kbytes      -               186.50 Mbytes
40          28655 Kbytes     -               722.75 Mbytes
60          94425 Kbytes     -               -
90          245024 Kbytes    -               -
120         335972 Kbytes    -               -

Number of clusters found

Number of   QuickParanoid   MultiParanoid   OrthoMCL   Both QuickParanoid   Both QuickParanoid
species                                                and MultiParanoid    and OrthoMCL
5           2208            2293            2787       2091                 1372
10          3034            3218            4466       2737                 1882
15          4242            4515            6849       3767                 2751
20          4934            -               8477       -                    3259
40          9003            -               18686      -                    5539
60          18830           -               -          -                    -
90          27199           -               -          -                    -
120         29379           -               -          -                    -

An entry of ∞ or - means that the program failed to process the dataset, so no value is available.

Here is the result of testing the accuracy of the three programs using a dataset of three species (human, fly, worm) for which manually curated data are available. Each entry in the table denotes the number of clusters.

Manually curated data (A)   QuickParanoid (B)   MultiParanoid (C)   OrthoMCL (D)   A ∩ B   A ∩ C   A ∩ D   A ∩ B ∩ C   A ∩ B ∩ D
221                         5620                5722                5635           135     137     98      135         92

Among the 5620 clusters found by QuickParanoid, 98.52% (5537 clusters) exactly match those found by MultiParanoid. The following graph shows the distribution of clusters found by QuickParanoid against their similarity with those found by MultiParanoid (in logarithmic scale). A similarity value p means that a fraction p of the sequences in a cluster found by QuickParanoid are included in some similar cluster found by MultiParanoid. p = 1.0 means that exactly the same cluster is found by both QuickParanoid and MultiParanoid.
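
To make the similarity value p concrete, here is a minimal sketch that computes it for one pair of clusters. It assumes each cluster is given as an array of integer sequence ids without duplicates. The text does not say how the "similar" MultiParanoid cluster is chosen, so that selection step is left out, and the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if id occurs in ids[0..n-1], 0 otherwise (linear scan). */
static int contains(const int *ids, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (ids[i] == id)
            return 1;
    return 0;
}

/* Fraction p of the sequences in cluster a that also appear in
 * cluster b. Clusters are arrays of integer sequence ids with no
 * duplicates; a must be non-empty.
 * (Hypothetical helper: how the candidate cluster b is selected is
 * an assumption left to the caller.) */
double similarity(const int *a, size_t na, const int *b, size_t nb)
{
    assert(na > 0);
    size_t shared = 0;
    for (size_t i = 0; i < na; i++)
        shared += (size_t)contains(b, nb, a[i]);
    return (double)shared / (double)na;
}
```

With a = {1, 2, 3, 4} and b = {2, 3, 4, 5}, three of the four sequences in a are found in b, so p = 0.75. Comparing a cluster with an identical one gives p = 1.0.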


Installation


Quick guide to QuickParanoid


Usage


Datasets


Extending QuickParanoid

QuickParanoid preprocesses InParanoid output files to replace all string operations by much faster integer operations. If you wish to write your own program for analyzing InParanoid output files (e.g., another ortholog clustering program), you can use our code for preprocessing InParanoid output files so that you can concentrate on implementing your algorithm rather than handling input/output. This section explains how to use our code when writing such a program.

If you follow the instructions in the Usage section to specify the dataset directory, the data file prefix, etc., a new header file __ortholog.h is created. The part of __ortholog.h that declares structures for reading InParanoid output files is as follows:

// sequence ---------------------------------------
typedef struct{
  int species_id;
  double seed;
  int sequence_id;
} sequence;

// cluster ----------------------------------------
typedef struct{
  int score;
  int num_of_sequences;
  sequence *sequences;
} cluster;

// dataFile ----------------------------------------
// invariant: species_id1 must be prior to species_id2 in the species table.
typedef struct{
  int species_id1;
  int species_id2;
  int num_of_clusters;
  cluster *clusters;
} dataFile;

//-------------------------------------------------------
// load_dataFile takes a pair of species ids and a dataFile
//           and parses and loads the data file of two species.
// invariant: species_id1 < species_id2
void load_dataFile(int species_id1, int species_id2, dataFile* data);

// free_dataFile takes a pointer of a datafile and frees all memory.
void free_dataFile(dataFile* data);
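
As an illustration of how these structures fit together, the following sketch walks over a dataFile that has already been filled in. The three typedefs are copied from the excerpt above only so that the sketch compiles on its own; the helper functions are hypothetical and are not part of QuickParanoid.

```c
#include <assert.h>
#include <stddef.h>

/* Copied from the excerpt above so that this sketch is self-contained.
 * In a real program, include the generated header instead. */
typedef struct {
    int species_id;
    double seed;
    int sequence_id;
} sequence;

typedef struct {
    int score;
    int num_of_sequences;
    sequence *sequences;
} cluster;

typedef struct {
    int species_id1;
    int species_id2;
    int num_of_clusters;
    cluster *clusters;
} dataFile;

/* Total number of sequences over all clusters (hypothetical helper). */
long count_sequences(const dataFile *data)
{
    assert(data != NULL);
    long total = 0;
    for (int c = 0; c < data->num_of_clusters; c++)
        total += data->clusters[c].num_of_sequences;
    return total;
}

/* Number of sequences belonging to one species (hypothetical helper). */
long count_sequences_of_species(const dataFile *data, int species_id)
{
    assert(data != NULL);
    long total = 0;
    for (int c = 0; c < data->num_of_clusters; c++) {
        const cluster *cl = &data->clusters[c];
        for (int s = 0; s < cl->num_of_sequences; s++)
            if (cl->sequences[s].species_id == species_id)
                total++;
    }
    return total;
}

/* Index of the cluster with the highest score, or -1 if there are
 * no clusters (hypothetical helper). */
int best_cluster_index(const dataFile *data)
{
    assert(data != NULL);
    int best = -1;
    for (int c = 0; c < data->num_of_clusters; c++)
        if (best < 0 || data->clusters[c].score > data->clusters[best].score)
            best = c;
    return best;
}

/* Intended use with the declarations from __ortholog.h (not compiled here):
 *
 *   dataFile d;
 *   load_dataFile(_fly2k, _human2k, &d);   // species_id1 < species_id2
 *   long n = count_sequences(&d);
 *   free_dataFile(&d);
 */
```

The helpers only read the structures, so they take a const pointer and leave loading and freeing to load_dataFile and free_dataFile.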

__ortholog.h also declares arrays for converting species numbers to species names and sequence numbers to sequence names. Your program may need to access these arrays when generating output files.

// num of species
#define  NUM_OF_SPECIES 3

// species id
// invariant: species id starts at 0.
#define _fly2k    0
#define _human2k    1
#define _worm2k   2

// species table
static const int species_table [] =
  {_fly2k, _human2k, _worm2k};

// species names
// usage: species_names [species id]
static const char* species_names [] =
  {"fly2k", "human2k", "worm2k"};

// num of sequence
#define  NUM_OF_SEQUENCES 28494

// sequence names
// usage: sequence_names [sequence id]
char sequence_names[NUM_OF_SEQUENCES][15];
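
A minimal sketch of using these arrays when writing output is shown below. The declarations are copied from the excerpt above so that the sketch compiles on its own. The function format_sequence and the "species:sequence" output format are assumptions made for illustration, and the sketch assumes that sequence_names has already been filled in.

```c
#include <assert.h>
#include <stdio.h>

/* Copied from the excerpt above so that this sketch is self-contained.
 * In a real program, include the generated header instead. */
#define NUM_OF_SPECIES 3
static const char *species_names[] = {"fly2k", "human2k", "worm2k"};
#define NUM_OF_SEQUENCES 28494
char sequence_names[NUM_OF_SEQUENCES][15];

/* Writes "<species name>:<sequence name>" into buf (at most size bytes,
 * always null-terminated when size > 0) and returns the value returned
 * by snprintf, i.e. the length the full string would have.
 * (Hypothetical helper; the output format is an assumption.) */
int format_sequence(int species_id, int sequence_id, char *buf, size_t size)
{
    assert(species_id >= 0 && species_id < NUM_OF_SPECIES);
    assert(sequence_id >= 0 && sequence_id < NUM_OF_SEQUENCES);
    return snprintf(buf, size, "%s:%s",
                    species_names[species_id],
                    sequence_names[sequence_id]);
}
```

Both ids are range-checked before indexing, since an out-of-range id would read past the end of the arrays.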

If you need to retrieve a species number from a species name (a character string), use the hash table ht_speciesName2Id. To retrieve a sequence number from a sequence name (also a character string), use the hash table ht_seqName2Id. Both hash tables are declared in __ortholog.h, and hashtable.h explains how to search these hash tables.
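
The hash table interface itself is documented in hashtable.h and is not reproduced here. Purely to illustrate the name-to-number mapping, the following sketch performs the same lookup with a linear scan over species_names. It is a stand-in and not the hash table API; the function name is hypothetical, and the declarations are copied from the excerpt above so that the sketch compiles on its own.

```c
#include <assert.h>
#include <string.h>

/* Copied from the excerpt above so that this sketch is self-contained.
 * In a real program, include the generated header instead. */
#define NUM_OF_SPECIES 3
static const char *species_names[] = {"fly2k", "human2k", "worm2k"};

/* Returns the species id for the given name, or -1 if the name is
 * unknown. This is a linear scan, not the ht_speciesName2Id hash
 * table (hypothetical helper for illustration only). */
int species_id_from_name(const char *name)
{
    assert(name != NULL);
    for (int id = 0; id < NUM_OF_SPECIES; id++)
        if (strcmp(species_names[id], name) == 0)
            return id;
    return -1;
}
```

A linear scan is adequate for a handful of species, but for the 28494 sequence names a hash table such as ht_seqName2Id is the better fit.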

Your program should include __ortholog.h. An easy way to do this is to include the following two lines:

#include "qp.h"
#include INTERMEDIATE_HEADER_FILE

You can compile your program in the same way that qp.c is compiled. If your program is foo.c, you can compile it as follows:

gcc hashtable.o hashtable_itr.o ortholog.o foo.c -lm

That's it!


Email us at .

Programming Language Laboratory
Department of Computer Science and Engineering
Pohang University of Science and Technology
Republic of Korea