Electronically inferred events in Reactome

We use the set of manually curated human reactions to electronically infer reactions in twenty-two evolutionarily divergent species for which high-quality whole-genome sequence data are available, and hence a comprehensive and high-quality set of protein predictions exists. These species include the laboratory mouse and rat, the nematode C. elegans, budding and fission yeasts, two plants and several bacteria. The estimated success rate of our orthology inference strategy can be stated as 'the percentage of eligible reactions, defined in step 2 below, in the current human data set for which an event can be inferred in the model organism'. By this measure, success rates range from 96.80% for the laboratory mouse to 7.55% for the archaebacterium Methanococcus jannaschii.

Electronic inference proceeds in four steps.

1) Protein orthology data were obtained from the OrthoMCL DB, Version 2. Briefly, the OrthoMCL clustering procedure (Li et al. (2003), Feng et al. (2006)) works as follows: An all-against-all BLASTP is performed on all human and model organism proteins. Reciprocal best similarity pairs between species, and reciprocal better similarity pairs within species (i.e., recently arisen paralogues, proteins that are more similar to each other within one species than to any protein in the other species) are entered into a similarity matrix. The matrix is normalized by species and subjected to Markov clustering to generate orthologue groups including recent paralogues. The OrthoMCL clustering procedure was changed between versions 1 and 2 to consider only the longest transcript of each gene. We in turn have switched to a gene-based rather than a protein-based method for mapping the Ensembl identifiers used by OrthoMCL to the UniProt accessions used in Reactome. This change has improved our success rate for electronic inference, with no measurable effect on accuracy.
2) All human reactions in the Reactome knowledgebase involving one or more proteins are eligible for electronic inference, with two exceptions. Reactions that were themselves inferred based on data from the model organism, and reactions involving species in addition to human (e.g., HIV infection of human cells) are excluded from electronic inference. Eligible reactions are checked to determine whether each involved protein has at least one OrthoMCL orthologue or recent paralogue (OP) in the model organism. If a human reaction involves a complex, at least 75% of the accessioned protein components of the human complex must have OPs in the model organism.

3) For each reaction that meets these criteria, an equivalent reaction is created for the model organism by replacing each human protein with its model organism OP. If a human protein corresponds to more than one model organism OP, a DefinedSet called 'Homologues of ...' is created, with the model organism OPs as members. For human proteins that lack a model organism OP but that are included in complexes inferred due to the 75% threshold rule, placeholder model organism entities (called 'Ghost homologue of...') are created.

4) If this analysis generates reactions in the model organism corresponding to any of the steps of a human pathway, then the pathway event is also inferred for the model organism.

These electronically inferred reactions are predictions based on a number of assumptions. Most basically, we assume that if we can find model organism OPs corresponding to all proteins involved in a human reaction, then the proteins mediate the same reaction in the model organism. This may not be true. On the other hand, we may miss a truly orthologous reaction in the model organism because it is mediated by structurally divergent proteins and the OrthoMCL strategy failed to identify them. Similarly, complexes sharing less than 75% orthologous subunits between species may nevertheless continue to perform the same function. The electronically inferred reactions presented in Reactome are thus not data, but hypotheses useful to direct the design of confirmatory experiments.
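
The eligibility check, the 75% complex rule and the success-rate measure described in steps 2 and 3 can be illustrated with a minimal Python sketch. This is not the Reactome implementation. The `Reaction` class, the function names, the tuple tags (`"protein"`, `"defined_set"`, `"ghost"`) and the `ops` dictionary (human protein to list of model organism OPs) are all hypothetical names introduced for illustration. The sketch also assumes that proteins participating outside a complex must each have at least one OP, while the 75% threshold applies to each complex separately.

```python
from __future__ import annotations
from dataclasses import dataclass, field

COMPLEX_THRESHOLD = 0.75  # from the text: at least 75% of complex components need an OP


@dataclass
class Reaction:
    """Hypothetical, simplified stand-in for a curated human reaction."""
    name: str
    proteins: list[str] = field(default_factory=list)         # proteins outside complexes
    complexes: list[list[str]] = field(default_factory=list)  # protein components per complex
    inferred_from_model: bool = False  # itself inferred from model organism data
    multi_species: bool = False        # involves species in addition to human


def is_eligible(r: Reaction) -> bool:
    """Step 2: involves at least one protein and is not one of the two exceptions."""
    involves_protein = bool(r.proteins) or any(r.complexes)
    return involves_protein and not (r.inferred_from_model or r.multi_species)


def map_protein(protein: str, ops: dict[str, list[str]]) -> tuple:
    """Step 3: replace one human protein with its model organism counterpart."""
    hits = ops.get(protein, [])
    if not hits:
        return ("ghost", protein)          # placeholder entity, no OP found
    if len(hits) == 1:
        return ("protein", hits[0])        # single OP, direct replacement
    return ("defined_set", tuple(hits))    # several OPs grouped as a set


def infer_reaction(r: Reaction, ops: dict[str, list[str]]) -> dict | None:
    """Return the inferred reaction, or None if the criteria are not met."""
    if not is_eligible(r):
        return None
    # Assumption: every protein outside a complex needs at least one OP.
    if any(not ops.get(p) for p in r.proteins):
        return None
    # Each complex must reach the 75% threshold.
    for members in r.complexes:
        with_op = sum(1 for p in members if ops.get(p))
        if members and with_op / len(members) < COMPLEX_THRESHOLD:
            return None
    return {
        "name": r.name,
        "proteins": [map_protein(p, ops) for p in r.proteins],
        "complexes": [[map_protein(p, ops) for p in c] for c in r.complexes],
    }


def success_rate(reactions: list[Reaction], ops: dict[str, list[str]]) -> float:
    """Percentage of eligible reactions for which an event can be inferred."""
    eligible = [r for r in reactions if is_eligible(r)]
    if not eligible:
        return 0.0
    inferred = sum(1 for r in eligible if infer_reaction(r, ops) is not None)
    return 100.0 * inferred / len(eligible)
```

In this sketch, a complex with four components of which three have OPs just meets the threshold, so the reaction is inferred and the fourth component is carried along as a ghost placeholder. Ineligible reactions are left out of the denominator of the success rate, matching the definition quoted above.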

Date: 2010-03-13 01:10:26