Documentation for Protein-containing complex annotation for CoNECo

Named Entity Annotation

General guidelines

  • Entities that fall within the scope of the Gene Ontology term protein-containing complex are the target of named entity annotations for CoNECo.

Excerpt taken from Gene Ontology:

A protein complex in this context is meant as a stable set of interacting proteins which can be co-purified by an acceptable method, and where the complex has been shown to exist as an isolated, functional unit in vivo. Acceptable experimental methods include stringent protein purification followed by detection of protein interaction. The following methods should be considered non-acceptable: simple immunoprecipitation, pull-down experiments from cell extracts without further purification, colocalization and 2-hybrid screening.

Interactions that should not be captured as protein complexes include:

  1) enzyme/substrate, receptor/ligand or any similar transient interactions, unless these are a critical part of the complex assembly or are required e.g. for the receptor to be functional;
  2) proteins associated in a pull-down/co-immunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose-affinity complex;
  3) any complex where the only evidence is based on genetic interaction data;
  4) partial complexes, where some subunits (e.g. transmembrane ones) cannot be expressed as recombinant proteins and are excluded from experiments (in this case, independent evidence is necessary to find out the composition of the full complex, if known).

Interactions that may be captured as protein complexes include:

  1) enzyme/substrate or receptor/ligand if the complex can only assemble and become functional in the presence of both classes of subunits;
  2) complexes where one of the members has not been shown to be physically linked to the other(s), but is a homologue of, and has the same functionality as, a protein that has been experimentally demonstrated to form a complex with the other member(s);
  3) complexes whose existence is accepted based on localization and pharmacological studies, but for which experimental evidence is not yet available for the complex as a whole.

  • If a term is found in Gene Ontology but it is NOT a protein-containing complex, then it will NOT be considered a Complex in this effort (e.g. ribosome).
  • Some groups of Complexes are protein-containing complexes in GO, but will not be annotated as Complex in this effort. They are listed at the end of this page.
  • Gene Ontology-Cellular Component (GO-CC), and specifically the GO terms under protein-containing complex, will be the target for Named Entity Normalization.
  • If a protein-containing complex can be identified as such by the annotators, but there is no entry in Gene Ontology corresponding to the entity, then it will receive an annotation as a named entity (NE), but will not have a normalization to GO-CC.
  • The first resource that is trusted to resolve issues is Gene Ontology. If there is still not enough information there, inconsistencies will be resolved using Complex Portal, Reactome, CORUM, and, as a last resort, the literature.
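
The rules above can be read as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the names `Decision` and `decide` and all of their parameters are hypothetical and not part of any CoNECo tooling, the check of whether a GO term falls under protein-containing complex is assumed to happen elsewhere, and only exact matches against the exclusion list are considered.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Decision:
    annotate: bool                 # mark the mention as a Complex named entity?
    go_id: Optional[str] = None    # GO-CC normalization target, if any


def decide(judged_complex: bool,
           go_id: Optional[str],
           under_complex_branch: bool,
           excluded_ids: Set[str]) -> Decision:
    """Hypothetical helper mirroring the general guidelines.

    judged_complex       -- annotators consider the mention a protein-containing complex
    go_id                -- matching GO term, or None if GO has no entry
    under_complex_branch -- the GO term is a protein-containing complex (assumed precomputed)
    excluded_ids         -- GO identifiers from the exclusion list
    """
    if go_id is not None:
        # Found in GO: annotate only if it is a complex and not on the exclusion list.
        if not under_complex_branch or go_id in excluded_ids:
            return Decision(annotate=False)
        return Decision(annotate=True, go_id=go_id)
    # Not in GO: annotate as a named entity, but without normalization.
    return Decision(annotate=judged_complex)
```

With this sketch, a term that is found in GO outside the protein-containing complex branch (such as ribosome) is not annotated, while a complex with no GO entry is annotated with `go_id` left as `None`.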

Annotation span

For each mention of a protein-containing complex (Complex hereafter) name, the annotation aims to mark the minimal span containing the full name of the entity mentioned in the text so that the marked span starts and ends on a boundary between an alphanumeric string and a non-alphanumeric character (e.g. space or hyphen). The following provides examples and guidelines for exceptional cases.
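
The boundary condition stated above can be checked mechanically. The following Python sketch is a minimal illustration, not part of the guidelines: the function names `is_boundary` and `is_valid_span` are hypothetical, spans are assumed to be character offsets with an exclusive end, and the start and end of the text are treated as non-alphanumeric. It checks only the boundary condition, not whether the span is minimal.

```python
def is_boundary(text: str, pos: int) -> bool:
    """True if offset pos lies between an alphanumeric and a non-alphanumeric character.

    The positions before the first and after the last character count as non-alphanumeric.
    """
    before = text[pos - 1].isalnum() if pos > 0 else False
    after = text[pos].isalnum() if pos < len(text) else False
    return before != after


def is_valid_span(text: str, start: int, end: int) -> bool:
    """True if text[start:end] starts and ends on such a boundary."""
    if not (0 <= start < end <= len(text)):
        return False
    return is_boundary(text, start) and is_boundary(text, end)
```

For instance, in "the human NFkappaB complex" the span covering "NFkappaB" passes this check, whereas a span covering only "NFkappaB" inside "hNFkappaB" does not, because it would start between two alphanumeric characters.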

Modifiers and head words that are not part of the name are excluded from the annotated span, for example (annotated span in square brackets):

  • human [NFkappaB] complex
  • phosphorylated [RNA polymerase II]

This exclusion also applies to the word “complex”, which should not be annotated as part of the entity.

Affixes and markers are similarly excluded from the annotation span, even when they are part of the same syntactic word as a Complex name, as long as there is a separating non-alphanumeric character, for example (annotated span in square brackets):

  • phospho-[RNA polymerase II]

However, annotation boundaries must coincide with the boundary between an alphanumeric and a non-alphanumeric character. In cases where a Complex name is written with one of the following regular affixes without such a boundary, the affix is included in the annotated span:

  • species identifier, e.g. h for human or m for mouse
  • p for phosphorylated
  • wt for wild-type
  • si for small interfering RNA
  • sh for small/short hairpin RNA
  • anti for antibody

Thus, for example (annotated span in square brackets):

  • the human complex [hNFkappaB], and the murine complex [mNFkappaB]
  • [pRNApolII] (phospho-RNA polymerase II)
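
The affix rule can be illustrated by expanding a candidate span until both ends fall on a permitted boundary. The sketch below is an assumption-laden illustration rather than a prescribed procedure: `snap_to_boundaries` is a hypothetical name, offsets use an exclusive end, and the candidate span is assumed to begin and end with alphanumeric characters. It expands over any attached alphanumeric characters, not only the regular affixes listed above.

```python
def snap_to_boundaries(text: str, start: int, end: int) -> tuple:
    """Expand [start, end) outward until neither end splits an alphanumeric run."""
    # Move the start left while it sits between two alphanumeric characters.
    while start > 0 and text[start - 1].isalnum() and text[start].isalnum():
        start -= 1
    # Move the end right while it sits between two alphanumeric characters.
    while end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        end += 1
    return start, end


def annotated_text(text: str, name: str) -> str:
    """Return the snapped span for the first occurrence of name in text."""
    start = text.index(name)
    start, end = snap_to_boundaries(text, start, start + len(name))
    return text[start:end]
```

Here `annotated_text("the murine complex mNFkappaB", "NFkappaB")` yields the span including the species affix, while for "phospho-RNA polymerase II" the hyphen already provides a boundary, so the span for "RNA polymerase II" is left unchanged.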

Specific rules for NER

  • When a protein-containing complex name coincides with the name of a gene or gene product (GGP), with the names of the GGPs comprising it, or with the name of a Protein Family, separated by ANY punctuation, annotation of the Protein or Protein Family named entities is preferred over annotation of the protein-containing complex entity.

  • ERMES complex (otherwise known as Mdm10/Mdm12/Mmm1 complex)

In the example above, Mdm10/Mdm12/Mmm1 is not annotated as a single Complex entity, because it consists of three separate Protein NEs.

Two notable exceptions are Arp2/3 and SWI/SNF where a single Protein-containing complex NE is annotated instead.

  • Annotations should be applied to all variants of a Complex name: e.g. NF kappaB, NF-kappaB and NFkappaB should all be marked as Protein-containing complex.

  • PDB identifiers (e.g. 3BMP, 4BQ6) will NOT be annotated as Protein-containing complex, even if they correspond to one when checked against PDB.
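
The last two rules lend themselves to simple pattern matching when pre-screening text. The following Python sketch is only illustrative: `variant_pattern` and `looks_like_pdb_id` are hypothetical helpers, the variant pattern assumes that spelling variants differ only by an optional space or hyphen between name parts, and the PDB check assumes the classic four-character identifier format (a non-zero digit followed by three alphanumeric characters). The PDB pattern over-matches (for example, four-digit years), so it can only flag candidates for manual review.

```python
import re


def variant_pattern(parts):
    """Build a regex matching the name parts joined by nothing, a space, or a hyphen."""
    body = r"[ -]?".join(re.escape(p) for p in parts)
    return re.compile(r"\b" + body + r"\b")


# Assumed classic PDB identifier shape: one digit 1-9, then three alphanumerics.
PDB_ID = re.compile(r"[1-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(token: str) -> bool:
    """True if the whole token has the shape of a PDB identifier."""
    return PDB_ID.fullmatch(token) is not None
```

For example, `variant_pattern(["NF", "kappaB"])` matches "NF kappaB", "NF-kappaB" and "NFkappaB", and `looks_like_pdb_id("3BMP")` flags the token as a likely PDB identifier that should not be annotated.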

List of GO terms that will not be annotated as Complex

  • GO:1990104 DNA bending complex
  • GO:0140535 intracellular protein-containing complex
  • GO:1902494 catalytic complex
  • GO:0070069 cytochrome complex
  • GO:0031074 nucleocytoplasmic transport complex
  • GO:0005875 microtubule associated complex
  • GO:0001114 protein-DNA-RNA complex
  • GO:0150005 enzyme activator complex
  • GO:0019907 cyclin-dependent protein kinase activating kinase holoenzyme complex
  • GO:0044796 DNA polymerase processivity factor complex
  • GO:0098636 protein complex involved in cell adhesion
  • GO:0098635 protein complex involved in cell-cell adhesion
  • GO:0098637 protein complex involved in cell-matrix adhesion
  • GO:0043235 receptor complex
  • GO:0140368 decoy receptor complex
  • GO:1903768 taste receptor complex
  • GO:0098666 G protein-coupled serotonin receptor complex
  • GO:1990563 extracellular exosome complex
  • GO:0019036 viral transcriptional complex
  • GO:0098796 membrane protein complex
  • GO:0098803 respiratory chain complex
  • GO:1990684 protein-lipid-RNA complex
  • GO:1990686 LDL-containing protein-lipid-RNA complex
  • GO:1990685 HDL-containing protein-lipid-RNA complex
  • GO:0140513 nuclear protein-containing complex
  • GO:0035097 histone methyltransferase complex
  • GO:0000109 nucleotide-excision repair complex
  • GO:0008023 transcription elongation factor complex
  • GO:0005849 mRNA cleavage factor complex
  • GO:0000152 nuclear ubiquitin ligase complex
  • GO:0032994 protein-lipid complex
  • GO:0060987 lipid tube
  • GO:1990777 lipoprotein particle
  • GO:0030076 light-harvesting complex
  • GO:0098798 mitochondrial protein-containing complex
  • GO:0090665 glycoprotein complex
  • GO:1990351 transporter complex
  • GO:0005667 transcription regulator complex
  • GO:0017053 transcription repressor complex
  • GO:0090576 RNA polymerase III transcription regulator complex
  • GO:0090577 RNA polymerase IV transcription regulator complex
  • GO:0090578 RNA polymerase V transcription regulator complex
  • GO:1903865 sigma factor antagonist complex
  • GO:0000120 RNA polymerase I transcription regulator complex
  • GO:0090575 RNA polymerase II transcription regulator complex
  • GO:0032992 protein-carbohydrate complex
  • GO:0046806 viral scaffold
  • GO:0032300 mismatch repair complex
  • GO:0032993 protein-DNA complex
  • GO:0031588 nucleotide-activated protein kinase complex
  • GO:1990234 transferase complex
  • GO:0061695 transferase complex, transferring phosphorus-containing groups
  • GO:1902911 protein kinase complex
  • GO:0009365 protein histidine kinase complex
  • GO:1902554 serine/threonine protein kinase complex
  • GO:0000307 cyclin-dependent protein kinase holoenzyme complex
  • GO:0019908 nuclear cyclin-dependent protein kinase holoenzyme complex
  • GO:1902493 acetyltransferase complex
  • GO:0031248 protein acetyltransferase complex
  • GO:0031501 mannosyltransferase complex
  • GO:0106068 SUMO ligase complex
  • GO:1902503 adenylyltransferase complex
  • GO:0042575 DNA polymerase complex
  • GO:0030880 RNA polymerase complex
  • GO:0005965 protein farnesyltransferase complex
  • GO:0034708 methyltransferase complex
  • GO:1990228 sulfurtransferase complex
  • GO:0016459 myosin
  • GO:0045298 tubulin
  • GO:0031941 F-actin
  • GO:0005942 PI3K
  • GO:0000786 nucleosome
  • GO:0038201 mTOR
  • GO:0034360 chylomicron remnant
  • GO:0071256 translocon
  • GO:1990425 ryanodine receptor
  • GO:0034270 cytoplasm to vacuole targeting
  • GO:0106003 beta-amyloid
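
For automated filtering, the list above can be loaded into a lookup table. The sketch below is a minimal illustration, assuming that each entry is available as a line of text containing a GO identifier followed by a label, as on this page; `parse_exclusions` is a hypothetical helper name, and lines without a GO identifier are silently skipped.

```python
import re

# A GO identifier ("GO:" plus seven digits) followed by the label text.
GO_LINE = re.compile(r"(GO:\d{7})\s+(.+)")


def parse_exclusions(lines):
    """Map GO identifiers to their labels for every line that contains one."""
    excluded = {}
    for line in lines:
        match = GO_LINE.search(line)
        if match:
            excluded[match.group(1)] = match.group(2).strip()
    return excluded
```

A membership test such as `"GO:0043235" in parse_exclusions(lines)` can then be used to decide whether a normalized mention falls on the exclusion list.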

For information on Annodoc, see http://spyysalo.github.io/annodoc/.