GeneMania DataWarehouse (DW)

Identifier Validation:

  1. Shared Identifier: This is the general case, covering every instance where an identifier is shared between two or more genes. It applies to any identifier listed in the ID Mapping table, namely: gene name/symbol, Ensembl transcript ID, Ensembl protein ID, Entrez gene ID, Uniprot ID (irrespective of whether it is a Uniprot ID or a Uniprot accession), RefSeq mRNA ID, RefSeq protein ID, and TAIR locus ID (for Entrez/TAIR). Possible solutions:

    1. Delete the shared gene symbol from the ID Mapping file for all of the affected genes, so it won't match any search query to begin with. This is the solution of choice for now.
    2. Delete all the affected gene entries. This would lead to the loss of useful (and reliable) information.
    3. Keep all. The GM front end and the GM engine should, when faced with such a case, ask the user which gene they are referring to.
    4. Since genes that share the same symbol are likely to be very similar in features (?), keep the symbol for one gene and delete it from all the others. Unfortunately, there is no automated method for deciding which gene to select.
  2. LeftOver Gene: The mapping between the genes from two resources (e.g. Ensembl and Entrez) is not perfect, so there will be some 'left over' genes from the second resource (e.g. Entrez) that have to be captured separately and added to the respective ID Mapping file. For the Ensembl/Entrez example, the entries in that section of the file will be similar to the earlier ones, minus any Ensembl-specific information (Ensembl Gene ID, Ensembl Transcript ID, Ensembl Protein ID). This validation step requires filtering the IDs from the two resources against each other and filling in the missing information when possible. For example, for an Ensembl/Entrez leftover gene, the gene definition line has to be taken from Entrez instead of Ensembl.

  3. Deprecated Identifier: Another reason for mismatches between two sources is the use of old identifiers. Following the Ensembl/Entrez example, Ensembl might reference Entrez gene IDs that no longer exist in Entrez. To avoid that, the ID validation modules should check that every Entrez gene ID is present in Entrez and, if one is missing, do the following:

    1. Check whether the deprecated Entrez gene has been replaced by one or more other Entrez genes.
    2. If so, replace the deprecated Entrez ID with the new one(s) in the ID Mapping table, together with the associated information.
    3. If not, drop the deprecated Entrez ID from the ID Mapping table. It should be reported in the IVReports as well.
    Currently, this issue applies to Entrez gene IDs only. However, the same built-in mismatch problem will affect any ID cross-referencing process in the future.
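
The three steps above can be sketched as follows. This is only a minimal illustration, not the actual validation module: the function name, the dictionary layout of the mapping, and the way replacement IDs are supplied are all assumptions. The sample IDs are the ones shown in the report entries on this page.

```python
def resolve_deprecated(mapping, current_ids, replacements):
    """Replace or drop deprecated Entrez gene IDs (hypothetical helper).

    mapping:      dict of Ensembl gene ID -> set of Entrez gene IDs (assumed layout)
    current_ids:  set of Entrez gene IDs that still exist in Entrez
    replacements: dict of deprecated Entrez ID -> list of replacement IDs
    Returns the cleaned mapping and a list of report rows
    (gene ID, old Entrez ID, new Entrez ID(s) or "N/A").
    """
    cleaned, report = {}, []
    for gene_id, entrez_ids in mapping.items():
        kept = set()
        for entrez_id in sorted(entrez_ids):
            if entrez_id in current_ids:
                kept.add(entrez_id)  # still valid, keep as-is
                continue
            # Steps 1 and 2: look for replacements and substitute them.
            new_ids = replacements.get(entrez_id, [])
            kept.update(new_ids)
            # Step 3: with no replacement the ID is dropped; either way it is reported.
            report.append((gene_id, entrez_id, ";".join(new_ids) or "N/A"))
        cleaned[gene_id] = kept
    return cleaned, report


mapping = {"ENSG00000034063": {"100133565"}, "ENSG00000070831": {"641992"}}
cleaned, report = resolve_deprecated(
    mapping,
    current_ids={"728688"},
    replacements={"100133565": ["728688"]},
)
for row in report:
    print(*row, sep="\t")
```

The sketch does not re-check that a replacement ID itself still exists in Entrez; a real module would probably want to.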

Identifier Validation Report (IVReport):

Source: Ensembl
Species: Hs (9606)
Validation Type: LeftOver Gene
Identifier Type: N/A
Number of Occurrences: 20164
Total Number: 36582

Source: Ensembl
Species: Hs (9606)
Validation Type: Deprecated Identifier
Identifier Type: N/A
Number of Occurrences: 58
Total Number: 21566

Source: Ensembl
Species: Hs (9606)
Validation Type: Shared Identifier
Identifier Type: Ensembl Gene Name
Number of Occurrences: 2601
Total Number: 31340

Source     Species      Gene ID
Ensembl    Hs (9606)    100008587
Ensembl    Hs (9606)    100008588
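
Assuming the listing above shows leftover Entrez genes, entries like these could be found with a plain set difference between the two resources. The function name and its inputs are hypothetical; this is a sketch of the filtering idea, not the project's code.

```python
def find_leftover_genes(mapped_entrez_ids, all_entrez_ids):
    """Return Entrez gene IDs that no Ensembl entry references.

    mapped_entrez_ids: Entrez IDs already cross-referenced from Ensembl
    all_entrez_ids:    every Entrez gene ID for the species
    Both arguments are assumed to be iterables of ID strings.
    """
    return sorted(set(all_entrez_ids) - set(mapped_entrez_ids))


leftovers = find_leftover_genes(
    mapped_entrez_ids={"728688"},
    all_entrez_ids={"728688", "100008587", "100008588"},
)
for gene_id in leftovers:
    print("Ensembl", "Hs (9606)", gene_id, sep="\t")
```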

Source     Species      Gene ID            Old Gene ID    New Gene ID
Ensembl    Hs (9606)    ENSG00000034063    100133565      728688
Ensembl    Hs (9606)    ENSG00000070831    641992         N/A

Source: Ensembl
Species: Hs (9606)
Identifier Type: Uniprot ID
Identifier: RGPD7_HUMAN
Gene ID: ENSG00000015568;ENSG00000183054

Source: Ensembl
Species: Hs (9606)
Identifier Type: Ensembl Gene Name
Identifier: AL117336.22
Gene ID: ENSG00000200097;ENSG00000209753
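
Shared-identifier entries like the two above could be detected by inverting the ID mapping, and the current solution of choice (deleting the shared identifier from all affected genes) then becomes a simple filter. The sketch below assumes the mapping is held as a dictionary of gene ID to identifier set; both function names and the extra identifier "EXAMPLE_ONLY" are made up for illustration.

```python
from collections import defaultdict


def find_shared_identifiers(id_mapping):
    """Return identifier -> sorted gene IDs, for identifiers used by 2+ genes."""
    owners = defaultdict(set)
    for gene_id, identifiers in id_mapping.items():
        for identifier in identifiers:
            owners[identifier].add(gene_id)
    return {
        identifier: sorted(genes)
        for identifier, genes in owners.items()
        if len(genes) > 1
    }


def drop_shared_identifiers(id_mapping):
    """Remove every shared identifier from all of the genes that carry it."""
    shared = set(find_shared_identifiers(id_mapping))
    return {gene_id: ids - shared for gene_id, ids in id_mapping.items()}


id_mapping = {
    "ENSG00000015568": {"RGPD7_HUMAN", "EXAMPLE_ONLY"},  # second ID is invented
    "ENSG00000183054": {"RGPD7_HUMAN"},
}
for identifier, genes in find_shared_identifiers(id_mapping).items():
    print("Identifier:", identifier)
    print("Gene ID:", ";".join(genes))
```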

GeneMania/IDValidation (last edited 2009-11-11 20:44:48 by RashadBadrawi)
