GeneMania DataWarehouse (DW)

Identifier Mapping Tables:

1. Eight tab-delimited files (can be imported into Excel or a similar package, or parsed directly). The files represent IDs from the following species: S. cerevisiae, C. elegans, A. thaliana, R. norvegicus, M. musculus, H. sapiens, D. melanogaster, E. coli.

2. For an Ensembl-based sheet, we will have the following columns: (example from human)

GMID: 3425 (required column; add a 'GM' prefix, as in GM3425, if you prefer).
Ensembl Gene ID: ENSG00000198888
Protein Coding: True/False (cannot be 'N/A')
Gene Name: ND1
Ensembl Transcript ID: ENST00000361390
Ensembl Protein ID: ENSP00000354687
Uniprot ID: P03886
Entrez Gene ID: 4535
RefSeq mRNA ID: N/A (but could have been something like NM_001088)
RefSeq Protein ID: AP_000639
Synonyms: MTND1;NAD1;ND1
Definition: NADH-ubiquinone oxidoreductase chain 1 (EC 1.6.5.3) (NADH dehydrogenase subunit 1). Source: Uniprot/SWISSPROT P03886

3. For an Entrez-based sheet, we will have the following columns: (example from thale cress, A. thaliana)

GMID: 144382
Entrez Gene ID: 2745418
Protein Coding: True
Gene Name: AT2G01175
Uniprot ID: Q3EC92
TAIR Locus ID: AT2G01175
RefSeq mRNA ID: NM_201659
RefSeq Protein ID: NP_973388
Synonyms: N/A
Definition: hypothetical protein

4. The first row in each file is the header row.

5. In general, when there is no data available for a specific cell in a particular column, the term 'N/A' (or a similar standard term) will be used instead. In cases when there is more than one entry per cell, the entries will be separated by a ';'.

6. As a general rule for any importing system, it is better to treat IDs as alphanumeric rather than numeric, since they can be either, depending on the source database referenced.

7. The GMID is an internal identifier that is unique per species, and is stable within a build/release of the IDMapping. It is not stable between different releases and is not unique across different species. The DW itself does not use or reference this ID.

8. The source of the ID mapping information will be the first resource listed in each file. In other words, the reference point is Ensembl for all species, except for A. thaliana and E. coli, where the reference point is Entrez. Exceptions:

  1. Synonyms are accumulated from both resources, then filtered to provide a unique list of synonyms per gene. If the gene name from the secondary source is not listed as one of the synonyms of the primary source, it is added to the filtered list as a synonym as well.
  2. RefSeq info always comes from Entrez.
  3. LeftOver entries, as described in the ID validation process.

9. Synonyms are case-sensitive, so (example from human) ChM1L (for Ensembl ENSG00000000005) and CHM1L (for the matching Entrez 64102) are listed as two different synonyms.

10. Note that the version number for an identifier (RefSeq mRNA and protein IDs) is ignored/truncated. So, NM_001088.1 and NM_001088.2 will be listed as NM_001088 (once). This approach is followed in some other bioinformatics tools when version numbers are irrelevant.

11. Ensembl and Entrez are the resources used for all species, except for A. thaliana (where the resources are Entrez and TAIR) and E. coli (just Entrez).

12. For the Ensembl-based mapping files, the Uniprot ID column holds the curated Uniprot/Swissprot IDs, and not the Uniprot/TrEMBL IDs. It includes both the Uniprot ID (aka entry name) and the Uniprot primary accession for a protein. For the Entrez-based mapping files, only the Uniprot primary accession is offered, but it covers both Uniprot/Swissprot and Uniprot/TrEMBL.

13. For the mouse ID mapping table, there is an additional column holding MGI IDs.
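
The parsing conventions in items 4, 5, 6, 8.1, 9 and 10 can be sketched in Python as follows. This is only an illustration, not the code used to build or load the mapping files. The helper names (split_cell, strip_version, merge_synonyms, read_mapping) are hypothetical, the header strings in the sample are assumed to match the column names listed above, and 'N/A' is assumed to be the exact placeholder used.

```python
import csv
import io
import re

MISSING = "N/A"  # placeholder from item 5 (assumed exact; a file may use a similar term)

# Trailing version suffix on a RefSeq ID, e.g. ".2" in "NM_001088.2" (item 10)
_VERSION = re.compile(r"\.\d+$")


def split_cell(cell):
    """Return the entries of one cell as a list of strings; empty list for N/A."""
    cell = cell.strip()
    if not cell or cell == MISSING:
        return []
    return [part.strip() for part in cell.split(";") if part.strip()]


def strip_version(refseq_id):
    """Drop the version number, so NM_001088.1 and NM_001088.2 both give NM_001088."""
    return _VERSION.sub("", refseq_id)


def merge_synonyms(primary, secondary, secondary_gene_name=None):
    """Case-sensitive union that keeps first-seen order (items 8.1 and 9)."""
    candidates = list(primary) + list(secondary)
    if secondary_gene_name:
        candidates.append(secondary_gene_name)
    merged, seen = [], set()
    for synonym in candidates:
        if synonym not in seen:  # 'ChM1L' and 'CHM1L' stay distinct
            seen.add(synonym)
            merged.append(synonym)
    return merged


def read_mapping(handle):
    """Yield one dict per data row; the first row supplies the column names (item 4).

    Every value is kept as a list of strings, never converted to int (item 6).
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            column: split_cell(value or "")
            for column, value in row.items()
            if column is not None  # ignore stray extra fields on a row
        }


# Small in-memory sample built from the human example above.
sample = (
    "GMID\tEnsembl Gene ID\tGene Name\tRefSeq mRNA ID\tSynonyms\n"
    "3425\tENSG00000198888\tND1\tN/A\tMTND1;NAD1;ND1\n"
)
for row in read_mapping(io.StringIO(sample)):
    print(row["Gene Name"], row["Synonyms"])
```

Keeping every cell as a list of strings means single-valued and multi-valued columns are handled the same way, and an 'N/A' cell simply becomes an empty list.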

Bonus ID Mapping

The term 'bonus' ID mapping refers to mapping tables that represent the opposite view of the ID mapping tables mentioned above (and that were part of the GM 'requirements'). The tables are based on a Resource2_Resource1 mapping, where the info is derived from Resource2 (with the exceptions mentioned earlier). For example, for an Entrez_Ensembl mapping, the table will have the following columns: Entrez Gene ID, Protein Coding, Entrez Gene Name, Uniprot ID, Ensembl Gene ID, RefSeq mRNA ID, RefSeq Protein ID, Synonyms, Definition. In the case of a TAIR_Entrez mapping, the table will have the following columns: TAIR Locus ID, Protein Coding, TAIR Locus Name, Uniprot ID, Entrez Gene ID, RefSeq mRNA ID, RefSeq Protein ID, Synonyms, Definition. The same procedure is followed for these tables, except that the generation of GMIDs is disabled. These mappings can be used as a general reference, and are one of the side benefits of the flexible design adopted.

Linking IDs to Resources:

This section describes the hyperlinking of identifiers, from the ID mapping files, to their external resources. The IDs can be plugged into these URLs, as follows:

1. Ensembl

Ensembl Gene ID: http://www.ensembl.org/SpeciesName/geneview?gene=EnsemblGeneID
Ensembl Transcript ID: http://www.ensembl.org/SpeciesName/transview?transcript=TranscriptID
Ensembl Protein ID: http://www.ensembl.org/SpeciesName/protview?peptide=ProteinID

The SpeciesName can be any one of the following:

Hs:Homo_sapiens
Mm:Mus_musculus
Rn:Rattus_norvegicus
Dm:Drosophila_melanogaster
Sc:Saccharomyces_cerevisiae
Ce:Caenorhabditis_elegans

Examples:

http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000000003
http://www.ensembl.org/Homo_sapiens/transview?transcript=ENST00000342929
http://www.ensembl.org/Mus_musculus/protview?peptide=ENSMUSP00000045693

2. Entrez

Entrez Gene ID: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=EntrezGeneID
RefSeq mRNA: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=RefSeq_mRNA_ID
RefSeq Protein: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=RefSeqProteinID

Examples:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=4232
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_002402
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_002393

3. TAIR

TAIR Locus ID: http://arabidopsis.org/servlets/TairObject?type=locus&name=TAIR_LocusID

Example:

http://arabidopsis.org/servlets/TairObject?type=locus&name=AT2G01175

4. Uniprot

Uniprot Primary Accession: http://www.uniprot.org/uniprot/UniprotAccession
Uniprot Entry name (aka Uniprot ID): http://www.uniprot.org/uniprot/UniprotID

Examples:

http://www.uniprot.org/uniprot/Q5EB52
http://www.uniprot.org/uniprot/MEST_HUMAN (redirected to the primary accession URL)

5. Ensembl gene names, Entrez gene names, and TAIR locus names are all linked to the pages of the corresponding Ensembl gene ID, Entrez gene ID, and TAIR locus ID, respectively.

6. We currently save both Uniprot primary accessions and Uniprot IDs in the Ensembl-based ID mapping files, in the same column. However, linking to Uniprot by the Uniprot primary accession is better than linking by the Uniprot ID, since the former is a stable and unique identifier for a Uniprot entry, while the latter might change between different Uniprot releases.
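
The URL patterns above can be filled in programmatically. The following Python sketch copies the templates and species names from this section; the function names (build_url, preferred_uniprot) and the dictionary keys are hypothetical. The Uniprot helper relies on the assumption that entry names contain an underscore (as in MEST_HUMAN) while primary accessions do not, and it prefers the accession for the reason given in item 6. Whether the URLs still resolve is not checked here.

```python
from urllib.parse import quote

# Species codes and names as listed in the Ensembl subsection.
ENSEMBL_SPECIES = {
    "Hs": "Homo_sapiens",
    "Mm": "Mus_musculus",
    "Rn": "Rattus_norvegicus",
    "Dm": "Drosophila_melanogaster",
    "Sc": "Saccharomyces_cerevisiae",
    "Ce": "Caenorhabditis_elegans",
}

# URL templates copied from this section; the keys are illustrative names only.
URL_TEMPLATES = {
    "ensembl_gene": "http://www.ensembl.org/{species}/geneview?gene={id}",
    "ensembl_transcript": "http://www.ensembl.org/{species}/transview?transcript={id}",
    "ensembl_protein": "http://www.ensembl.org/{species}/protview?peptide={id}",
    "entrez_gene": "http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term={id}",
    "refseq": "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val={id}",
    "tair_locus": "http://arabidopsis.org/servlets/TairObject?type=locus&name={id}",
    "uniprot": "http://www.uniprot.org/uniprot/{id}",
}


def build_url(kind, identifier, species_code=None):
    """Fill the template for `kind` with a single identifier.

    Ensembl links also need a species code such as 'Hs' or 'Mm'.
    """
    template = URL_TEMPLATES[kind]
    species = ""
    if kind.startswith("ensembl_"):
        if species_code not in ENSEMBL_SPECIES:
            raise ValueError(f"unknown Ensembl species code: {species_code!r}")
        species = ENSEMBL_SPECIES[species_code]
    # Percent-encode the ID so an unexpected character cannot break the query string.
    return template.format(species=species, id=quote(identifier, safe=""))


def preferred_uniprot(entries):
    """Pick a primary accession over an entry name when a cell holds both.

    Heuristic (assumption): entry names contain '_', accessions do not.
    Returns None for an empty list.
    """
    for entry in entries:
        if "_" not in entry:
            return entry
    return entries[0] if entries else None


print(build_url("ensembl_gene", "ENSG00000000003", "Hs"))
print(build_url("uniprot", preferred_uniprot(["MEST_HUMAN", "Q5EB52"])))
```

Gene names and locus names (item 5) would reuse the URL of the corresponding gene or locus ID, so they need no template of their own.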

GeneMania/IDMapping (last edited 2009-12-11 19:45:34 by RashadBadrawi)
