GeneMania DataWarehouse (DW) Related Documentation
DW Resource Loading: Entrez:
- The Entrez database, by NCBI, is released in two main formats.
- Flat file format: This consists of 11 main flat files in a tab-delimited format. The files are:
gene2accession, gene2go, gene2pubmed, gene2refseq, gene2sts, gene2unigene, gene_history, gene_info, mim2gene, gene_refseq_uniprotkb_collab, interactions.
Most of these files are useful for cross-referencing and matching identifiers between different databases within NCBI and elsewhere. For example, they match a gene ID to the appropriate RNA/protein sequence IDs (gene2accession), to published journal references (gene2pubmed), or to associated human genetic diseases (mim2gene). Some of the files carry richer content, such as gene2go (matching genes with GO terms), gene_info, and interactions (which lists interactions from BIND, BioGRID, EcoCyc, and HPRD). The local Entrez mirror is currently based on this format. The table and column names were purposely matched to the file and header names for ease of use, except where this would cause technical problems, such as dots or spaces in column names. Note that these tables contain no null values: the source files usually use a hyphen '-' (and sometimes a '?') as a placeholder instead. The advantages of this format are its ease of use and the fact that the files include information for all species available from NCBI. Local views for the subsets of interest can be created as well.
- ASN.1 Binary format: This compressed format can be converted to XML using a program distributed with the Entrez release. The structure of the resulting (bulky) XML is defined by NCBI DTDs rather than by an XML schema. The data is mainly split on a per-species (or per-class) basis, but there are all-inclusive files as well. Its content is comparable to the gene_info table listed above.
- There are other released files that are of less significance or that represent different subsets of the data listed above. For example, the interactions for HIV are released as a separate file.
For more information, visit the Entrez FTP site at: ftp://ftp.ncbi.nlm.nih.gov/gene/
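The placeholder convention described above ('-' or '?' in place of a null) can be handled while reading a flat file. The following is a minimal sketch only, not the loader used for the local mirror: the function name read_flat_file, the four-column layout, and the sample rows are assumptions made for illustration, and the real files have more columns.

```python
import csv
import io

# Placeholders the source files use instead of nulls (assumed to fill the whole field).
MISSING_MARKERS = {"-", "?"}


def read_flat_file(handle, columns):
    """Yield one dict per data row, mapping placeholder fields to None."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and lines starting with '#'
        values = [None if field in MISSING_MARKERS else field for field in row]
        yield dict(zip(columns, values))


# Hypothetical sample data; the column layout is illustrative, not the real one.
sample = io.StringIO(
    "#tax_id\tGeneID\tSymbol\tSynonyms\n"
    "9606\t4535\tND1\t-\n"
    "9606\t99999\tGENE-X\tALIAS1|ALIAS2\n"
)
columns = ["tax_id", "GeneID", "Symbol", "Synonyms"]
rows = list(read_flat_file(sample, columns))
```

Only a field that consists entirely of a placeholder becomes None, so a value that merely contains a hyphen (such as the made-up symbol GENE-X) is kept as-is.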
Identifier Mapping Tables:
1. Six tab-delimited files (which can be imported into Excel or a similar package, or parsed directly). Each file belongs to one of the species of interest, except for A. thaliana. Files will be named after their species.
2. Each sheet will have the following columns: (example from human)
Ensembl Gene ID: ENSG00000198888
Ensembl Gene Symbol: ND1
Entrez Gene ID: 4535
Ensembl Description: etc...
3. First row in each file will be the headers row.
4. In general, when there is no data available for a specific cell in a particular column, the term 'NA' (or a similar standard term) will be used instead. In cases when there is more than one entry per cell, the entries will be separated by a ';'.
5. As a general rule for any importing system, it's better to treat IDs as alphanumeric rather than numeric, since they can be either, depending on the source database referenced.
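The conventions in the list above (header row first, 'NA' for empty cells, ';' between multiple entries, IDs kept as text) could be applied by an importer roughly as follows. This is a sketch under stated assumptions: the helper names split_cell and read_mapping are hypothetical, the sample uses only three of the columns, and the second sample row is made up.

```python
import csv
import io


def split_cell(value, missing="NA", separator=";"):
    """Return the entries of one cell as a list of strings (empty if no data)."""
    if value is None or value.strip() in ("", missing):
        return []
    return [part.strip() for part in value.split(separator) if part.strip()]


def read_mapping(handle):
    """Yield one dict per row; every cell becomes a list of string IDs."""
    reader = csv.DictReader(handle, delimiter="\t")  # first row is the header row
    for row in reader:
        yield {column: split_cell(value) for column, value in row.items()}


# Hypothetical sample; the second row is invented to show 'NA' and ';' handling.
sample = io.StringIO(
    "Ensembl Gene ID\tEnsembl Gene Symbol\tEntrez Gene ID\n"
    "ENSG00000198888\tND1\t4535\n"
    "ENSG_EXAMPLE\tNA\t111;222\n"
)
records = list(read_mapping(sample))
```

Every value stays a string, so an ID such as 4535 is never converted to a number, and a cell is always a list, whether it holds zero, one, or several entries.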
More DW Documents:
[:../GeneManiaDataCollection: GeneMANIA Data Collection]