
Enrichment Map User Guide

Overview

The Enrichment Map Cytoscape Plugin allows you to visualize the results of gene-set enrichment as a network. It will operate on any generic enrichment results as well as specifically on Gene Set Enrichment Analysis (GSEA) results. Nodes represent gene-sets and edges represent mutual overlap; in this way, highly redundant gene-sets are grouped together as clusters, dramatically improving the capability to navigate and interpret enrichment results.

Gene-set enrichment is a data analysis technique taking as input

  1. a (ranked) gene list, from a genomic experiment

  2. gene-sets, grouping genes on the basis of a-priori knowledge (e.g. Gene Ontology) or experimental data (e.g. co-expression modules)

and generating as output the list of enriched gene-sets, i.e. the sets that best summarize the gene list. It is common to refer to gene-set enrichment as functional enrichment because functional categories (e.g. Gene Ontology) are commonly used as gene-sets.

[Figure: EM_example_2.png]


Installation

The Enrichment Map Plugin requires Cytoscape version 2.6.x. If you don't have Cytoscape, or you have an older version (2.5 or older), please download the latest release from http://www.cytoscape.org/ and install it on your computer.


Quick Start Guide

Creating an Enrichment Map

You have a few different options:

The only difference between the above modes is the structure of the enrichment table(s). In either case, to use the plugin you will need the following files:

(*) GSEA saves the enrichment table with a .xls extension; however, these are not true Excel files but tab-separated text files with a modified extension. Enrichment Map does not work with "true" Excel .xls files.
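
The distinction above can be checked before loading a file. The following is a minimal sketch, not part of Enrichment Map: it assumes that a binary Excel .xls file starts with the standard OLE2 container signature and that a GSEA-style table has at least one tab in its first line. The function names are hypothetical.

```python
# Signature found at the start of binary Excel .xls files (OLE2 container).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_tab_separated_text(first_bytes: bytes) -> bool:
    """Return True if the leading bytes look like tab-separated text."""
    if first_bytes.startswith(OLE2_MAGIC):
        return False  # a "true" Excel file, which Enrichment Map cannot read
    first_line = first_bytes.splitlines()[0] if first_bytes else b""
    return b"\t" in first_line


def check_file(path: str) -> bool:
    """Read the first few KB of a file and apply the check above."""
    with open(path, "rb") as handle:
        return looks_like_tab_separated_text(handle.read(4096))
```

If `check_file` returns False for a results table, re-exporting it from the spreadsheet program as tab-delimited text is the likely fix.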

If your enrichment results were generated from GSEA, you will just have to pick the right files from your results folder. If you have generated the enrichment results using another method, you will have to go to the Full User Guide, File Format section, and make sure that the file format complies with Enrichment Map requirements.

You can use the parameter defaults. For a more careful choice of the parameter settings, please go to the Full User Guide, Tips on Parameter Choice.

Graphical Mapping of Enrichment

Exploring the Enrichment Map

Advanced tips


Full User Guide

File Formats

Gene sets file (GMT file)

Expression Data file (GCT, TXT or RNK file) [OPTIONAL]

GCT (GSEA file type)

RNK (GSEA file type)

Additional Information on GSEA File Formats can be found here

TXT

Enrichment Results files

GSEA result files

Additional Information on GSEA File Formats can be found here

Generic results files

Notes:

  1. description and FDR columns can have empty or NA values, but the column and the column header must exist
  2. if no value is provided under phenotype, Enrichment Map will assume there is only one phenotype, and will map enrichment p-values to red

See here for examples

DAVID Enrichment Result File

Notes:

  1. The DAVID option expects a file as generated by the DAVID web interface.
  2. DISCLAIMER: In the absence of a gmt file, gene sets are constructed from the Genes field in the DAVID output. This only considers the genes entered in your query set and not the genes in your background set. This will drastically affect the amount of overlap you see in the resulting Enrichment Map.

See here for a tutorial on how to generate DAVID output files for Enrichment Maps

BiNGO Enrichment Result File

Notes:

  1. The BiNGO option expects a file as generated by the BiNGO Cytoscape Plugin.
  2. DISCLAIMER: In the absence of a gmt file, gene sets are constructed from the Genes field in the BiNGO output. This only considers the genes entered in your query set and not the genes in your background set. This will drastically affect the amount of overlap you see in the resulting Enrichment Map.

See here for a tutorial on how to generate BiNGO output files for Enrichment Maps

RPT files

                producer_class  xtools.gsea.Gsea
                producer_timestamp      1367261057110
                param   collapse        false
                param   cls     WHOLE_PATH_TO_FILE/EM_EstrogenMCF7_TestData/ES_NT.cls#ES24_versus_NT24
                param   plot_top_x      20
                param   norm    meandiv
                param   save_rnd_lists  false
                param   median  false
                param   num     100
                param   scoring_scheme  weighted
                param   make_sets       true
                param   mode    Max_probe
                param   gmx     WHOLE_PATH_TO_FILE/EM_EstrogenMCF7_TestData/Human_GO_AllPathways_no_GO_iea_April_15_2013_symbol.gmt
                param   gui     false
                param   metric  Signal2Noise
                param   rpt_label       ES24vsNT24
                param   help    false
                param   order   descending
                param   out     WHOLE_PATH_TO_FILE/EM_EstrogenMCF7_TestData
                param   permute gene_set
                param   rnd_type        no_balance
                param   set_min 15
                param   include_only_symbols    true
                param   sort    real
                param   rnd_seed        timestamp
                param   nperm   1000
                param   zip_report      false
                param   set_max 500
                param   res     WHOLE_PATH_TO_FILE/EM_EstrogenMCF7_TestData/MCF7_ExprMx_v2_names.gct

                file    WHOLE_PATH_TO_FILE/EM_EstrogenMCF7_TestData/ES24vsNT24.Gsea.1367261057110/index.html

* Parameters used by EM:

  1. producer_class - can be xtools.gsea.Gsea or xtools.gsea.GseaPreranked

    • if xtools.gsea.Gsea:

      • get expression file from res parameter in rpt
      • get phenotype information from cls parameter in rpt
    • if xtools.gsea.GseaPreranked:

      • there is no expression file; the rank file given by the rnk parameter in rpt is used in its place
      • set phenotypes to na_pos and na_neg.
      • NOTE: to make using an rpt file easier for GSEAPreranked, there are two additional parameters that you can add to your rpt file manually and that the rpt function will recognize.

To do less manual work while creating Enrichment Maps from pre-ranked GSEA, add the following optional parameters:
        param(--tab--)phenotypes(--tab--){phenotype1}_versus_{phenotype2}
        param(--tab--)expressionMatrix(--tab--){path_to_GCT_or_TXT_formated_expression_matrix}
  2. producer_timestamp - needed to find the directory with the results files
  3. cls - path to class/phenotype file with information regarding the phenotypes:
    • path/classfilename.cls#phenotype1_versus_phenotype2
    • EM gets the path to the class file and also pulls phenotype1 and phenotype2 from the above field
  4. gmx - path to gmt file
  5. rpt_label - name of analysis and name of directory that GSEA creates to hold the results. Used when constructing the path to the results directory.
  6. out - path to directory where GSEA will put the output directory. Used when constructing the path to the results directory.
  7. res - path to expression file.

The rpt function searches for the following results files:

        enrichment File 1 --> {out}(--File.separator--){rpt_label} + "." + {producer_class} + "." + {producer_timestamp}(--File.separator--) "gsea_report_for_" + phenotype1 + "_" + timestamp + ".xls"
        enrichment File 2 --> {out}(--File.separator--){rpt_label} + "." + {producer_class} + "." + {producer_timestamp}(--File.separator--) "gsea_report_for_" + phenotype2 + "_" + timestamp + ".xls"
        ranks File --> {out}(--File.separator--){rpt_label} + "." + {producer_class} + "." + {producer_timestamp}(--File.separator--) "ranked_gene_list_" + phenotype1 + "_versus_" + phenotype2 +"_" + timestamp + ".xls";
  • If the enrichments and rank files are not found in the above path then EM replaces the out directory with the path to the given rpt file and tries again.
  • If you would like to create your own rpt file for your own analysis pipeline, you can supply your own values for the parameters listed above.
  • If your analysis only creates one enrichment file you can make a copy of enrichment file 1 in the path of enrichment file 2 with no consequences for EM running.
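
The path construction above can be sketched in a few lines. This is an illustration only, not the plugin's code, and it covers just the xtools.gsea.Gsea case where a cls parameter is present. It assumes that fields in a real rpt file are separated by tabs, and, judging from the sample `file` line (`ES24vsNT24.Gsea.1367261057110`), that only the last component of producer_class appears in the directory name. The function names are hypothetical.

```python
import os


def parse_rpt(text):
    """Collect rpt values; 'param' lines carry their key in the second field."""
    values = {}
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if fields[0] == "param" and len(fields) == 3:
            values[fields[1]] = fields[2]
        elif len(fields) == 2:
            values[fields[0]] = fields[1]
    return values


def expected_result_files(rpt, sep=os.sep):
    """Build the three result paths described in this section."""
    # cls has the form path/classfilename.cls#phenotype1_versus_phenotype2
    comparison = rpt["cls"].split("#", 1)[1]
    phenotype1, phenotype2 = comparison.split("_versus_", 1)
    # Assumption: the directory uses the short class name (e.g. "Gsea").
    producer = rpt["producer_class"].split(".")[-1]
    stamp = rpt["producer_timestamp"]
    results_dir = sep.join([rpt["out"], f"{rpt['rpt_label']}.{producer}.{stamp}"])
    return {
        "enrichment_1": sep.join(
            [results_dir, f"gsea_report_for_{phenotype1}_{stamp}.xls"]),
        "enrichment_2": sep.join(
            [results_dir, f"gsea_report_for_{phenotype2}_{stamp}.xls"]),
        "ranks": sep.join(
            [results_dir,
             f"ranked_gene_list_{phenotype1}_versus_{phenotype2}_{stamp}.xls"]),
    }
```

A caller that does not find these files could retry with `out` replaced by the directory containing the rpt file, mirroring the fallback described in the first bullet above.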

EDB File (GSEA file type)

  • Contained in the GSEA results folder is an edb folder. In the edb folder there are the following files:
    1. results.edb
    2. gene_sets.gmt
    3. classfile.cls [Only in a GSEA analysis. Not in a GSEAPreranked analysis]
    4. rankfile.rnk
  • If you specify the results.edb file in any of the Fields under the dataset tab (Expression, Enrichment Results 1 or Enrichment Results 2) the gmt and enrichment files fields will be automatically populated.
  • If you want to associate an expression file with the analysis it needs to be loaded manually as described here.

  * NOTE: The gene_sets.gmt file contained in the edb directory is filtered according to the expression file.  If you are doing a two-dataset analysis where the expression files are from different platforms or contain different sets of genes, the edb gene_sets.gmt file cannot be used, as genes found in one analysis might be lacking in the other.  In this case use the original gmt file (prior to GSEA filtering) and EM will filter the gene sets separately for each dataset.

Advanced Settings - Additional Files

Parameters

Node (Gene Set inclusion) Parameters

  • Node specific parameters filter the gene sets included in the enrichment map
  • For a gene set to be included in the enrichment map it needs to pass both p-value and q-value thresholds.

P-value

  • All gene sets with a p-value at or below the specified threshold are included in the map.

FDR Q-value

  • All gene sets with a q-value at or below the specified threshold are included in the map.
  • The FDR Q-value that EM uses for filtering gene sets depends on the type of analysis:
    • For GSEA the FDR Q-value used is the 8th column in the gsea_results file and is called "FDR q-val".
    • For Generic the FDR Q-value used is the 4th column in the generic results file.
    • For DAVID the FDR Q-value used is the 12th column in the DAVID results file and is called "Benjamini".
    • For BiNGO the FDR Q-value used is the 3rd column in the BiNGO results file and is called "corr p-value".
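
As a minimal sketch of the node filter described above (not the plugin's code), a gene set is kept only when both values are at or below their cutoffs. The function name and the input layout are hypothetical; the default cutoffs of 0.05 and 0.25 are the ones listed under Customizing Defaults with Cytoscape Properties.

```python
def filter_gene_sets(results, p_cutoff=0.05, q_cutoff=0.25):
    """Return the names of gene sets that pass both thresholds.

    `results` maps a gene set name to a (p_value, q_value) pair.
    """
    return [
        name
        for name, (p_value, q_value) in results.items()
        if p_value <= p_cutoff and q_value <= q_cutoff
    ]


example = {
    "SET_A": (0.001, 0.02),  # passes both thresholds
    "SET_B": (0.010, 0.40),  # fails the q-value threshold
    "SET_C": (0.200, 0.10),  # fails the p-value threshold
}
kept = filter_gene_sets(example)
```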

Edge (Gene Set relationship) Parameters

  • An edge represents the degree of gene overlap that exists between two gene sets, A and B.
  • Edge specific parameters control the number of edges that are created in the enrichment map.
  • Only one coefficient type can be chosen to filter the edges

Jaccard Coefficient

                Jaccard Coefficient = [size of (A intersect B)] / [size of (A union B)]

Overlap Coefficient

                Overlap Coefficient = [size of (A intersect B)] / [minimum(size of A, size of B)]

Combined Coefficient

  • The combined coefficient is a merged version of the Jaccard and overlap coefficients.
  • The combined constant allows the user to modulate reciprocally the weights associated with the Jaccard and overlap coefficients.
  • When k = 0.5 the combined coefficient is the average of the Jaccard and overlap coefficients.

                Jaccard Coefficient = [size of (A intersect B)] / [size of (A union B)]
                Overlap Coefficient = [size of (A intersect B)] / [minimum(size of A, size of B)]
                
                Combined Constant = k

                Combined Coefficient = (k * Overlap) + ((1-k) * Jaccard)
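
The three coefficients can be written directly from the formulas above. This is a small sketch using Python sets, not the plugin's implementation; returning 0.0 for empty inputs is an assumption made here to avoid division by zero.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def combined(a, b, k=0.5):
    """Weighted mix: k * Overlap + (1 - k) * Jaccard."""
    return k * overlap(a, b) + (1 - k) * jaccard(a, b)


set_a = {"G1", "G2", "G3", "G4"}
set_b = {"G3", "G4", "G5", "G6", "G7", "G8"}
# intersection = 2, union = 8, smaller set = 4
# jaccard = 0.25, overlap = 0.5, combined (k = 0.5) = 0.375
```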

Tips on Parameter Choice

P-value and FDR Thresholds

GSEA can be used with two different significance estimation settings: gene-set permutation and phenotype permutation. Gene-set permutation was used for Enrichment Map application examples.

Gene-set Permutation

Here are different sets of thresholds you may consider for gene-set permutation:

  • Very permissive:
    • p-value < 0.05

    • FDR < 0.25

  • Moderately permissive:
    • p-value < 0.01

    • FDR < 0.1

  • Moderately conservative:
    • p-value < 0.005

    • FDR < 0.075

  • Conservative:
    • p-value < 0.001

    • FDR < 0.05

For high quality, high coverage transcriptomic data, the number of enriched terms at the very conservative threshold is usually 100-250 when using gene-set permutation.

Phenotype Permutation

  • Recommended:
    • p-value < 0.05

    • FDR < 0.25

In general, we recommend using permissive thresholds only if you're having a hard time finding any enriched terms.

Jaccard vs. Overlap Coefficient

  • The Overlap Coefficient is recommended when relations are expected to occur between large-size and small-size gene-sets, as in the case of the Gene Ontology.
  • The Jaccard Coefficient is recommended in the opposite case.
  • When the gene-sets are about the same size, Jaccard is about half of the Overlap Coefficient for gene-set pairs with a small intersection, whereas it is about the same as the Overlap Coefficient for gene-sets with large intersections.
  • When using the Overlap Coefficient and the generated map has several large gene-sets excessively connected to many other gene-sets, we recommend switching to the Jaccard Coefficient.
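
The relation between the two coefficients for equally sized gene-sets follows from the definitions. As a short worked derivation, assume (for illustration) that both sets have n genes and share i of them:

```latex
\begin{align*}
\text{Overlap} &= \frac{|A \cap B|}{\min(|A|, |B|)} = \frac{i}{n}, &
\text{Jaccard} &= \frac{|A \cap B|}{|A \cup B|} = \frac{i}{2n - i}, \\
\frac{\text{Jaccard}}{\text{Overlap}} &= \frac{n}{2n - i}
  \;\approx\; \tfrac{1}{2} \ \text{for } i \ll n, &
\frac{\text{Jaccard}}{\text{Overlap}} &\to 1 \ \text{as } i \to n.
\end{align*}
```

For example, with n = 100 and i = 10 the Overlap Coefficient is 0.10 and the Jaccard Coefficient is 10/190, about 0.053; with i = 90 they are 0.90 and 90/110, about 0.82.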

Overlap Thresholds

  • 0.5 is moderately conservative, and is recommended for most analyses.
  • 0.3 is permissive, and might result in a messier map.

Jaccard Thresholds

  • 0.5 is very conservative
  • 0.25 is moderately conservative

Interfaces

The Input Panel

[Screenshot: Enrichment Map Input Panel]

  1. Analysis Type

    • There are three distinct types of Enrichment Map analyses: GSEA, Generic and DAVID.
      • GSEA - takes as inputs the output files created in a GSEA analysis. File formats are specific to files created by GSEA. The main difference between this and Generic is the number and format of the enrichment results files. A GSEA analysis always has two enrichment results files, one for each of the phenotypes compared.

      • Generic - takes as inputs the same file formats as a GSEA analysis, except that the enrichment results file has a different format and there is only one enrichment file (see Generic results files).

      • DAVID - (implemented in v1.0 and higher) has no gmt or expression file requirement and takes as input an enrichment result file as produced by DAVID (see DAVID Enrichment Result File).

  2. Genesets - path to gmt file describing genesets. User can browse hard drive to find file by pressing ... button.

  3. Dataset 1 - User can specify expression and enrichment files or, alternatively, an rpt file which will populate all the fields in the genesets, dataset # and advanced sections.

  4. Advanced - Initially collapsed (expand by clicking on arrow head directly next to Advanced), users have the option of modifying the phenotype labels or loading gene rank files.

  5. Parameters - User can specify p-value, FDR and overlap/Jaccard cutoffs (see Tips on Parameter Choice).

  6. Actions - The user has three choices, Reset (clears input panel), Close (closes input panel), and Build Enrichment map (takes all parameters in panel and builds an Enrichment map)

The Data Panel

  • The bottom (south) panel.

Expression Viewer

  • There are two different types of Expression Viewers, each is represented as a separate tab in data panel:
    • EM Overlap - shows the expression of genes in the overlap (intersection) of all the genesets selected
    • EM Geneset - shows the expression of genes of the union of all the genesets selected.
  • Features of the Expression Viewer include:
    • Normalization
      • Data as is - represents the data as it was loaded
      • Row Normalize Data - for each value in a row of expression the mean of the row is subtracted followed by division by the row's standard deviation.
      • Log Transform Data - takes the log of each expression value
    • Sorting
      • Hierarchical cluster - as computed using Pearson correlation of the entire expression set.
      • If rank files for the data sets are provided at input they will show up as 'Dataset 1 Ranking' and 'Dataset 2 Ranking' and by selecting them the user will be able to sort the expression accordingly
        • if an expression value does not have a corresponding rank in the ranking file its expression does not appear in the heatmap.
      • Add Ranking ... - allows the user to upload an additional rank file (in the appropriate format, as outlined in the rank file descriptions). There is no limit on the number of rank files that can be uploaded. The user is required to give a name to each rank file.

    • Save Expression Set
      • The user can save the subset of expression values currently being viewed in the expression viewer as txt file.
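
The two normalization options can be sketched as follows. This is an illustration rather than the plugin's code: the text does not say whether the sample or population standard deviation is used, nor which logarithm base, so the sample standard deviation and base 2 are assumptions here, as is returning zeros for a constant row.

```python
import math
import statistics


def row_normalize(row):
    """Subtract the row mean, then divide by the row's standard deviation."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)  # assumption: sample standard deviation
    if sd == 0:
        return [0.0 for _ in row]  # assumption: constant rows map to zeros
    return [(value - mean) / sd for value in row]


def log_transform(row):
    """Take the log of each value (assumption: base 2, positive values only)."""
    return [math.log2(value) for value in row]


normalized = row_normalize([1.0, 2.0, 3.0])
logged = log_transform([1.0, 8.0])
```

Non-positive expression values would raise an error in `log_transform`; how the viewer treats them is not stated in this guide.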

Node Attributes

  • For each Enrichment map created the following attributes are created for each node:
    • EM#_Name - the gene set name
    • EM#_Formatted_name - a wrapped version of the gene set name so it is easy to visualize.

      • Note: This is the default label of the node, but some users find it easier to arrange the network when the name is not wrapped. In that case, the user can switch the label mapping in the VizMapper from EM#_formatted_name to EM#_name.

    • EM#_GS_DESCR - the gene set description (as specified in the second column of the gmt file)
    • EM#_Genes - the list of genes that are part of this gene set.
  • Additionally there are attributes created for each dataset (a different set for each dataset if using two dataset mode):
    • EM#_pvalue_dataset(1 or 2) - Gene set p-value, as specified in GSEA enrichment result file.
    • EM#_qvalue_dataset(1 or 2) - Gene set q-value, as specified in GSEA enrichment result file.
    • EM#_Colouring_dataset(1 or 2) - Enrichment map parameter calculated using the formula 1-pvalue multiplied by the sign of the ES score (if using GSEA mode) or the phenotype (if using the Generic mode)
    • GSEA specific attributes (these attributes are not populated when creating an enrichment map using the generic mode)
      • EM#_ES_dataset(1 or 2) - Enrichment score, as specified in GSEA enrichment result file.
      • EM#_NS_dataset(1 or 2) - Normalized Enrichment score, as specified in GSEA enrichment result file.
      • EM#_fwer_dataset(1 or 2) - Family-wise error score, as specified in GSEA enrichment result file.
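
The colouring attribute described above can be illustrated with a short sketch. It assumes the direction is supplied as a signed number (the ES in GSEA mode, or a signed phenotype value in Generic mode) and that a value of exactly zero is treated as positive; both points are assumptions, and the function name is hypothetical.

```python
def colouring_value(p_value, direction):
    """Return (1 - p-value) carrying the sign of the ES or phenotype."""
    sign = 1.0 if direction >= 0 else -1.0  # assumption: zero counts as positive
    return (1.0 - p_value) * sign


up = colouring_value(0.01, 0.62)    # strongly significant, positive ES
down = colouring_value(0.20, -0.40)  # weaker, negative ES
```

Values close to +1 or -1 therefore correspond to the most significant gene sets in each direction, and values near 0 to the least significant.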

Edge Attributes

  • For each Enrichment map created the following attributes are created for each edge:
    • EM#_Overlap_size - the number of genes associated with the overlap of the two genesets that this edge connects.
    • EM#_Overlap_genes - the names of the genes that are associated with the overlap of the two genesets that this edge connects.
    • EM#_similarity_coefficient - the calculated coefficient for this edge.

The Results Panel

  • The right (east) panel

Parameters pane

  • Reference panel containing legends, slider bars for the user to modify p-value and q-value cut-offs, and the parameters used for the analysis

PostAnalysis Input Panel

To access the post-analysis, follow the path: Menu: Plugin / Enrichment Map / Post Analysis.

[Screenshot: Post Analysis Input Panel, Signature Hubs]

  1. Post Analysis Type

    • Currently there is only one Type of Post Analysis available:
    • Signature Hubs - calculates the overlap between genesets of the current Enrichment Map and a number of selected external genesets.

  2. Gene Sets

    • The user needs to supply two geneset-files (both in the gmt format):
    • GMT - Enrichment Genesets; the same geneset gmt file as used to create the Enrichment Map (this field will be usually already populated)

    • SigGMT - the gmt file with the Signature-Genesets

  3. Load Genesets should be pressed after the file with the Signature-Genesets has been selected. This will populate the list of available Signature Genesets.

  4. Available Signature Genesets – Once the genesets are loaded, this box will contain a list of all genesets defined in the SigGMT file. Click to highlight the desired geneset(s).

    • To highlight more than one geneset at a time, the user can click while pressing the [SHIFT]-, [COMMAND]- or [CTRL]-keys (depending on the Operating System).
  5. Selected Signature Genesets – The Signature Hub analysis will be performed with all genesets in this list. The user can use the down- and up-buttons to move highlighted genesets from one list to the other.

  6. Parameters – The User can choose a method and a cutoff for generating an edge between a signature-geneset and an enrichment geneset. The following methods are available:

    • Hypergeometric Test is the probability (p-value) to find an overlap of k or more genes between a signature geneset and an enrichment geneset by chance.

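      • Formula of the Hypergeometric Test:
        p-value = sum for i = k to min(n, m) of [C(m, i) * C(N - m, n - i)] / C(N, n)
        where C(a, b) is the binomial coefficient "a choose b", with:
        k (successes in the sample) : size of the Overlap,
        n (size of the sample) : size of the Signature geneset
        m (total number of successes) : size of the Enrichment Geneset
        N (total number of elements) : size of the union of all Enrichment Genesets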

    • Number of common Genes

    • Directed Overlap is the size of the intersection of both genesets divided by the size of the Enrichment Geneset.

  7. Actions - The user has three choices, Reset (clears input panel), Close (closes input panel), and Run (takes all parameters in panel and performs the Post-Analysis)

To access the post-analysis p-values, select the attribute "EM1_Overlap_Hypergeom_pVal" for display (Data Panel: Edge Attribute Browser tab, attribute selection icon).
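
The hypergeometric p-value and the directed overlap can be sketched with the standard library. This is a plain restatement of the textbook upper-tail hypergeometric probability using the k, n, m and N definitions above, not the plugin's code; it requires Python 3.8 or later for `math.comb`, and the function names are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, m, N):
    """Probability of an overlap of k or more genes by chance.

    k: size of the overlap, n: size of the signature geneset,
    m: size of the enrichment geneset, N: size of the gene universe.
    """
    upper = min(n, m)
    favourable = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def directed_overlap(signature, enrichment):
    """Intersection size relative to the enrichment geneset size."""
    return len(signature & enrichment) / len(enrichment)


p_value = hypergeom_tail(k=2, n=3, m=4, N=10)  # (36 + 4) / 120
```

Smaller p-values indicate an overlap that is less likely to arise by chance, so a cutoff on this method keeps edges whose p-value is at or below the chosen value.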

Additional Features

Launch Enrichment Map from the command line

  • Requirements:
    1. Enrichment Map v1.3 or higher
    2. Commandtool - available from the Cytoscape App Store

Distinct Species or Platform Analysis

Bulk Enrichment Map Build

Calculate Gene set relationships

GSEA Leading Edge Functionality

  • For every gene set that is tested for significance using GSEA there is a set of genes in that gene set defined as the Leading Edge. According to GSEA the leading edge is:

"the subset of members that contribute most to the ES. For a positive ES, the leading edge subset is the set of members that appear in the ranked list prior to the peak score. For a negative ES, it is the set of members that appear subsequent to the peak score."
  • In essence, the leading edge is the set of genes that contribute most to the enrichment of the gene set.
  • For Enrichment Map, leading edge information is extracted from the GSEA enrichment results files, from the column denoted as Rank at Max. Rank at Max is the rank of the gene where the ES has its maximal value, i.e. the peak ES. Everything with a better rank than the Rank at Max is part of the leading edge set.

Customizing Defaults with Cytoscape Properties

The Enrichment Map Plugin evaluates a number of Cytoscape Properties with which a user can define some customized default values.
These can be added and changed with the Cytoscape Preferences Editor (Edit / Preferences / Properties...) or by directly editing the file cytoscape.props within the .cytoscape folder in the User's HOME directory.

Supported Cytoscape Properties are:

EnrichmentMap.default_pvalue
Default P-value cutoff for Building Enrichment Maps

Default Value: 0.05

valid Values: float >0.0, <1.0

EnrichmentMap.default_qvalue
Default Q-value cutoff for Building Enrichment Maps
Default Value: 0.25

valid Values: float >0.0, <1.0

EnrichmentMap.default_overlap
Default Overlap coefficient cutoff for Building Enrichment Maps
Default Value: 0.50

valid Values: float >0.0, <1.0

EnrichmentMap.default_jaccard
Default Jaccard coefficient cutoff for Building Enrichment Maps
Default Value: 0.25

valid Values: float >0.0, <1.0

EnrichmentMap.default_overlap_metric
Default choice of similarity metric for Building Enrichment Maps

Default Value: Jaccard

valid Values: Jaccard, Overlap

EnrichmentMap.default_sort_method
Default sorting in the legend/parameters panel: Hierarchical Clustering, Ranks (defaults to the first rank file; if there are no ranks, no sort is applied), Columns (defaults to the first column) or No Sort.

Default Value: Hierarchical Cluster

valid Values: Hierarchical Cluster, Ranks, Columns, No Sort

EnrichmentMap.hieracical_clusteting_theshold
Threshold for the maximum number of Genes before a dialogue opens to confirm if clustering should be performed.
Default Value: 1000

valid Values: Integer

nodelinkouturl.MSigDb.GSEA Gene sets
LinkOut URL for MSigDb.GSEA Gene sets.

Default Value: http://www.broad.mit.edu/gsea/msigdb/cards/%ID%.html

valid Values: URL

EnrichmentMap.disable_heatmap_autofocus
Flag to override the automatic focus on the Heatmap once a Node or Edge is selected.

Default Value: FALSE

valid Values: TRUE, FALSE
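
As a hedged illustration of preparing custom defaults, the sketch below checks the four cutoff properties against the valid range stated above and renders lines for cytoscape.props. It assumes the file uses the usual Java properties `key=value` syntax; the helper name is hypothetical, and it deliberately leaves out keys that contain spaces, which would need escaping.

```python
# Properties whose valid values are floats strictly between 0.0 and 1.0.
CUTOFF_KEYS = (
    "EnrichmentMap.default_pvalue",
    "EnrichmentMap.default_qvalue",
    "EnrichmentMap.default_overlap",
    "EnrichmentMap.default_jaccard",
)


def render_props(settings):
    """Validate cutoff values and return key=value lines as one string."""
    lines = []
    for key, value in settings.items():
        if key in CUTOFF_KEYS and not 0.0 < float(value) < 1.0:
            raise ValueError(f"{key} must be > 0.0 and < 1.0, got {value}")
        lines.append(f"{key}={value}")
    return "\n".join(lines)


snippet = render_props({
    "EnrichmentMap.default_pvalue": 0.01,
    "EnrichmentMap.default_overlap_metric": "Overlap",
})
```

The resulting text could be pasted into cytoscape.props, or the same values entered through the Cytoscape Preferences Editor.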

FAQ

Software/EnrichmentMap/UserManual (last edited 2013-06-19 18:30:28 by RuthIsserlin)
