Gene sets are collections of genes that are grouped together based on higher level function or system characteristics, such as being part of the same molecular process or found to be co-regulated for example. Assessing gene sets in cBioPortal is useful when the user wants to visualize the number of mutations in sets of genes, or wants to see if all genes in a set are up- or down-regulated. To visualize gene set variation in a sample, the user can calculate scores per gene set per sample using the Gene Set Variation Analysis (GSVA) algorithm (Hänzelmann, 2013).
Before loading a study with gene set data, gene set definitions have to be added to the database. These can be custom user-defined sets, or sets downloaded from external sources such as MSigDB. Additionally, a gene set hierarchy can be imported which is used on the cBioPortal Query page for selecting gene sets.
This example shows how the process of importing gene set data using test data.
Navigate to scripts folder:
Import gene sets and supplementary data:
Note: This removes existing gene set, gene set hierarchy and gene set genetic profile data.
./importGenesetData.pl \--data ../../test/resources/genesets/study_es_0_genesets.gmt \--new-version msigdb_6.1 \--supp ../../test/resources/genesets/study_es_0_supp-genesets.txt
Import gene set hierarchy data:
./importGenesetHierarchy.pl \--data ../../test/resources/genesets/study_es_0_tree.yaml
Restart Tomcat if you have it running.
Import study (replace argument after
-u with local cBioPortal and
-html with preferred location for html report):
./importer/metaImport.py \-s ../../test/scripts/test_data/study_es_0 \-u http://cbioportal:8080/cbioportal \-html report_study_es_0.html \-v \-o
Gene set functionality was added in cBioPortal 1.7.0. Please use this or a later version. In addition, the database has to be updated to version 2.3.0 or higher, depending on the cBioPortal version. This can be done by running the python wrapper
Updating the database is described here.
Once you have initialized MySQL with cBioPortal database, it is possible to import gene sets. The format of the gene set data file is the Gene Matrix Transposed file format (.gmt). This format is also used by the MSigDB, which hosts several collections of gene sets on: https://software.broadinstitute.org/gsea/msigdb/
Sample of .gmt file:
GMT files contain a row for every gene set. The first column contains the EXTERNAL_ID or
stable id (MsigDB calls this "standard name"), e.g. GO_POTASSIUM_ION_TRANSPORT, not longer than 100 characters. The second column contains the REF_LINK. This is an optional URL linking to external information about this gene set. Column 3 to N contain the Entrez gene IDs that belong to this gene set.
Additional information can be placed in a supplementary file. This file should be a .txt, containing columns for the
stable id, the long name (max 100 characters) and description of the gene set (max 300 characters).
Sample of supplementary .txt file:
GLI1_UP.V1_DN<TAB>GLI1 upregulated v1 down genes<TAB>Genes down-regulated in RK3E cells (kidney epithelium) over-expressing GLI1 [GeneID=2735].GLI1_UP.V1_UP<TAB>GLI1 upregulated v1 up genes<TAB>Genes up-regulated in RK3E cells (kidney epithelium) over-expressing GLI1 [GeneID=2735].E2F1_UP.V1_DN<TAB>E2F1 upregulated v1 down genes<TAB>Identification of E2F1-regulated genes that modulate the transition from quiescence into DNA synthesis, or have roles in apoptosis, signal transduction, membrane biology, and transcription repression.
The importer for gene sets can be run with a perl wrapper, which is located at the following location and requires the following arguments:
cd <cbioportal_source_folder>/core/src/main/scriptsperl importGenesetData.plrequired: --data <data_file.gmt>--new-version <Version> OR --update-infooptional: --supp <supp_file.txt>
--new-version argument with a
<Version> parameter is used for loading new gene set definitions. It is not possible to add new gene sets or change the genes of current gene sets, without removing the old gene sets first. This is to prevent the user from having gene sets from different definitions and data from older definitions. The user can choose the name or number of the
<Version> as he likes, e.g.
Oncogenic_2017. Running the script with
--new-version removes all previous gene sets, gene set hierarchy and gene set genetic profiles. A prompt is given to make sure the user wants to do this. Note that it is possible enter the same version as the previous version, but previous data is removed nevertheless.
--update info can be used only to update only the long name, description and reference URL.
After importing gene sets, you can import a gene set hierarchy that is used on the query page to select gene sets.
For gene set hierarchy files, we use the YAML format. This is common format to structure hierarchical data.
Sample of format (note this is mock data):
Custom:Gene sets:- BCAT.100_UP.V1_DNCancer Gene sets from Broad:Gene sets:- AKT_UP.V1_DN- CYCLIN_D1_KE_.V1_UPBroad Subcategory 1:Gene sets:- HINATA_NFKB_MATRIXBroad Subcategory 2:Gene sets:- GLI1_UP.V1_UP- GLI1_UP.V1_DN- CYCLIN_D1_KE_.V1_UP
To make your own hierarchy, make sure every branchname ends with
:. Every branch can contain new branches (which can be considered subcategories) or gene sets (which are designated by the
Gene sets: statement). The gene set names are the
stable ids imported by
ImportGenesetData.java and should start with
cd <cbioportal_source_folder>/core/src/main/scriptsperl importGenesetHierarchy.plrequired: --data <data_file.yaml>
Gene set data can be added to a study folder and subsequently import the whole study with
metaImport.py. cBioPortal supports GSVA Scores and p-values (from bootstrapping) calculated using Gene Set Variation Analysis (GSVA, Hänzelmann, 2013). A description of GSVA study data can be found in the cBioPortal File Formats documentation.
GSVA: gene set variation analysis for microarray and RNA-Seq data Sonja Hänzelmann, Robert Castelo and Justin Guinney, BMC Bioinformatics, 2013 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7 https://www.bioconductor.org/packages/release/bioc/html/GSVA.html
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov, PNAS, 2005 https://www.pnas.org/content/102/43/15545 https://software.broadinstitute.org/gsea/msigdb