File Formats

Introduction

This page describes the file formats that cancer study data should assume in order to be successfully imported into the database. Unless otherwise noted, all data files are in tabular-TSV (tab separated value) format and have an associated metadata file which is in a multiline record format. The metadata and data files should follow a few rules documented at the Data Loading page.

Formats

Cancer Study

As described in the Data Loading tool page, the following file is needed to describe the cancer study:

Meta file

This file contains metadata about the cancer study. The file contains the following fields:
    1. type_of_cancer: The cancer type abbreviation, e.g., "brca". This should be the same cancer type as specified in the meta_cancer_type.txt file, if available. The type can be "mixed" for studies with multiple cancer types.
    2. cancer_study_identifier: A string used to uniquely identify this cancer study within the database, e.g., "brca_joneslab_2013".
    3. name: The name of the cancer study, e.g., "Breast Cancer (Jones Lab 2013)".
    4. description: A description of the cancer study, e.g., "Comprehensive profiling of 103 breast cancer samples. Generated by the Jones Lab 2013". This description may contain one or more URLs to relevant information.
    5. citation (Optional): A relevant citation, e.g., "TCGA, Nature 2012".
    6. pmid (Optional): One or more relevant PubMed IDs (comma separated without whitespace). If used, the citation field has to be filled, too.
    7. groups (Optional): When using an authenticating cBioPortal, lists the user groups that are allowed access to this study. Multiple groups are separated with a semicolon (";"), e.g., "PUBLIC;GDAC;SU2C-PI3K". The study will be invisible to users who are not in at least one of the listed groups, as if it had not been loaded at all. See User-Authorization for more information on groups.
    8. add_global_case_list (Optional): Set to 'true' if you would like the "All samples" case list to be generated automatically. See also Case lists.
    9. tags_file (Optional): The name of the file containing custom tags for the study.
    10. reference_genome (Optional): The study reference genome (e.g. hg19, hg38). If this property is not specified, the study is assigned to the reference genome specified in portal.properties (property ucsc.build).

Example

An example meta_study.txt file would be:
type_of_cancer: brca
cancer_study_identifier: brca_joneslab_2013
name: Breast Cancer (Jones Lab 2013)
description: Comprehensive profiling of 103 breast cancer samples. Generated by the Jones Lab 2013.
add_global_case_list: true
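As a rough illustration of the multiline "key: value" record format, the sketch below parses a meta file into a dictionary and reports missing fields. It is not the cBioPortal validator: the helper names (parse_meta, missing_study_fields) are hypothetical, and treating the four fields not marked "(Optional)" as required is an assumption drawn from the field list above.

```python
# Hypothetical helpers; not part of the cBioPortal tooling.
REQUIRED_STUDY_FIELDS = (
    "type_of_cancer",
    "cancer_study_identifier",
    "name",
    "description",
)


def parse_meta(text):
    """Parse the 'key: value' lines of a meta file into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so values may contain URLs.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def missing_study_fields(fields):
    """Return required meta_study.txt fields that are absent or empty."""
    missing = [name for name in REQUIRED_STUDY_FIELDS if not fields.get(name)]
    # Per the field list above, pmid requires citation to be filled too.
    if fields.get("pmid") and not fields.get("citation"):
        missing.append("citation")
    return missing


example = """type_of_cancer: brca
cancer_study_identifier: brca_joneslab_2013
name: Breast Cancer (Jones Lab 2013)
description: Comprehensive profiling of 103 breast cancer samples. Generated by the Jones Lab 2013.
add_global_case_list: true
"""
study = parse_meta(example)
print(missing_study_fields(study))
```

The same parse_meta sketch applies to the other meta files on this page, since they share the record format; only the set of expected fields differs.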

Cancer Type

If the type_of_cancer specified in the meta_study.txt does not yet exist in the type_of_cancer database table, a meta_cancer_type.txt file is also mandatory.

Meta file

The file is comprised of the following fields:
    1. genetic_alteration_type: CANCER_TYPE
    2. datatype: CANCER_TYPE
    3. data_filename: your datafile

Example

An example meta_cancer_type.txt file would be:
genetic_alteration_type: CANCER_TYPE
datatype: CANCER_TYPE
data_filename: cancer_type.txt

Data file

The file is comprised of the following columns in the order specified:
    1. type_of_cancer: The cancer type abbreviation, e.g., "brca".
    2. name: The name of the cancer type, e.g., "Breast Invasive Carcinoma".
    3. dedicated_color: CSS color name of the color associated with this cancer study, e.g., "HotPink". See this list for supported names, and follow the awareness ribbons color schema. This color is associated with the cancer study on various web pages within the cBioPortal.
    4. parent_type_of_cancer: The type_of_cancer field of the cancer type of which this is a subtype, e.g., "Breast".
       Note: you can set the parent to tissue, the reserved word that places the given cancer type at the "root" level of the "studies oncotree" generated on the homepage (aka query page) of the portal.

Example

An example record would be:
brca<TAB>Breast Invasive Carcinoma<TAB>HotPink<TAB>Breast

Clinical Data

The clinical data is used to capture both clinical attributes and the mapping between patient and sample ids. The software supports multiple samples per patient.
As of March 2016, the clinical file is split into a patient clinical file and a sample clinical file. The sample file is required, whereas the patient file is optional. cBioPortal has specific functionality for a core set of patient and sample columns, but can also display custom columns (see section "Custom columns in clinical data").

Meta files

The two clinical metadata files (or just one metadata file if you choose to leave the patient file out) have to contain the following fields:
    1. cancer_study_identifier: same value specified in meta_study.txt
    2. genetic_alteration_type: CLINICAL
    3. datatype: PATIENT_ATTRIBUTES or SAMPLE_ATTRIBUTES
    4. data_filename: your datafile

Examples

An example metadata file, e.g. named meta_clinical_sample.txt, would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: CLINICAL
datatype: SAMPLE_ATTRIBUTES
data_filename: data_clinical_sample.txt
An example metadata file, e.g. named meta_clinical_patient.txt, would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES
data_filename: data_clinical_patient.txt

Data files

For both patients and samples, the clinical data file is a two dimensional matrix with multiple clinical attributes. When the attributes are defined in the patient file they are considered to be patient attributes; when they are defined in the sample file they are considered to be sample attributes.
The first four rows of the clinical data file contain tab-delimited metadata about the clinical attributes. These rows have to start with a '#' symbol. Each of these four rows contains a different type of information about the attributes that are defined in the fifth row:
    Row 1: The attribute Display Names: The display name for each clinical attribute
    Row 2: The attribute Descriptions: Long(er) description of each clinical attribute
    Row 3: The attribute Datatype: The datatype of each clinical attribute (must be one of: STRING, NUMBER, BOOLEAN)
    Row 4: The attribute Priority: A number which indicates the importance of each attribute. In the future, higher priority attributes will appear in more prominent places than lower priority ones on relevant pages (such as the Study View). A higher number indicates a higher priority.
    To promote a chart in the study view, increase its priority. The higher the number, the more prominently the chart is displayed in the study view.
    To hide a chart, set its priority to 0. A combination chart is hidden as soon as one of its clinical attributes has priority 0.
    A few charts currently have preassigned priorities, listed below. Assigning any priority other than 1 to an attribute overrides its preassigned priority.
    CANCER_TYPE: 3000, CANCER_TYPE_DETAILED: 2000,
    Overall survival plot: 400 (combination of OS_MONTHS and OS_STATUS)
    Disease Free Survival Plot: 300 (combination of DFS_MONTHS and DFS_STATUS)
    Mutation Count vs. CNA Scatter Plot: 200,
    Mutated Genes Table: 90, CNA Genes Table: 80, study_id: 70, # of Samples Per Patient: 40,
    With Mutation Data Pie Chart: 60, With CNA Data Pie Chart: 50,
    Mutation Count Bar Chart: 30, CNA Bar Chart: 20,
    GENDER: 9, SEX: 9, AGE: 8
    Please note: Priority is not the sole factor determining which chart is displayed first. A layout algorithm in the study view also makes minor adjustments. The algorithm tries to fit all charts into a 2 by 2 matrix (the Mutated Genes Table occupies a full 2 by 2 space). When a chart cannot be fitted into the first matrix, a second matrix is generated, which has a lower priority than the first one. If a later chart can fit into the first matrix, its priority is promoted.
    Please see here for more detailed information about how the study view uses priority and how the layout is calculated based on priority.
    Row 5: The attribute name for the database: This name should be in upper case.
    Row 6: This is the first row that contains actual data.

Example clinical header

Below is an example of the first 4 rows with the respective metadata for the attributes defined in the 5th row.
#Patient Identifier<TAB>Overall Survival Status<TAB>Overall Survival (Months)<TAB>Disease Free Status<TAB>Disease Free (Months)<TAB>...
#Patient identifier<TAB>Overall survival status<TAB>Overall survival in months since diagnosis<TAB>Disease free status<TAB>Disease free in months since treatment<TAB>...
#STRING<TAB>STRING<TAB>NUMBER<TAB>STRING<TAB>NUMBER<TAB>...
#1<TAB>1<TAB>1<TAB>1<TAB>1<TAB>
PATIENT_ID<TAB>OS_STATUS<TAB>OS_MONTHS<TAB>DFS_STATUS<TAB>DFS_MONTHS<TAB>...
....
data - see examples below
....

Clinical patient columns

The file containing the patient attributes has one required column:
    PATIENT_ID (required): a unique patient ID. This field allows only numbers, letters, points, underscores and hyphens.
The following columns are used by the study view as well as the patient view. In the study view they are used to create the survival plots. In the patient view they are used to add information to the header.
Note on survival plots: to generate the survival plots successfully, the columns are required to be in pairs, which means the file should have a pair of columns that have the same prefix but ending with _STATUS and _MONTHS individually. For example, PFS_STATUS and PFS_MONTHS are a valid pair of columns that can generate the survival plots.
Note on survival status value: the value of a survival status must be prefixed with 0: or 1:. A value with prefix 0: means that no event occurred (e.g. LIVING, DiseaseFree). A value with prefix 1: means that an event occurred (e.g. DECEASED, Recurred/Progressed).
    OS_STATUS: Overall patient survival status
      Possible values: 1:DECEASED, 0:LIVING
      In the patient view, 0:LIVING creates a green label, 1:DECEASED a red label.
    OS_MONTHS: Overall survival in months since initial diagnosis
    DFS_STATUS: Disease free status since initial treatment
      Possible values: 0:DiseaseFree, 1:Recurred/Progressed
      In the patient view, 0:DiseaseFree creates a green label, 1:Recurred/Progressed a red label.
    DFS_MONTHS: Disease free (months) since initial treatment
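The two notes above (paired _STATUS/_MONTHS columns and the 0:/1: prefix) can be checked with a few lines of Python. This is a minimal sketch with hypothetical helper names, not the checks that cBioPortal itself runs; it ignores missing values, which the text above does not cover.

```python
# Hypothetical helpers illustrating the survival-column rules above.
def has_survival_pair(columns, prefix):
    """True when both <prefix>_STATUS and <prefix>_MONTHS are present."""
    return f"{prefix}_STATUS" in columns and f"{prefix}_MONTHS" in columns


def is_valid_status(value):
    """A survival status must start with '0:' (no event) or '1:' (event)."""
    return value.startswith(("0:", "1:"))


columns = ["PATIENT_ID", "OS_STATUS", "OS_MONTHS", "DFS_STATUS"]
print(has_survival_pair(columns, "OS"))   # OS has both columns
print(has_survival_pair(columns, "DFS"))  # DFS_MONTHS is missing
print(is_valid_status("0:LIVING"))
```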
These columns, when provided, add additional information to the patient description in the header:
    PATIENT_DISPLAY_NAME: Patient display name (string)
    GENDER or SEX: Gender or sex of the patient (string)
    AGE: Age at which the condition or disease was first diagnosed, in years (number)
    TUMOR_SITE
Custom attributes: see the section "Custom columns in clinical data".

Example patient data file

#Patient Identifier<TAB>Overall Survival Status<TAB>Overall Survival (Months)<TAB>Disease Free Status<TAB>Disease Free (Months)<TAB>...
#Patient identifier<TAB>Overall survival status<TAB>Overall survival in months since diagnosis<TAB>Disease free status<TAB>Disease free in months since treatment<TAB>...
#STRING<TAB>STRING<TAB>NUMBER<TAB>STRING<TAB>NUMBER<TAB>...
#1<TAB>1<TAB>1<TAB>1<TAB>1<TAB>
PATIENT_ID<TAB>OS_STATUS<TAB>OS_MONTHS<TAB>DFS_STATUS<TAB>DFS_MONTHS<TAB>...
PATIENT_ID_1<TAB>1:DECEASED<TAB>17.97<TAB>1:Recurred/Progressed<TAB>30.98<TAB>...
PATIENT_ID_2<TAB>0:LIVING<TAB>63.01<TAB>0:DiseaseFree<TAB>63.01<TAB>...
...
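To make the row layout concrete, the following sketch reads a clinical data file into per-attribute metadata (rows 1 to 4) and data records (row 6 onwards). The function name read_clinical and the dictionary keys are illustrative choices, and real tab characters stand in for the <TAB> placeholders used in the examples.

```python
# Hypothetical reader for the clinical file layout described above.
def read_clinical(text):
    rows = [line.split("\t") for line in text.splitlines() if line]
    meta_rows, header, data = rows[:4], rows[4], rows[5:]
    for row in meta_rows:
        if not row[0].startswith("#"):
            raise ValueError("the first four rows must start with '#'")
    # Drop the leading '#' from the first cell of each metadata row.
    meta_rows = [[row[0][1:]] + row[1:] for row in meta_rows]
    keys = ("display_name", "description", "datatype", "priority")
    attributes = {
        name: {key: meta_rows[k][i] for k, key in enumerate(keys)}
        for i, name in enumerate(header)
    }
    records = [dict(zip(header, row)) for row in data]
    return attributes, records


sample = "\n".join([
    "#Patient Identifier\tOverall Survival Status\tOverall Survival (Months)",
    "#Patient identifier\tOverall survival status\tOverall survival in months since diagnosis",
    "#STRING\tSTRING\tNUMBER",
    "#1\t1\t1",
    "PATIENT_ID\tOS_STATUS\tOS_MONTHS",
    "PATIENT_ID_1\t1:DECEASED\t17.97",
    "PATIENT_ID_2\t0:LIVING\t63.01",
])
attributes, records = read_clinical(sample)
print(attributes["OS_MONTHS"]["datatype"], records[0]["PATIENT_ID"])
```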

Clinical sample columns

The file containing the sample attributes has two required columns:
    PATIENT_ID (required): A patient ID. This field can only contain numbers, letters, points, underscores and hyphens.
    SAMPLE_ID (required): A sample ID. This field can only contain numbers, letters, points, underscores and hyphens.
By adding PATIENT_ID here, cBioPortal will map the given sample to this patient. This enables one to associate multiple samples to one patient. For example, a single patient may have had multiple biopsies, each of which has been genomically profiled. See this example for a patient with multiple samples.
The following columns are required for the pan-cancer summary statistics tab (example).
    CANCER_TYPE: Cancer Type
    CANCER_TYPE_DETAILED: Cancer Type Detailed, a sub-type of the specified CANCER_TYPE
The following columns affect the header of the patient view by adding text to the samples in the header:
    SAMPLE_DISPLAY_NAME: displayed in addition to the ID
    SAMPLE_CLASS
    METASTATIC_SITE or PRIMARY_SITE: Override TUMOR_SITE (patient level attribute) depending on sample type
The following columns additionally affect the Timeline data visualization:
    OTHER_SAMPLE_ID: sometimes the timeline data (see the timeline data section) will not have the SAMPLE_ID but instead an alias to the sample (in the field SPECIMEN_REFERENCE_NUMBER). To ensure that the timeline data field SPECIMEN_REFERENCE_NUMBER is correctly linked to this sample, be sure to add this column OTHER_SAMPLE_ID as an attribute to your sample attributes file.
    SAMPLE_TYPE, TUMOR_TISSUE_SITE or TUMOR_TYPE: gives sample icon in the timeline a color.
      If set to recurrence, recurred, progression or progressed: orange
      If set to metastatic or metastasis: red
      If set to primary or otherwise: black
Custom attributes: see the section "Custom columns in clinical data".

Example sample data file

#Patient Identifier<TAB>Sample Identifier<TAB>Subtype<TAB>...
#Patient identifier<TAB>Sample Identifier<TAB>Subtype description<TAB>...
#STRING<TAB>STRING<TAB>STRING<TAB>...
#1<TAB>1<TAB>1<TAB>...
PATIENT_ID<TAB>SAMPLE_ID<TAB>SUBTYPE<TAB>...
PATIENT_ID_1<TAB>SAMPLE_ID_1<TAB>basal-like<TAB>...
PATIENT_ID_2<TAB>SAMPLE_ID_2<TAB>Her2 enriched<TAB>...
...
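The ID character rule and the sample-to-patient mapping can be illustrated as follows. This is a sketch under stated assumptions: "letters" is read as ASCII letters, and the names ID_PATTERN, is_valid_id and samples_by_patient are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: "numbers, letters, points, underscores and hyphens" means
# ASCII letters and digits plus '.', '_' and '-'.
ID_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_id(value):
    return ID_PATTERN.fullmatch(value) is not None


def samples_by_patient(records):
    """Group SAMPLE_ID values under their PATIENT_ID."""
    grouped = defaultdict(list)
    for record in records:
        for column in ("PATIENT_ID", "SAMPLE_ID"):
            if not is_valid_id(record[column]):
                raise ValueError(f"invalid {column}: {record[column]!r}")
        grouped[record["PATIENT_ID"]].append(record["SAMPLE_ID"])
    return dict(grouped)


records = [
    {"PATIENT_ID": "PATIENT_ID_1", "SAMPLE_ID": "SAMPLE_ID_1"},
    {"PATIENT_ID": "PATIENT_ID_1", "SAMPLE_ID": "SAMPLE_ID_1b"},
    {"PATIENT_ID": "PATIENT_ID_2", "SAMPLE_ID": "SAMPLE_ID_2"},
]
print(samples_by_patient(records))
```

A patient with several biopsies then simply appears with more than one sample ID in the resulting mapping.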

Columns with specific functionality

These columns can be in either the patient or sample file.
    CANCER_TYPE: Overrides study wide cancer type
    CANCER_TYPE_DETAILED
    KNOWN_MOLECULAR_CLASSIFIER
    GLEASON_SCORE: Radical prostatectomy Gleason score for prostate cancer
    HISTOLOGY
    TUMOR_STAGE_2009
    TUMOR_GRADE
    ETS_RAF_SPINK1_STATUS
    TMPRSS2_ERG_FUSION_STATUS
    ERG_FUSION_ACGH
    SERUM_PSA
    DRIVER_MUTATIONS

Custom columns in clinical data

cBioPortal supports custom columns with clinical data in either the patient or sample file. They should follow the previously described 5-row header format. Be sure to provide the correct Datatype, for optimal search, sorting, filtering (in clinical data tab) and visualization.
The Clinical Data Dictionary from MSKCC is used to normalize clinical data, and should be followed to make the clinical data comparable between studies. This dictionary provides a definition whether an attribute should be defined on the patient or sample level, as well as provides a name, description and datatype. The data curator can choose to ignore these proposed definitions, but not following this dictionary might make comparing data between studies more difficult. It should however not break any cBioPortal functionality. See GET /api/ at https://oncotree.mskcc.org/cdd/swagger-ui.html#/ for the data dictionary of all known clinical attributes.

Banned column names

MUTATION_COUNT and FRACTION_GENOME_ALTERED are auto populated clinical attributes, and should therefore not be present in clinical data files.

Discrete Copy Number Data

The discrete copy number data file contains values that would be derived from copy-number analysis algorithms like GISTIC 2.0 or RAE. GISTIC 2.0 can be installed or run online using the GISTIC 2.0 module on GenePattern. For some help on using GISTIC 2.0, check the Data Loading: Tips and Best Practices page. When loading case list data, the _cna case list is required. See the case list section.

Meta file

The meta file is comprised of the following fields:
    1. cancer_study_identifier: same value as specified in study meta file
    2. genetic_alteration_type: COPY_NUMBER_ALTERATION
    3. datatype: DISCRETE
    4. stable_id: gistic, cna, cna_rae or cna_consensus
    5. show_profile_in_analysis_tab: true
    6. profile_name: A name for the discrete copy number data, e.g., "Putative copy-number alterations from GISTIC"
    7. profile_description: A description of the copy number data, e.g., "Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification."
    8. data_filename: your datafile
    9. gene_panel (Optional): gene panel stable id
    10. pd_annotations_filename (Optional): name of custom driver annotations file

Example

An example metadata file could be named meta_cna.txt and its contents could be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: DISCRETE
stable_id: gistic
show_profile_in_analysis_tab: true
profile_name: Putative copy-number alterations from GISTIC
profile_description: Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification.
data_filename: data_cna.txt
pd_annotations_filename: data_cna_pd_annotations.txt

Data file

For each gene (row) in the data file, the following columns are required in the order specified:
One or both of:
    Hugo_Symbol: A HUGO gene symbol.
    Entrez_Gene_Id: An Entrez Gene identifier.
And:
    An additional column for each sample in the dataset using the sample id as the column header.
For each gene-sample combination, a copy number level is specified:
    "-2" is a deep loss, possibly a homozygous deletion
    "-1" is a single-copy loss (heterozygous deletion)
    "0" is diploid
    "1" indicates a low-level gain
    "2" is a high-level amplification.

Example

An example data file which includes the required column header would look like:
Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
ACAP3<TAB>116983<TAB>0<TAB>-1<TAB>...
AGRN<TAB>375790<TAB>2<TAB>0<TAB>...
...
...
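As a quick sanity check before loading, the sketch below scans a discrete copy number matrix and reports cells whose value is not one of the five levels listed above. It assumes that only those five values are acceptable, which may be stricter than the actual importer; the function name invalid_cna_cells is hypothetical.

```python
# Assumption: only the five levels described above are accepted.
ALLOWED_LEVELS = {"-2", "-1", "0", "1", "2"}
GENE_COLUMNS = {"Hugo_Symbol", "Entrez_Gene_Id"}


def invalid_cna_cells(text):
    """Return (gene, sample, value) for every cell outside ALLOWED_LEVELS."""
    lines = [line.split("\t") for line in text.splitlines() if line]
    header, rows = lines[0], lines[1:]
    sample_indexes = [i for i, name in enumerate(header) if name not in GENE_COLUMNS]
    problems = []
    for row in rows:
        gene = row[0]  # first column, whichever gene identifier it holds
        for i in sample_indexes:
            if row[i] not in ALLOWED_LEVELS:
                problems.append((gene, header[i], row[i]))
    return problems


data = (
    "Hugo_Symbol\tEntrez_Gene_Id\tSAMPLE_ID_1\tSAMPLE_ID_2\n"
    "ACAP3\t116983\t0\t-1\n"
    "AGRN\t375790\t2\t3\n"
)
print(invalid_cna_cells(data))
```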

Custom driver annotations file

Custom driver annotations can be defined for discrete copy number data. These annotations can be used to complement or replace default driver annotation resources OncoKB and HotSpots. Custom driver annotations can be placed in a separate file that is referenced by the pd_annotations_filename field of the meta file. The annotation file can hold the following columns:
    1. Hugo_Symbol (Optional): A HUGO gene symbol. Required when column Entrez_Gene_Id is not present.
    2. Entrez_Gene_Id (Optional): An Entrez Gene identifier. Required when column Hugo_Symbol is not present.
    3. SAMPLE_ID: A sample ID. This field can only contain numbers, letters, points, underscores and hyphens.
    4. cbp_driver (Optional): "Putative_Driver", "Putative_Passenger", "Unknown", "NA" or "" (empty value). This field must be present if cbp_driver_annotation is also present.
    5. cbp_driver_annotation (Optional): Description field for the cbp_driver value (limited to 80 characters). This field must be present if cbp_driver is also present. This field is free text. Example values for this field are: "Pathogenic" or "VUS".
    6. cbp_driver_tiers (Optional): Free label/category that marks the mutation as a putative driver such as "Driver", "Highly actionable", "Potential drug target". This field must be present if cbp_driver_tiers_annotation is also present. In the OncoPrint view's Mutation Color dropdown menu, these tiers are ordered alphabetically. This field is free text and limited to 20 characters. For mutations without a custom annotation, leave the field blank or type "NA".
    7. cbp_driver_tiers_annotation (Optional): Description field for the cbp_driver_tiers value (limited to 80 characters). This field must be present if cbp_driver_tiers is also present, and cannot be present when the cbp_driver_tiers field is not present.
All genes referenced in the custom driver annotation file must be present in the data file for discrete copy number alterations.
The cbp_driver column flags the mutation as either driver or passenger. In cBioPortal, passenger mutations are also known as variants of unknown significance (VUS). The cbp_driver_tiers column assigns an annotation tier to the mutation, such as "Driver", "Highly actionable" or "Potential drug target". When a tier is selected, mutations with that annotation are highlighted as driver. Both types of custom annotations contain a second column with the suffix _annotation, to add a description. This is displayed in the tooltip that appears when hovering over the sample's custom annotation icon in the OncoPrint view.
You can learn more about configuring these annotations in the portal.properties documentation. When properly configured, the customized annotations appear in the "Mutation Color" menu of the OncoPrint view.

Example

An example data file which includes the required column header would look like:
SAMPLE_ID<TAB>Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>cbp_driver<TAB>cbp_driver_annotation<TAB>cbp_driver_tiers<TAB>cbp_driver_tiers_annotation<TAB>...
TCGA-BH-A0E6-01<TAB>GENEA<TAB>116983<TAB>Putative_Driver<TAB>see: PMID:12345678<TAB>Highly actionable<TAB>Per decision 01/01/2020<TAB>
TCGA-BH-A0E6-01<TAB>GENEB<TAB>375790<TAB>Putative_Passenger<TAB>see: PMID:12345678<TAB><TAB><TAB>
...

GISTIC 2.0 Format

GISTIC 2.0 outputs a tabular file similarly formatted to the cBioPortal format, called <prefix>_all_thresholded.by_genes.txt. In this file the gene symbol is found in the Gene Symbol column, while Entrez gene IDs are in the Gene ID or Locus ID column. Please rename Gene Symbol to Hugo_Symbol and Gene ID or Locus ID to Entrez_Gene_Id. The Cytoband column can be kept in the table, but note that these values are ignored in cBioPortal. cBioPortal uses cytoband annotations from the map_location column in NCBI's Homo_sapiens.gene_info.gz when loading genes into the seed database.
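The header renaming described above could be scripted along these lines. This is only a sketch: the function name rename_gistic_header is hypothetical, and it rewrites just the header line, leaving all other columns (including Cytoband) untouched.

```python
# Column renames taken from the paragraph above.
RENAMES = {
    "Gene Symbol": "Hugo_Symbol",
    "Gene ID": "Entrez_Gene_Id",
    "Locus ID": "Entrez_Gene_Id",
}


def rename_gistic_header(header_line):
    """Rename GISTIC 2.0 gene columns to the names cBioPortal expects."""
    columns = header_line.rstrip("\n").split("\t")
    return "\t".join(RENAMES.get(name, name) for name in columns)


print(rename_gistic_header("Gene Symbol\tLocus ID\tCytoband\tSAMPLE_ID_1"))
```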

Continuous Copy Number Data

Meta file

The continuous copy number metadata file should contain the following fields:
    1. cancer_study_identifier: same value as specified in study meta file
    2. genetic_alteration_type: COPY_NUMBER_ALTERATION
    3. datatype: CONTINUOUS
    4. stable_id: linear_CNA
    5. show_profile_in_analysis_tab: false
    6. profile_name: A name for the copy number data, e.g., "copy-number values".
    7. profile_description: A description of the copy number data, e.g., "copy-number values for each gene (from Affymetrix SNP6).".
    8. data_filename: your datafile
    9. gene_panel (Optional): gene panel stable id
cBioPortal also supports log2 copy number data. If your data is in log2, change the following fields:
    1. datatype: LOG2-VALUE
    2. stable_id: log2CNA

Example

An example metadata file, e.g. meta_log2_cna.txt, would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: LOG2-VALUE
stable_id: log2CNA
show_profile_in_analysis_tab: false
profile_description: Log2 copy-number values for each gene (from Affymetrix SNP6).
profile_name: Log2 copy-number values
data_filename: data_log2_cna.txt

Data file

The log2 copy number data file follows the same format as expression data files. See Expression Data for a description of the expression data file format.

GISTIC 2.0 Format

GISTIC 2.0 outputs a tabular file similarly formatted to the cBioPortal format, called <prefix>_all_data_by_genes.txt. In this file the gene symbol is found in the Gene Symbol column, while Entrez gene IDs are in the Gene ID or Locus ID column. Please rename Gene Symbol to Hugo_Symbol and Gene ID or Locus ID to Entrez_Gene_Id. The Cytoband column can be kept in the table, but note that these values are ignored in cBioPortal. cBioPortal uses cytoband annotations from the map_location column in NCBI's Homo_sapiens.gene_info.gz when loading genes into the seed database.

Segmented Data

A SEG file (segmented data; .seg or .cbs) is a tab-delimited text file that lists loci and associated numeric values. The segmented data file format is the output of the Circular Binary Segmentation algorithm (Olshen et al., 2004). This Segment data enables the 'CNA' lane in the Genomic overview of the Patient view (as can be seen in this example).

Meta file

The segmented metadata file should contain the following fields:
    1. cancer_study_identifier: same value as specified in study meta file
    2. genetic_alteration_type: COPY_NUMBER_ALTERATION
    3. datatype: SEG
    4. reference_genome_id: Reference genome version. Supported values: "hg19"
    5. description: A description of the segmented data, e.g., "Segment data for the XYZ cancer study.".
    6. data_filename: your datafile

Example:

An example metadata file, e.g. meta_cna_hg19_seg.txt, would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: SEG
reference_genome_id: hg19
description: Somatic CNA data (copy number ratio from tumor samples minus ratio from matched normals) from TCGA.
data_filename: data_cna_hg19.seg

Data file

The first row contains column headings and each subsequent row contains a locus and an associated numeric value. See also the Broad IGV page on this format.

Example:

An example data file which includes the required column header would look like:
ID<TAB>chrom<TAB>loc.start<TAB>loc.end<TAB>num.mark<TAB>seg.mean
SAMPLE_ID_1<TAB>1<TAB>3208470<TAB>245880329<TAB>128923<TAB>0.0025
SAMPLE_ID_2<TAB>2<TAB>474222<TAB>5505492<TAB>2639<TAB>-0.0112
SAMPLE_ID_2<TAB>2<TAB>5506070<TAB>5506204<TAB>2<TAB>-1.5012
SAMPLE_ID_2<TAB>2<TAB>5512374<TAB>159004775<TAB>80678<TAB>-0.0013
...
...
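For readers who want to inspect a SEG file programmatically, here is a minimal parsing sketch. The Segment type and parse_seg function are hypothetical names; the expected header is taken from the example above, and the numeric types (integers for positions and marker counts, float for seg.mean) are an assumption based on the example values.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    sample_id: str
    chrom: str
    start: int
    end: int
    num_mark: int
    seg_mean: float


EXPECTED_HEADER = ["ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean"]


def parse_seg(text):
    """Parse tab-delimited SEG text into a list of Segment records."""
    lines = [line.split("\t") for line in text.splitlines() if line]
    if lines[0] != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {lines[0]}")
    return [
        Segment(r[0], r[1], int(r[2]), int(r[3]), int(r[4]), float(r[5]))
        for r in lines[1:]
    ]


seg_text = (
    "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n"
    "SAMPLE_ID_1\t1\t3208470\t245880329\t128923\t0.0025\n"
    "SAMPLE_ID_2\t2\t5506070\t5506204\t2\t-1.5012\n"
)
segments = parse_seg(seg_text)
print(segments[1])
```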

Expression Data

An expression data file is a two dimensional matrix with a gene per row and a sample per column. For each gene-sample pair, a real number represents the gene expression in that sample.

Meta file

The expression metadata file should contain the following fields:
    1. cancer_study_identifier: same value as specified in study meta file
    2. genetic_alteration_type: MRNA_EXPRESSION
    3. datatype: CONTINUOUS, DISCRETE or Z-SCORE
    4. stable_id: see table below.
    5. source_stable_id: Required when both conditions are true: (1) datatype = Z-SCORE and (2) this study contains GSVA data. Should contain stable_id of the expression file for which this Z-SCORE file is the statistic.
    6. show_profile_in_analysis_tab: false (you can set to true if Z-SCORE to enable it in the oncoprint, for example).
    7. profile_name: A name for the expression data, e.g., "mRNA expression (microarray)".
    8. profile_description: A description of the expression data, e.g., "Expression levels (Agilent microarray).".
    9. data_filename: your datafile
    10. gene_panel (Optional): gene panel stable id

Supported stable_id values for MRNA_EXPRESSION

For historical reasons, cBioPortal expects the stable_id to be one of those listed in the following static set. The stable_id for continuous RNA-seq data has two options: rna_seq_mrna or rna_seq_v2_mrna. These options were added to distinguish between two different TCGA pipelines, which perform different types of normalization (RPKM and RSEM). However, for custom datasets either one of these stable_id can be chosen.
datatype     stable_id                        description
CONTINUOUS   mrna_U133                        Affymetrix U133 Array
Z-SCORE      mrna_U133_Zscores                Affymetrix U133 Array
Z-SCORE      rna_seq_mrna_median_Zscores      RNA-seq data
Z-SCORE      mrna_median_Zscores              mRNA data
CONTINUOUS   rna_seq_mrna                     RNA-seq data
CONTINUOUS   rna_seq_v2_mrna                  RNA-seq data
Z-SCORE      rna_seq_v2_mrna_median_Zscores   RNA-seq data
CONTINUOUS   mirna                            MicroRNA data
Z-SCORE      mirna_median_Zscores             MicroRNA data
Z-SCORE      mrna_merged_median_Zscores       ?
CONTINUOUS   mrna                             mRNA data
DISCRETE     mrna_outliers                    mRNA data of outliers
Z-SCORE      mrna_zbynorm                     ?
CONTINUOUS   rna_seq_mrna_capture             data from Roche mRNA Capture Kit
Z-SCORE      rna_seq_mrna_capture_Zscores     data from Roche mRNA Capture Kit

Example

An example metadata file, e.g. meta_expression.txt, would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MRNA_EXPRESSION
datatype: CONTINUOUS
stable_id: rna_seq_mrna
show_profile_in_analysis_tab: false
profile_name: mRNA expression
profile_description: Expression levels
data_filename: data_expression.txt

Data file

For each gene (row) in the data file, the following columns are required in the order specified:
One or both of:
    Hugo_Symbol: A HUGO gene symbol.
    Entrez_Gene_Id: An Entrez Gene identifier.
And:
    An additional column for each sample in the dataset using the sample id as the column header.
For each gene-sample combination, a value is specified:
    A real number for each sample id (column) in the dataset, representing the expression value for the gene in the respective sample.
    or NA when the expression value for the gene in the respective sample could not be (or was not) measured or detected.

z-score instructions

For mRNA expression data, we typically expect the relative expression of an individual gene and tumor to the gene's expression distribution in a reference population. That reference population is either all tumors that are diploid for the gene in question, or, when available, normal adjacent tissue. The returned value indicates the number of standard deviations away from the mean of expression in the reference population (Z-score). This measure is useful to determine whether a gene is up- or down-regulated relative to the normal samples or all other tumor samples. Note, the importer tool can create normalized (z-score) expression data on your behalf. Please visit the Z-Score normalization script wiki page for more information. A corresponding z-score metadata file would be something like:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
stable_id: rna_seq_mrna_median_Zscores
show_profile_in_analysis_tab: true
profile_name: mRNA expression z-scores
profile_description: Expression levels z-scores
data_filename: data_expression_zscores.txt
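The z-score described above, the number of standard deviations between a sample's expression and the mean of the reference population, can be sketched for a single gene as follows. This is not the importer's normalization script: the function name z_scores is hypothetical, the use of the sample standard deviation is an assumption, and None stands in for NA values.

```python
from statistics import mean, stdev


def z_scores(expression, reference):
    """Z-scores of one gene's expression values against a reference population.

    expression: values for the samples to score (None for NA).
    reference:  values of the same gene in the reference population.
    """
    mu = mean(reference)
    sigma = stdev(reference)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("reference population has zero variance")
    return [None if value is None else (value - mu) / sigma for value in expression]


reference = [1.0, 2.0, 3.0]
print(z_scores([4.0, None, 2.0], reference))
```

A positive score means the gene is expressed above the reference mean in that sample, a negative score below it.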

Examples of data files:

An example data file which includes the required column header and leaves out Hugo_Symbol (recommended) would look like:
Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
116983<TAB>-0.005<TAB>-0.550<TAB>...
375790<TAB>0.142<TAB>0.091<TAB>...
...
...
An example data file which includes both Hugo_Symbol and Entrez_Gene_Id would look like (supported, but not recommended as it increases the chances of errors regarding ambiguous gene symbols):
Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
ACAP3<TAB>116983<TAB>-0.005<TAB>-0.550<TAB>...
AGRN<TAB>375790<TAB>0.142<TAB>0.091<TAB>...
...
...
An example data file with only Hugo_Symbol column (supported, but not recommended as it increases the chances of errors regarding ambiguous gene symbols):
Hugo_Symbol<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
ACAP3<TAB>-0.005<TAB>-0.550<TAB>...
AGRN<TAB>0.142<TAB>0.091<TAB>...
...
...

Mutation Data

When loading mutation data, the _sequenced case list is required. See the case list section.

Meta file

The mutation metadata file should contain the following fields:
    1.
    cancer_study_identifier: same value as specified in study meta file
    2.
    genetic_alteration_type: MUTATION_EXTENDED
    3.
    datatype: MAF
    4.
    stable_id: mutations
    5.
    show_profile_in_analysis_tab: true
    6.
    profile_name: A name for the mutation data, e.g., "Mutations".
    7.
    profile_description: A description of the mutation data, e.g., "Mutation data from whole exome sequencing.".
    8.
    data_filename: your data file
    9.
    gene_panel (optional): gene panel stable id. See Gene panels for mutation data.
    10.
    swissprot_identifier (optional): accession or name, indicating the type of identifier in the SWISSPROT column
    11.
    variant_classification_filter (optional): List of Variant_Classification values to be filtered out.
    12.
    namespaces (optional): Comma-delimited list of namespaces to import.

Gene panels for mutation data

Using the gene_panel property it is possible to annotate all samples in the MAF file as being profiled on the same specified gene panel.
Please use the Gene Panel Matrix file when:
    Data contains samples that are profiled but no mutations are called. Also please add these to the _sequenced case list.
    Multiple gene panels are used to profile the samples in the MAF file.

Variant classification filter

The variant_classification_filter field can be used to filter out specific mutations. This field should contain a comma-separated list of Variant_Classification values. By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene. For no filtering, include this field in the metadata file, but leave it empty. For cBioPortal default filtering, do not include this field in the metadata file. Allowed values to filter out (mainly from the Mutation Annotation Format page): Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, Intron, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame, Splice_Region and Unknown.
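
The three cases for variant_classification_filter (field absent, field empty, explicit list) can be sketched as follows. The function effective_filter and the dictionary representation of the meta file are hypothetical, and the TERT promoter exception is deliberately left out of this sketch.

```python
DEFAULT_FILTER = {"Silent", "Intron", "IGR", "3'UTR", "5'UTR", "3'Flank", "5'Flank"}


def effective_filter(meta):
    """Return the set of Variant_Classification values to filter out.

    `meta` maps meta-file field names to their raw string values.
    """
    if "variant_classification_filter" not in meta:
        return set(DEFAULT_FILTER)  # field absent: default filtering
    raw = meta["variant_classification_filter"].strip()
    if not raw:
        return set()  # field present but empty: no filtering
    return {value.strip() for value in raw.split(",") if value.strip()}


def keep_mutation(variant_classification, meta):
    """True if a record with this classification would be kept."""
    return variant_classification not in effective_filter(meta)
```

For example, keep_mutation("Silent", {}) is False under default filtering, while the same call with {"variant_classification_filter": ""} returns True.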

Tumor seq allele ambiguity

Bugs may exist in MAF data that make it ambiguous whether Tumor_Seq_Allele1 or Tumor_Seq_Allele2 should be seen as the variant allele when a new mutation record is created and imported into cBioPortal. In such cases, preference is given to the tumor seq allele value that matches a valid nucleotide pattern ^[ATGC]*$ over a null or empty value, or "-". For example, given Reference_Allele = "G", Tumor_Seq_Allele1 = "-", and Tumor_Seq_Allele2 = "A", preference will be given to Tumor_Seq_Allele2. Using this same example with Tumor_Seq_Allele1 = "T", preference will be given to Tumor_Seq_Allele1 if it does not match Reference_Allele, which in this case it does not.
When curating MAF data, it is best practice to leave Tumor_Seq_Allele1 empty if this information is not provided in your data source to avoid this ambiguity.
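
One possible reading of the rule above, as a minimal sketch: Tumor_Seq_Allele1 is preferred when it is a non-empty nucleotide string that differs from the reference allele, and Tumor_Seq_Allele2 is used otherwise. The function names are hypothetical and this is not the importer's own code.

```python
import re

NUCLEOTIDES = re.compile(r"^[ATGC]*$")


def is_usable_allele(allele):
    """True for a non-empty nucleotide string; False for None, "" or "-"."""
    return bool(allele) and allele != "-" and NUCLEOTIDES.match(allele) is not None


def resolve_variant_allele(reference, allele1, allele2):
    """Pick the tumor allele to treat as the variant allele."""
    if is_usable_allele(allele1) and allele1 != reference:
        return allele1
    return allele2


print(resolve_variant_allele("G", "-", "A"))  # A
print(resolve_variant_allele("G", "T", "A"))  # T
```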

Namespaces

The namespaces field can be used to specify additional MAF columns for import. This field should contain a comma-separated list of namespaces. Namespaces are identified as prefixes to an arbitrary set of additional MAF columns, separated from the attribute name with a period (e.g., ASCN.total_copy_number, ASCN.minor_copy_number). All columns with a prefix matching a namespace specified in the metafile will be imported; columns with an unspecified namespace will be ignored. If no additional columns beyond the required set need to be imported, the field should be left blank.

Example

An example metadata file would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_description: Mutation data from whole exome sequencing.
profile_name: Mutations
data_filename: data_mutations.txt
namespaces: ASCN
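
A metadata file like the one above is a sequence of "key: value" lines. The sketch below parses such a file and reports missing fields; parse_meta and missing_fields are hypothetical helpers, and treating the first eight fields of the mutation meta file list as required is an assumption based on the remaining ones being marked optional.

```python
REQUIRED_FIELDS = (
    "cancer_study_identifier", "genetic_alteration_type", "datatype",
    "stable_id", "show_profile_in_analysis_tab", "profile_name",
    "profile_description", "data_filename",
)


def parse_meta(text):
    """Parse 'key: value' lines into a dict, skipping blank lines."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        meta[key.strip()] = value.strip()
    return meta


def missing_fields(meta):
    """Return the assumed-required fields that are absent from `meta`."""
    return [name for name in REQUIRED_FIELDS if name not in meta]


example = """\
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_description: Mutation data from whole exome sequencing.
profile_name: Mutations
data_filename: data_mutations.txt
namespaces: ASCN
"""
meta = parse_meta(example)
print(missing_fields(meta))
```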

Data file

The mutation data file extends the Mutation Annotation Format (MAF) created as part of The Cancer Genome Atlas (TCGA) project, by adding extra annotations to each mutation record. This section describes two types of MAF files:
    1.
    A minimal MAF file with only the columns required for cBioPortal.
    2.
    An extended MAF file created with vcf2maf, maf2maf or the Genome Nexus Annotation Pipeline.

Minimal MAF format

A minimal mutation annotations file can contain just three of the MAF columns plus one annotation column. From this minimal MAF, it is possible to create an extended MAF by running maf2maf.
    1.
    Hugo_Symbol (Required): (MAF column) A HUGO gene symbol.
    2.
    Tumor_Sample_Barcode (Required): (MAF column) This is the sample ID as listed in the clinical data file.
    3.
    Variant_Classification (Required): (MAF column) Translational effect of variant allele. Allowed values (from Mutation Annotation Format page): Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, Intron, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame. cBioPortal skips the following types during the import: Silent, Intron, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR and RNA. Two extra values are allowed by cBioPortal here as well: Splice_Region, Unknown.
    The values are case-sensitive, e.g. missense_mutation is not allowed, while Missense_Mutation is.
    4.
    HGVSp_Short (Required): (annotation column) Amino Acid Change, e.g. p.V600E.
In addition to Hugo_Symbol, it is recommended to include the Entrez gene ID:
    1.
    Entrez_Gene_Id (Optional, but recommended): An Entrez Gene identifier.
The following extra annotation columns are important for making sure mutation specific UI functionality works well in the portal:
    1.
    Protein_position (Optional): (annotation column) Required to initialize the 3D viewer in mutations view
    2.
    SWISSPROT (Optional): (annotation column) UniProtKB/SWISS-PROT name (formerly called ID) or accession code, depending on the value of the swissprot_identifier metadatum, e.g. O11H1_HUMAN or Q8NG94. It is not required, but omitting it may result in inconsistent PDB structure matching in mutations view.
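
The requirements for a minimal MAF listed above can be checked with a short script. This is a sketch only: check_minimal_maf is a hypothetical helper, and it covers just the four required columns and the case-sensitive Variant_Classification values.

```python
REQUIRED_COLUMNS = ("Hugo_Symbol", "Tumor_Sample_Barcode",
                    "Variant_Classification", "HGVSp_Short")

ALLOWED_CLASSIFICATIONS = {
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Missense_Mutation", "Nonsense_Mutation", "Silent", "Splice_Site",
    "Translation_Start_Site", "Nonstop_Mutation", "3'UTR", "3'Flank",
    "5'UTR", "5'Flank", "IGR", "Intron", "RNA", "Targeted_Region",
    "De_novo_Start_InFrame", "De_novo_Start_OutOfFrame",
    "Splice_Region", "Unknown",  # the two extra values cBioPortal allows
}


def check_minimal_maf(header, rows):
    """Return a list of problems found in a minimal MAF (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems
    col = header.index("Variant_Classification")
    for number, fields in enumerate(rows, start=2):  # line 1 is the header
        if fields[col] not in ALLOWED_CLASSIFICATIONS:
            problems.append(
                f"line {number}: bad Variant_Classification {fields[col]!r}")
    return problems
```

Because the comparison is exact, a value such as missense_mutation is reported as a problem while Missense_Mutation passes.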

Creating an extended MAF file with vcf2maf or maf2maf

If your mutation data is already in VCF format (which most variant callers produce by default) you can use the vcf2maf converter. This tool parses VCF and MAF files, runs Ensembl Variant Effect Predictor (VEP) and selects a single effect per variant. Protein identifiers should be mapped to UniProt canonical isoforms by adding the --custom-enst flag and this mapping file. This will override the Ensembl canonical isoforms with UniProt canonical isoforms, which ensures the SWISSPROT column can be used correctly by cBioPortal.

Extended MAF format

The extended MAF format recognized by the portal has:
    32 columns from the TCGA MAF format.
    1 column with the amino acid change.
    4 columns with information on reference and variant allele counts in tumor and normal samples.
    1.
    Hugo_Symbol (Required): A HUGO gene symbol.
    2.
    Entrez_Gene_Id (Optional, but recommended): An Entrez Gene identifier.
    3.
    Center (Optional): The sequencing center.
    4.
    NCBI_Build [1] (Required): The Genome Reference Consortium build used by the variant calling software. It must be "GRCh37" or "GRCh38" for human, and "GRCm38" for mouse.
    5.
    Chromosome (Optional): A chromosome number, e.g., "7".
    6.
    Start_Position (Optional): Start position of event.
    7.
    End_Position (Optional): End position of event.
    8.
    Strand (Optional): We assume that the mutation is reported for the + strand.
    9.
    Variant_Classification (Required): Translational effect of variant allele, e.g. Missense_Mutation, Silent, etc.
    10.
    Variant_Type [1] (Optional): Variant Type, e.g. SNP, DNP, etc.
    11.
    Reference_Allele (Optional): The plus strand reference allele at this position.
    12.
    Tumor_Seq_Allele1 (Optional): Primary data genotype.
    13.
    Tumor_Seq_Allele2 (Optional): Primary data genotype.
    14.
    dbSNP_RS [1] (Optional): Latest dbSNP rs ID.
    15.
    dbSNP_Val_Status [1] (Optional): dbSNP validation status.
    16.
    Tumor_Sample_Barcode (Required): This is the sample ID. Either a TCGA barcode (patient identifier will be extracted), or for non-TCGA data, a literal SAMPLE_ID as listed in the clinical data file.
    17.
    Matched_Norm_Sample_Barcode [1] (Optional): The sample ID for the matched normal sample.
    18.
    Match_Norm_Seq_Allele1 (Optional): Primary data.
    19.
    Match_Norm_Seq_Allele2 (Optional): Primary data.
    20.
    Tumor_Validation_Allele1 (Optional): Secondary data from orthogonal technology.
    21.
    Tumor_Validation_Allele2 (Optional): Secondary data from orthogonal technology.
    22.
    Match_Norm_Validation_Allele1 [1] (Optional): Secondary data from orthogonal technology.
    23.
    Match_Norm_Validation_Allele2 [1] (Optional): Secondary data from orthogonal technology.
    24.
    Verification_Status [1] (Optional): Second pass results from independent attempt using same methods as primary data source. "Verified", "Unknown" or "NA".
    25.
    Validation_Status (Optional): Second pass results from orthogonal technology. "Valid", "Invalid", "Untested", "Inconclusive", "Redacted", "Unknown" or "NA".
    26.
    Mutation_Status (Optional): "Somatic" or "Germline" are supported by the UI in Mutations tab. "None", "LOH" and "Wildtype" will not be loaded. Other values will be displayed as text.
    27.
    Sequencing_Phase [1] (Optional): Indicates current sequencing phase.
    28.
    Sequence_Source [1] (Optional): Molecular assay type used to produce the analytes used for sequencing.
    29.
    Validation_Method [1] (Optional): The assay platforms used for the validation call.
    30.
    Score [1] (Optional): Not used.
    31.
    BAM_File [1] (Optional): Not used.
    32.
    Sequencer [1] (Optional): Instrument used to produce primary data.
    33.
    HGVSp_Short (Required): Amino Acid Change, e.g. p.V600E.
    34.
    t_alt_count (Optional): Variant allele count (tumor).
    35.
    t_ref_count (Optional): Reference allele count (tumor).
    36.
    n_alt_count (Optional): Variant allele count (normal).
    37.
    n_ref_count (Optional): Reference allele count (normal).
[1] These columns are currently not shown in the Mutation tab and Patient view.

Custom driver annotations

It is possible to manually add columns for defining custom driver annotations. These annotations can be used to complement or replace default driver annotation resources OncoKB and HotSpots.
    1.
    cbp_driver (Optional): "Putative_Driver", "Putative_Passenger", "Unknown", "NA" or "" (empty value). This field must be present if the cbp_driver_annotation is also present in the MAF file.
    2.
    cbp_driver_annotation (Optional): Description field for the cbp_driver value (limited to 80 characters). This field must be present if the cbp_driver is also present in the MAF file. This field is free text. Example values for this field are: "Pathogenic" or "VUS".
    3.
    cbp_driver_tiers (Optional): Free label/category that marks the mutation as a putative driver, such as "Driver", "Highly actionable" or "Potential drug target". This field must be present if the cbp_driver_tiers_annotation is also present in the MAF file. In the OncoPrint view's Mutation Color dropdown menu, these tiers are ordered alphabetically. This field is free text and limited to 20 characters. For mutations without a custom annotation, leave the field blank or type "NA".
    4.
    cbp_driver_tiers_annotation (Optional): Description field for the cbp_driver_tiers value (limited to 80 characters). This field must be present if the cbp_driver_tiers is also present in the MAF file, and it cannot be present when the cbp_driver_tiers field is absent.
The cbp_driver column flags the mutation as either driver or passenger. In cBioPortal, passenger mutations are also known as variants of unknown significance (VUS). The cbp_driver_tiers column assigns an annotation tier to the mutation, such as "Driver", "Highly actionable" or "Potential drug target". When a tier is selected, mutations with that annotation are highlighted as driver. Both types of custom annotations contain a second column with the suffix _annotation, to add a description. This is displayed in the tooltip that appears when hovering over the sample's custom annotation icon in the OncoPrint view.
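
The pairing and length rules for the four custom driver columns can be expressed as a small validation sketch. The helpers check_driver_header and check_driver_row are hypothetical and only mirror the constraints listed above; they are not part of the cBioPortal validator.

```python
ALLOWED_DRIVER_VALUES = {"Putative_Driver", "Putative_Passenger", "Unknown", "NA", ""}
PAIRED_COLUMNS = (("cbp_driver", "cbp_driver_annotation"),
                  ("cbp_driver_tiers", "cbp_driver_tiers_annotation"))


def check_driver_header(header):
    """Each value column and its _annotation column must appear together."""
    return [f"{a} and {b} must both be present"
            for a, b in PAIRED_COLUMNS if (a in header) != (b in header)]


def check_driver_row(row):
    """Check one MAF row, given as a dict of column name to value."""
    problems = []
    if row.get("cbp_driver", "") not in ALLOWED_DRIVER_VALUES:
        problems.append("cbp_driver has an unsupported value")
    if len(row.get("cbp_driver_tiers", "")) > 20:
        problems.append("cbp_driver_tiers exceeds 20 characters")
    for column in ("cbp_driver_annotation", "cbp_driver_tiers_annotation"):
        if len(row.get(column, "")) > 80:
            problems.append(f"{column} exceeds 80 characters")
    return problems
```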
You can learn more about configuring these annotations in the portal.properties documentation. When properly configured, the customized annotations appear in the "Mutation Color" menu of the OncoPrint view.

Adding your own mutation annotation columns

You can also add your own mutation annotation columns to the extended MAF rows. The portal will then parse and store your own MAF fields in the database. For example, mutation data that you find on cBioPortal.org comes from MAF files that have been further enriched with information from mutationassessor.org, which leads to a "Mutation Assessor" column in the mutation table.

Adding mutation annotation columns through namespaces

Additional columns may also be added to the MAF and imported through the namespace mechanism. Any columns starting with a prefix specified in the namespaces field in the metafile will be imported into the database. Namespace columns should be formatted as the namespace and the namespace attribute separated with a period (e.g., ASCN.total_copy_number, where ASCN is the namespace and total_copy_number is the attribute).
An example MAF with the following additional columns:
ASCN.total_copy_number ASCN.clonal MUTATION.name MUTATION.type
imported with the following namespaces field in the metafile:
namespaces: ascn
will import the ASCN.total_copy_number and ASCN.clonal columns into the database. MUTATION.name and MUTATION.type will be ignored because mutation is not specified in the namespaces field.
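
The selection of namespace columns can be sketched as below. The function namespace_columns is hypothetical, and matching the prefix case-insensitively is an assumption inferred from the example above, where the field value ascn selects columns prefixed with ASCN.

```python
def namespace_columns(header, namespaces_field):
    """Return the header columns whose prefix matches a declared namespace."""
    declared = {n.strip().lower() for n in namespaces_field.split(",") if n.strip()}
    selected = []
    for column in header:
        prefix, dot, attribute = column.partition(".")
        if dot and attribute and prefix.lower() in declared:
            selected.append(column)
    return selected


header = ["ASCN.total_copy_number", "ASCN.clonal", "MUTATION.name", "MUTATION.type"]
print(namespace_columns(header, "ascn"))
# ['ASCN.total_copy_number', 'ASCN.clonal']
```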

Representation of namespace columns by mutation API endpoints

Columns added through namespaces will be returned by mutation API endpoints. Namespace data will be available in the namespaceColumns property of the respective JSON representations of mutation records. The namespaceColumns property will be a JSON object where namespace data is keyed by the name of the namespace in lowercase. For instance, when namespace ZYGOSITY is defined in the meta file and the data file has column ZYGOSITY.status with value Homozygous for a mutation row, the API will return the following JSON record for this mutation (only relevant fields are shown):
{
  "namespaceColumns": {
    "zygosity": {
      "status": "Homozygous"
    }
  }
}
Note: ASCN namespace data is not exported via the namespaceColumns field.
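
To make the shape of namespaceColumns concrete, the sketch below groups the namespace columns of one MAF row under lowercase namespace keys. It only illustrates the JSON structure described above; the function namespace_columns_json is hypothetical and says nothing about how the API is implemented.

```python
import json


def namespace_columns_json(row, namespaces):
    """Group 'NAMESPACE.attribute' values under lowercase namespace keys."""
    declared = {n.lower() for n in namespaces}
    grouped = {}
    for column, value in row.items():
        prefix, dot, attribute = column.partition(".")
        key = prefix.lower()
        if not dot or key not in declared or key == "ascn":
            continue  # ASCN data is not exported via namespaceColumns
        grouped.setdefault(key, {})[attribute] = value
    return {"namespaceColumns": grouped}


# Hypothetical row; only the ZYGOSITY.status column carries namespace data.
row = {"Hugo_Symbol": "ACAP3", "ZYGOSITY.status": "Homozygous"}
print(json.dumps(namespace_columns_json(row, ["ZYGOSITY"]), indent=2))
```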

Allele specific copy number (ASCN) annotations

Allele specific copy number (ASCN) annotation is also supported and may be added using namespaces, described here. If ASCN data is present in the MAF, the deployed cBioPortal instance will display additional columns in the mutation table showing ASCN data.
The ASCN columns below are optional by default. If ascn is a defined namespace in meta_mutations_extended.txt, then these columns are ALL required.
    1.
    ASCN.ASCN_METHOD (Optional): Method used to obtain ASCN data, e.g. "FACETS".
    2.
    ASCN.CCF_EXPECTED_COPIES (Optional): Cancer-cell fraction if mutation exists on major allele.
    3.
    ASCN.CCF_EXPECTED_COPIES_UPPER (Optional): Upper error for CCF estimate.
    4.
    ASCN.EXPECTED_ALT_COPIES (Optional): Estimated number of copies harboring mutant allele.
    5.
    ASCN.CLONAL (Optional): "Clonal", "Subclonal", or "Indeterminate".
    6.
    ASCN.TOTAL_COPY_NUMBER (Optional): Total copy number of the gene.
    7.
    ASCN.MINOR_COPY_NUMBER (Optional): Copy number of the minor allele.
    8.
    ASCN.ASCN_INTEGER_COPY_NUMBER (Optional): Absolute integer copy-number estimate.

Example MAF

An example MAF can be found in the cBioPortal test study study_es_0.

Filtered mutations

A special case for Entrez_Gene_Id=0 and Hugo_Symbol=Unknown: when this combination is given, the record is parsed in the same way as Variant_Classification=IGR and therefore filtered out.

Methylation Data

The Portal expects a single value for each gene in each sample, usually a beta-value from the Infinium methylation array platform.

Meta file

The methylation metadata file should contain the following fields:
    1.
    cancer_study_identifier: same value as specified in study meta file
    2.
    genetic_alteration_type: METHYLATION
    3.
    datatype: CONTINUOUS
    4.
    stable_id: "methylation_hm27" or "methylation_hm450" (depending on platform).
    5.
    show_profile_in_analysis_tab: false
    6.
    profile_name: A name for the methylation data, e.g., "Methylation (HM27)".
    7.
    profile_description: A description of the methylation data, e.g., "Methylation beta-values (HM27 platform). For genes with multiple methylation probes, the probe least correlated with expression is selected.".
    8.
    data_filename: your data file
    9.
    gene_panel (Optional): gene panel stable id

Example

An example metadata file would be:
cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: METHYLATION
datatype: CONTINUOUS
stable_id: methylation_hm27
show_profile_in_analysis_tab: false
profile_name: Methylation (HM27)
profile_description: Methylation beta-values (HM27 platform). For genes with multiple methylation probes, the probe least correlated with expression is selected.
data_filename: data_methylation_hm27.txt

Data file

The methylation data file follows the same format as expression data files. See Expression Data for a description of the expression data file format. The Portal expects a single value for each gene in each sample, usually a beta-value from the Infinium methylation array platform.

Protein level Data

Protein level data describes protein expression measured by reverse-phase protein array or mass spectrometry. The data consists of antibody-sample pairs, each with a real number representing the protein level for that sample.

Meta file

The protein level metadata file should contain the following fields:
    1.
    cancer_study_identifier: same value as specified in study meta file
    2.
    genetic_alteration_type: PROTEIN_LEVEL
    3.
    datatype: LOG2-VALUE or Z-SCORE
    4.
    stable_id: rppa, rppa_Zscores, protein_quantification or protein_quantification_zscores
    5.
    show_profile_in_analysis_tab: false (true for Z-SCORE datatype)
    6.