Mutation data transcript annotation
This document describes how each mutation in cBioPortal gets annotated with a specific gene symbol + protein change.
This section explains the concepts of protein isoforms and transcripts.
What is an isoform?
From a single gene (a string of nucleotides), multiple protein sequences (strings of amino acids) can be formed. For example, the parts of the gene that code for protein (exons) can be included or excluded through a process known as alternative splicing. Each of the resulting proteins is called an isoform. A single mutation can affect the isoforms differently: in one isoform it might change a P to a T, while in another isoform that particular exon is not included, so the amino acid sequence does not change at all. For convenience, cBioPortal assigns a single gene symbol + protein change to each mutation. In most cases this works well because only one protein isoform is relevant in a clinical setting. There are of course exceptions, and we are therefore working on improving this representation. The relation between transcripts and protein isoforms is explained in the next section.
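The effect of exon inclusion on a protein change can be illustrated with a toy model. The sketch below is illustrative only: the exon sequences, isoform names, and helper functions are made up for this example and do not correspond to any real gene or to cBioPortal code. The codon table is a small subset of the standard genetic code.

```python
# Subset of the standard genetic code, covering only the codons used below.
CODON_TO_AA = {
    "ATG": "M", "CCT": "P", "ACT": "T", "GGA": "G", "AAA": "K", "TAA": "*",
}

# Hypothetical exons; each length is a multiple of 3 so skipping one keeps the frame.
EXONS = {"e1": "ATGGGA", "e2": "CCTAAA", "e3": "GGATAA"}

# Two hypothetical isoforms: isoform_B skips exon e2.
ISOFORMS = {"isoform_A": ["e1", "e2", "e3"], "isoform_B": ["e1", "e3"]}


def splice(exons, included):
    """Concatenate the included exons into one coding sequence."""
    return "".join(exons[name] for name in included)


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TO_AA[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_snv(exons, exon_name, offset, alt):
    """Return a copy of the exons with a single base substituted."""
    seq = exons[exon_name]
    mutated = dict(exons)
    mutated[exon_name] = seq[:offset] + alt + seq[offset + 1:]
    return mutated


# One mutation in exon e2: CCT (P) becomes ACT (T).
MUTATED = apply_snv(EXONS, "e2", 0, "A")

for name, included in ISOFORMS.items():
    ref = translate(splice(EXONS, included))
    alt = translate(splice(MUTATED, included))
    print(name, ref, alt, "changed" if ref != alt else "unchanged")
```

In this toy model the same genomic change alters the protein of the isoform that includes exon e2 and leaves the other isoform untouched, which is why a single gene symbol + protein change is always relative to one chosen transcript.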
What is a transcript?
DNA is transcribed to a pre-mRNA transcript, which includes intron and exon regions. Splicing and other processes then form the mature mRNA transcript, which contains only exons and can subsequently be translated to a protein sequence. An mRNA transcript can thus be associated with a specific protein isoform. The Ensembl database assigns IDs to these transcripts with names like ENSTxxx. You can see this on, for example, the Ensembl website for the BRAF gene: ENST00000288602.6 is 2480 base pairs long (nucleotides A/C/G/T) and the associated protein isoform is 766 amino acids long (V/P/etc.). The same transcript and protein isoform are shown on cBioPortal.
For each gene name in cBioPortal a canonical/default transcript is assigned. These assignments are stored in Genome Nexus and explained below. Although cBioPortal does not store changes to different transcripts/isoforms for each mutation in the database itself, it does allow viewing them on the Mutations Tab by re-annotating the mutations on the fly through Genome Nexus whenever a user clicks on the transcript dropdown.
The cBioPortal database stores one gene + protein change annotation for each mutation event in the database. To allow
comparing mutation data across studies it is important to annotate the mutation data (be it in MAF or VCF format) in
the same way, otherwise the gene + protein changes can mean entirely different things. For all public studies stored
in datahub we leverage Genome Nexus to do so.
Genome Nexus assigns one canonical Ensembl transcript + gene name + protein change to each mutation. You can find the mapping of Hugo symbol to transcript ID here. There are two sets of default transcripts: uniprot and mskcc. We recommend using the mskcc set of transcripts when starting from scratch, since these are more up to date and correspond to transcripts that were chosen as relevant for clinical sequencing at MSKCC. The uniprot set of transcripts was constructed several years ago, but we are no longer certain about the logic used to construct it, so it is not being kept up to date. The differences between the two can be seen in this file. For the public cBioPortal (https://www.cbioportal.org) and datahub we use mskcc; for the GENIE cBioPortal (https://genie.cbioportal.org) we still use uniprot. As of cBioPortal v5 the default for local installations is mskcc. Prior to v5 it was uniprot. We recommend that people upgrading to v5 consider migrating to mskcc as well (see the migration guide and the properties reference docs).
How default transcript assignment affects the Mutations Tab
The Mutations Tab shows the full protein sequence. The one shown by default is the canonical transcript (mskcc or uniprot, depending on configuration). The mutations are drawn on the lollipop plot based on the protein position found in the cBioPortal database. For the public cBioPortal, all mutation data in MAF format are annotated using Genome Nexus to add the gene and protein change columns. The result is then imported into the cBioPortal database. Whichever set of transcripts you choose to use, make sure to indicate it in the Genome Nexus Annotation Pipeline (--isoform-override <mskcc or uniprot>) when annotating as well as in the properties file of cBioPortal. That way the Mutations Tab will show the correct canonical transcript. Note that whenever somebody uses the dropdown on the Mutations Tab to change the displayed transcript, Genome Nexus re-annotates all mutations on the fly. The browser sends over the genomic location (chrom, start, end, ref, alt) to get the protein change information for each transcript. Since many of the annotations are available for the canonical transcripts only, we currently hide annotations for non-canonical transcripts.
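As a rough illustration of the on-the-fly request described above, the sketch below builds a Genome Nexus annotation URL from a genomic location. It is a sketch under assumptions rather than the portal's actual client code: the base URL, the /annotation/genomic/ path, and the isoformOverrideSource query parameter are assumed here and should be checked against the Genome Nexus API documentation. The helper function names are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumption: public Genome Nexus instance; a local installation would differ.
GENOME_NEXUS_BASE = "https://www.genomenexus.org"

VALID_OVERRIDES = ("mskcc", "uniprot")


def genomic_location(chrom, start, end, ref, alt):
    """Join the five fields into a comma-separated genomic location string."""
    return ",".join(str(part) for part in (chrom, start, end, ref, alt))


def annotation_url(location, isoform_override):
    """Build an annotation URL (path and parameter name are assumptions)."""
    if isoform_override not in VALID_OVERRIDES:
        raise ValueError(f"unknown transcript set: {isoform_override!r}")
    path = "/annotation/genomic/" + quote(location, safe="")
    query = urlencode({"isoformOverrideSource": isoform_override})
    return f"{GENOME_NEXUS_BASE}{path}?{query}"


# Example: the grch37 coordinates of the BRAF V600E substitution.
location = genomic_location(7, 140453136, 140453136, "A", "T")
print(annotation_url(location, "mskcc"))
```

The example only constructs the URL and performs no network request, so it can be run without access to a Genome Nexus server.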
Plans for default transcripts
We are planning to move to a single set of default transcripts over time. Prior to v5, uniprot was used for the public-facing portals and local installations. Our plan is to use mskcc everywhere, and eventually we will most likely move to MANE. MANE is only available for grch38, and since most of our data is for grch37, this is currently not feasible. Whichever set of transcripts you choose to use, make sure to indicate it in the Genome Nexus Annotation Pipeline (--isoform-override <mskcc or uniprot>) and put the same set of transcripts in the properties file of cBioPortal, so that the Mutations Tab will show the correct canonical transcript (currently defaults to mskcc). The re-annotation of mutations only happens once a user clicks to change the transcript, which is why it is important that the protein change in the database is for the specific transcript displayed first.
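One way to guard against a mismatch between the two settings is a small pre-import check. The sketch below is a hypothetical helper, not part of cBioPortal or Genome Nexus. The --isoform-override flag comes from the text above, but the property key genomenexus.isoform_override_source and the example command line are assumptions; consult the properties reference docs for the actual key name.

```python
import shlex

# Assumption: name of the cBioPortal property that selects the transcript set.
PROPERTY_KEY = "genomenexus.isoform_override_source"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def override_from_command(command):
    """Return the value following --isoform-override, or None if absent."""
    args = shlex.split(command)
    if "--isoform-override" in args:
        idx = args.index("--isoform-override")
        if idx + 1 < len(args):
            return args[idx + 1]
    return None


def settings_agree(command, properties_text, portal_default="mskcc"):
    """True when the pipeline flag and the portal property name the same set."""
    pipeline = override_from_command(command)
    portal = parse_properties(properties_text).get(PROPERTY_KEY, portal_default)
    return pipeline == portal


# Hypothetical annotation command and properties file contents.
command = "annotator --isoform-override uniprot"
properties_text = "# portal settings\ngenomenexus.isoform_override_source=mskcc\n"
print(settings_agree(command, properties_text))
```

Here the check reports a mismatch because the annotation used uniprot while the portal is configured for mskcc. In that situation the protein changes stored in the database would not correspond to the transcript the Mutations Tab displays first.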