Understand the effects of genetic variation and mutations with DNA sequencing data analysis.
The types of DNA sequencing
The type of DNA sequencing that is applied affects the ensuing bioinformatics primarily by limiting the analyses to particular genomic regions. Whole-genome sequencing (WGS) enables analysis of mutations anywhere in the genome, including the vast intergenic non-coding regions. Whole-exome sequencing (WES) limits the analysis to mutations in the protein-coding part of the genome. Targeted sequencing further limits the analysis to predetermined loci, such as a panel of known cancer genes.
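As a rough illustration of how the assay restricts downstream analysis, the sketch below keeps only the variants that fall inside a set of target intervals, as would apply to an exome or a gene panel. The interval coordinates, the example variants and the function name `in_targets` are hypothetical, and positions are assumed to be 0-based with half-open intervals (BED-style).

```python
from bisect import bisect_right

# Hypothetical target regions: chromosome -> sorted, non-overlapping [start, end) intervals.
TARGETS = {
    "chr7": [(100, 200), (500, 650)],
    "chr17": [(1000, 1200)],
}

def in_targets(chrom, pos, targets=TARGETS):
    """Return True if the 0-based position lies inside one of the target intervals."""
    intervals = targets.get(chrom, [])
    starts = [start for start, _ in intervals]
    # Index of the last interval whose start is <= pos (or -1 if there is none).
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# Hypothetical variant positions as (chromosome, position) pairs.
variants = [("chr7", 150), ("chr7", 300), ("chr17", 1199), ("chr17", 1200), ("chr1", 42)]
on_target = [v for v in variants if in_targets(*v)]
print(on_target)
```

With WGS data no such filter is needed, whereas for WES or targeted data any variant outside the captured regions is usually treated with caution because coverage there is incidental.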
Identifying and annotating mutations
The raw data from a DNA-sequencing experiment is quality-controlled and aligned against a reference genome. Variants may then be identified (or “called”) using a mutation caller pipeline. When calling somatic mutations specifically, the mutation caller takes both the tumor and normal DNA-seq data as input to distinguish between somatic and germline mutations.
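The tumor versus normal comparison can be sketched with simple read counts. This is only a toy illustration: real somatic callers use statistical models rather than fixed cut-offs, and the function names and all thresholds below are assumptions chosen for the example.

```python
def vaf(alt_reads, depth):
    """Variant allele fraction; 0.0 when the site has no coverage."""
    return alt_reads / depth if depth else 0.0

def classify(tumor_alt, tumor_depth, normal_alt, normal_depth,
             min_depth=10, min_tumor_vaf=0.05, max_normal_vaf=0.02):
    """Toy classification of one site from tumor and normal read counts.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if tumor_depth < min_depth or normal_depth < min_depth:
        return "uncallable"      # too little coverage to decide either way
    if vaf(normal_alt, normal_depth) > max_normal_vaf:
        return "germline"        # variant is also supported in the normal sample
    if vaf(tumor_alt, tumor_depth) >= min_tumor_vaf:
        return "somatic"         # supported in the tumor only
    return "no_variant"

print(classify(30, 100, 0, 80))   # tumor-only support
print(classify(48, 100, 40, 80))  # support in both samples
```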
A mutation caller is designed to look for a specific type of mutation, such as small variants, copy-number variants or structural variants. Small variants comprise substitutions, insertions and deletions of one or a few nucleotides. Copy-number variants are amplification or deletion events affecting larger segments of DNA. Structural variants include even more complex DNA alterations such as inter-chromosomal translocations and inversions of DNA segments.
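For small variants, the three categories can be told apart from the reference and alternate alleles alone. The following minimal sketch assumes VCF-style allele strings and a hypothetical helper name `small_variant_type`; complex events that both delete and insert bases are simply classified by their net length change here.

```python
def small_variant_type(ref, alt):
    """Classify a small variant from its reference and alternate allele strings."""
    if not ref or not alt or ref == alt:
        raise ValueError("ref and alt must be non-empty and different")
    if len(ref) == len(alt):
        return "substitution"    # single- or multi-nucleotide substitution
    return "insertion" if len(alt) > len(ref) else "deletion"

# Hypothetical allele pairs.
for ref, alt in [("C", "T"), ("A", "ATG"), ("GCA", "G"), ("AC", "GT")]:
    print(ref, alt, small_variant_type(ref, alt))
```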
Associating mutations with clinical and phenotypic variables
A key part of a mutation analysis workflow is visualizing the identified mutations and associating them with other variables. Typical visualizations include oncoplots (or waterfall plots), which show the mutational status of multiple genes across the analyzed patients, and lollipop plots, which highlight the positions of mutations along the amino acid sequence of a mutated protein-coding gene.
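The data structure behind an oncoplot is a gene-by-patient matrix of mutational status. The sketch below builds one from a list of hypothetical (patient, gene) calls and prints it as text, with the most frequently mutated genes first. Plotting libraries typically also reorder the patients, which is omitted here.

```python
from collections import defaultdict

# Hypothetical mutation calls as (patient, gene) pairs.
calls = [("P1", "TP53"), ("P2", "TP53"), ("P3", "TP53"),
         ("P1", "KRAS"), ("P4", "KRAS"), ("P2", "PIK3CA")]
patients = ["P1", "P2", "P3", "P4"]

# Collect the set of mutated patients for each gene.
mutated = defaultdict(set)
for patient, gene in calls:
    mutated[gene].add(patient)

# Order genes by mutation frequency (descending), breaking ties by name.
genes = sorted(mutated, key=lambda g: (-len(mutated[g]), g))
matrix = {g: [p in mutated[g] for p in patients] for g in genes}

# One row per gene: X marks a mutated patient, followed by the mutation frequency.
for g in genes:
    row = " ".join("X" if hit else "." for hit in matrix[g])
    print(f"{g:<7}{row}  {len(mutated[g]) / len(patients):.0%}")
```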
Statistical tests can be used to compare mutated genes in different sample groups (different cancer types, primary vs metastatic tumors, before vs after treatment etc.). Mutation frequencies, odds ratios and p-values are typical statistics reported in such analyses. Similarly, mutations can be associated with continuous variables such as the patient’s age, tumor size or level of a blood biomarker.
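One common choice for comparing a gene's mutation frequency between two groups is Fisher's exact test on a 2x2 table, reported together with an odds ratio. The sketch below is a standard-library illustration with hypothetical counts. In practice a library routine such as `scipy.stats.fisher_exact` would normally be used, and p-values would be corrected for multiple testing when many genes are examined.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: rows are metastatic and primary tumors,
# columns are mutated and wild-type samples for one gene.
table = (8, 2, 2, 8)
print("mutation frequencies:", 8 / 10, 2 / 10)
print("odds ratio:", odds_ratio(*table))
print("p-value:", round(fisher_exact_two_sided(*table), 4))
```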
Survival analysis can be used to associate mutations with clinical endpoints such as death from cancer or relapse. Survival analyses typically rely on Kaplan-Meier estimators, Cox regression or machine learning approaches.
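To make the Kaplan-Meier estimator concrete, here is a minimal standard-library sketch that computes a survival curve for each of two patient groups. The follow-up times and event indicators are invented for illustration, and a real analysis would add confidence intervals and a formal comparison such as a log-rank test.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each event time.

    times:  follow-up durations
    events: 1 if the endpoint was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        at_risk -= removed
    return curve

# Hypothetical follow-up in months for patients with and without a mutation.
mutated_curve = kaplan_meier([5, 8, 8, 12, 15, 20], [1, 1, 0, 1, 0, 1])
wildtype_curve = kaplan_meier([10, 18, 25, 30, 30, 36], [0, 1, 0, 1, 0, 0])
for label, curve in [("mutated", mutated_curve), ("wild-type", wildtype_curve)]:
    print(label, [(t, round(s, 3)) for t, s in curve])
```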
Mutational signature analysis
The frequencies of different types of nucleotide substitutions observed in a tumor’s DNA carry information on the causes of those mutations. Put simply, one type of mutagen may cause predominantly T>A substitutions whereas another may cause G>C substitutions. Comparing the pattern of observed substitution frequencies against previously characterized mutational signatures enables quantifying those signatures in a tumor. This yields insight into the etiology of the cancer, and mutational signatures are potential prognostic markers in their own right.
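The idea of comparing substitution patterns can be sketched as follows. Substitutions are first collapsed onto the pyrimidine reference base, so that G>C is counted as C>G on the opposite strand, and the resulting six-class spectrum is compared with reference profiles by cosine similarity. The two profiles `SIG_A` and `SIG_B` are made up for this example. Published catalogues typically use 96 classes that include the neighboring bases, and signature contributions are usually estimated by a non-negative fit rather than by picking the single best match.

```python
from collections import Counter
from math import sqrt

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def substitution_class(ref, alt):
    """Collapse a substitution onto the pyrimidine (C or T) reference base."""
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def spectrum(substitutions):
    """Relative frequency of each of the six substitution classes."""
    counts = Counter(substitution_class(r, a) for r, a in substitutions)
    total = sum(counts.values())
    return [counts[c] / total for c in CLASSES]

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

# Hypothetical reference profiles over CLASSES (not taken from any real catalogue).
SIGNATURES = {
    "SIG_A": [0.05, 0.05, 0.70, 0.10, 0.05, 0.05],  # dominated by C>T
    "SIG_B": [0.05, 0.05, 0.10, 0.70, 0.05, 0.05],  # dominated by T>A
}

# Hypothetical substitutions observed in one tumor, as (ref, alt) pairs.
observed = [("C", "T"), ("G", "A"), ("C", "T"), ("T", "A")]
profile = spectrum(observed)
scores = {name: cosine(profile, sig) for name, sig in SIGNATURES.items()}
best = max(scores, key=scores.get)
print(profile, best)
```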