Small Variants

Small Variant Calling and Filtering

The DRAGEN TSO 500 ctDNA Analysis Software supports calling SNVs, indels, MNVs, and delins from cfDNA samples by using mapped and aligned DNA reads from a plasma sample as input.

Variants are detected via both column-wise pileup analysis and local de novo assembly of haplotypes. The de novo haplotypes allow the detection of much larger insertions and deletions than is possible through column-wise pileup analysis alone. Insertions and deletions called by the TSO 500 ctDNA analysis software have no size limit, but the level of performance testing differs by length; see the Performance Testing page for more details.

To call variants via local de novo assembly of haplotypes in active regions, haplotypes are first generated with a de Bruijn graph. The likelihood of a read supporting a haplotype is calculated using a pair hidden Markov model (pair HMM). The Somatic Score (SQ) is calculated as the joint posterior probability that a variant is present in the sample. For each variant candidate, background noise at the same site is taken into account using a systematic noise file. A p-value is calculated from the observed variant depth, the total depth, and the systematic noise using a binomial distribution, and is then converted to a variant Quality Score (AQ).
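The p-value step can be illustrated with a small sketch. It computes the upper binomial tail, that is, the probability of seeing at least the observed number of variant reads if every one of them were noise. The source says only that the p-value is "converted" to AQ; the Phred-style conversion (-10 x log10 of the p-value) used here is an assumption, as are the function names and the way the noise rate is passed in.

```python
import math


def log10_binomial_tail(alt_depth: int, total_depth: int, noise_rate: float) -> float:
    """Return log10 of P(X >= alt_depth) for X ~ Binomial(total_depth, noise_rate)."""
    if not 0.0 < noise_rate < 1.0:
        raise ValueError("noise_rate must be strictly between 0 and 1")
    if alt_depth > total_depth:
        raise ValueError("alt_depth cannot exceed total_depth")
    if alt_depth <= 0:
        return 0.0  # the tail covers every outcome, so the p-value is 1

    log_p = math.log(noise_rate)
    log_q = math.log1p(-noise_rate)
    log_terms = []
    for k in range(alt_depth, total_depth + 1):
        # log of C(total_depth, k), computed with lgamma to avoid huge integers
        log_choose = (
            math.lgamma(total_depth + 1)
            - math.lgamma(k + 1)
            - math.lgamma(total_depth - k + 1)
        )
        log_terms.append(log_choose + k * log_p + (total_depth - k) * log_q)

    # log-sum-exp keeps very small tail probabilities from underflowing to zero
    peak = max(log_terms)
    log_tail = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return min(0.0, log_tail / math.log(10.0))


def aq_score(alt_depth: int, total_depth: int, noise_rate: float) -> float:
    """Phred-style score for the noise p-value (ASSUMED conversion, not confirmed)."""
    return max(0.0, -10.0 * log10_binomial_tail(alt_depth, total_depth, noise_rate))
```

Under these assumptions, a site with a higher systematic noise rate needs more supporting reads at the same depth to reach a given AQ, which matches the penalizing behavior described for noisy regions.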

Variants are called if SQ >= 2 and AQ >= 20 for variants present in the Catalogue of Somatic Mutations in Cancer (COSMIC) with a count > 50 (hotspots), or if SQ >= 2 and AQ >= 60 for all remaining sites (nonhotspots).
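The calling rule above can be written as a short predicate. This is only an illustration of the stated thresholds; the function and constant names are hypothetical, and the hotspot cutoff follows the "count > 50" wording of the preceding paragraph.

```python
MIN_SQ = 2.0                 # somatic score threshold for all sites
MIN_AQ_HOTSPOT = 20.0        # AQ threshold for COSMIC hotspots
MIN_AQ_NONHOTSPOT = 60.0     # AQ threshold for all remaining sites
HOTSPOT_COSMIC_COUNT = 50    # a site is a hotspot when its COSMIC count exceeds this


def is_hotspot(cosmic_count: int) -> bool:
    """A hotspot is a variant present in COSMIC with count > 50."""
    return cosmic_count > HOTSPOT_COSMIC_COUNT


def passes_score_thresholds(sq: float, aq: float, cosmic_count: int = 0) -> bool:
    """Apply the SQ and AQ thresholds; other filters are not modeled here."""
    min_aq = MIN_AQ_HOTSPOT if is_hotspot(cosmic_count) else MIN_AQ_NONHOTSPOT
    return sq >= MIN_SQ and aq >= min_aq
```

For example, a candidate with SQ 2 and AQ 25 passes at a hotspot but fails at a nonhotspot site.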

In addition, DRAGEN uses the de novo assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotype. Any such co-phased variants that fall within a window of 15 bp are then reassembled into complex variants (MNVs and delins).
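The grouping idea can be sketched as follows. This is a simplified illustration, not the DRAGEN implementation: the record type, the `haplotype_id` field, and the interpretation of "within a window of 15 bp" (start position less than 15 bp from the first variant in the group) are all assumptions made for the example. It only groups the candidates and does not build the merged REF and ALT alleles.

```python
from typing import List, NamedTuple


class SmallVariant(NamedTuple):
    pos: int           # 1-based start position
    ref: str
    alt: str
    haplotype_id: int  # hypothetical label shared by co-phased variants


def group_cophased(variants: List[SmallVariant], window: int = 15) -> List[List[SmallVariant]]:
    """Group variants that share a haplotype and start within `window` bp."""
    groups: List[List[SmallVariant]] = []
    for variant in sorted(variants, key=lambda v: (v.haplotype_id, v.pos)):
        current = groups[-1] if groups else None
        if (
            current is not None
            and current[0].haplotype_id == variant.haplotype_id
            and variant.pos - current[0].pos < window
        ):
            current.append(variant)
        else:
            groups.append([variant])
    return groups
```

Groups with two or more members are the candidates for a single MNV or delin record; singletons remain ordinary small variants.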

The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.

DRAGEN small variant calling includes the following steps:

Detects regions with sufficient read coverage (callable regions).
Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).
Assembles de novo graph haplotypes from reads (haplotype assembly).
Extracts possible somatic or germline calls (events) from column-wise pileup analysis.
Calibrates read base qualities to account for sample-specific noise.
Computes read likelihoods for each read/haplotype pair.
Performs variant calling by summing the genotype probabilities across all read/haplotype pairs.
Performs additional filtering to improve variant calling accuracy (see Filter Status).
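To make the haplotype assembly step more concrete, here is a toy de Bruijn graph built from reads, followed by a path enumeration that spells out candidate haplotypes. It illustrates the general technique only. The k-mer size, the support threshold, and the function names are assumptions, and a production assembler handles cycles, pruning, and sequencing errors far more carefully.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Counter]:
    """Nodes are k-mers; an edge links consecutive k-mers and counts its support."""
    graph: Dict[str, Counter] = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]][read[i + 1:i + 1 + k]] += 1
    return graph


def enumerate_haplotypes(
    graph: Dict[str, Counter], source: str, sink: str, min_edge_support: int = 1
) -> List[str]:
    """Spell every simple path from the source k-mer to the sink k-mer."""
    haplotypes = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            # first k-mer in full, then the last base of each following k-mer
            haplotypes.append(path[0] + "".join(kmer[-1] for kmer in path[1:]))
            continue
        for nxt, support in graph.get(node, {}).items():
            if support >= min_edge_support and nxt not in path:
                stack.append((nxt, path + [nxt]))
    return haplotypes


reference = "ACGTGACCTAGT"
variant = "ACGTGATCTAGT"  # single-base change relative to the reference
graph = build_de_bruijn([reference, variant], k=4)
haplotypes = enumerate_haplotypes(graph, source="ACGT", sink="TAGT")
```

The graph branches where the two reads disagree and merges again afterwards, so `haplotypes` contains both the reference and the variant sequence.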

Systematic Noise File

The DRAGEN TSO 500 ctDNA Analysis Software uses a systematic noise file to improve variant calling accuracy. The file indicates the statistical probability of noise at specific positions in the genome. Illumina has constructed sequencer-specific noise files using 40-60 normal cfDNA libraries. Regions where noise is common (for example, difficult-to-map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.

Germline, Somatic and Clonal Hematopoiesis (CH) tagging

The Tumor Mutational Burden (TMB) module of the DRAGEN TSO 500 ctDNA Analysis Software predicts whether a small variant is of germline or somatic origin, as well as whether the variant is associated with clonal hematopoiesis (CH). The results are output in the TMB Trace TSV and Small Variant VCF files.

Please review the TMB algorithm page for more details.

Variant statuses (somatic, germline, or clonal hematopoiesis (CH) variant) are predictions intended for TMB calculation. Use caution when using them separately, as their performance has not been tested outside of the TMB algorithm.

Outputs

The DRAGEN TSO 500 ctDNA Analysis Software produces several files with small variant calling results, including:

Combined Variant Output File: {SampleID}_CombinedVariantOutput.tsv
Small Variant VCF: {SampleID}_hard-filtered.vcf
Small Variant Genome VCF: {SAMPLE_ID}_hard-filtered.gvcf.gz
Small Variant Annotated JSON: {SAMPLE_ID}_SmallVariants_Annotated.json.gz

Combined Variant Output File

File name: {SampleID}_CombinedVariantOutput.tsv

All variants with the FILTER field marked as PASS in the Small Variant Genome VCF are present in the Combined Variant Output.

Gene and transcript information is only present for variants belonging to canonical transcripts that are within the Gene Allow List–Small Variants.

The Combined Variant Output lists small variants with blank fields in the following situations:

The variant has been matched to a canonical RefSeq transcript on an overlapping gene not targeted by TruSight Oncology 500 ctDNA.
The variant is located in a region designated iSNP, indel, or Flanking in the TST500_Manifest.bed file located in the Resources folder.

Small Variant VCF

File name: {SampleID}_hard-filtered.vcf

The Small Variant VCF file outputs all small variant calling results.

MNVs and Phased Variants

The small variant file contains both phased variants and all other small variants. The header sections from both the phased variant (complex) VCF and the small variant VCF are included in this merged VCF. Variants found in both the phased variant and small variant results are displayed only as phased variants.

Germline Status

The Small Variant VCF file contains predicted germline, somatic, and clonal hematopoiesis (CH) variants that can be further filtered using GermlineStatus in the INFO field. See Germline, Somatic and Clonal Hematopoiesis (CH) tagging for more details.
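A minimal sketch of filtering on this INFO key is shown below. It relies only on the standard VCF layout (tab-separated columns, with INFO as the eighth column holding semicolon-separated key=value entries). The status values used in the usage note are placeholders, not confirmed values; check the header of the VCF for the actual ones.

```python
from typing import Dict, Iterable, Iterator, List, Set


def parse_info(info_field: str) -> Dict[str, object]:
    """Split a VCF INFO string into a dict; flag entries map to True."""
    info: Dict[str, object] = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def records_with_status(vcf_lines: Iterable[str], wanted: Set[str]) -> Iterator[List[str]]:
    """Yield the columns of records whose GermlineStatus is in `wanted`."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header and column-name lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if parse_info(fields[7]).get("GermlineStatus") in wanted:
            yield fields
```

For example, `records_with_status(open(path), {"Somatic"})` would keep only the records carrying that status, where "Somatic" is a placeholder value.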

Filter Status

Variants can be filtered using the tags assigned in the FILTER field, as described in the table below.

| Filter | Description |
| --- | --- |
| base_quality | Site filtered because median base quality of alt reads at this locus does not meet threshold |
| filtered_reads | Site filtered because too large a fraction of reads have been filtered out |
| fragment_length | Site filtered because absolute difference between the median fragment length of alt reads and median fragment length of ref reads at this locus exceeds threshold |
| low_depth | Site filtered because the read depth is too low (<1000) |
| low_frac_info_reads | Site filtered because the fraction of informative reads is below threshold (<0.5) |
| low_normal_depth | Site filtered because the normal sample read depth is too low |
| long_indel | Site filtered because the indel length is too long |
| mapping_quality | Site filtered because median mapping quality of alt reads at this locus does not meet threshold (<30) |
| multiallelic | Site filtered because more than two alt alleles pass tumor LOD |
| non_homref_normal | Site filtered because the normal sample genotype is not homozygous reference |
| no_reliable_supporting_read | Site filtered because no reliable supporting somatic read exists |
| panel_of_normals | Seen in at least one sample in the panel of normals VCF |
| read_position | Site filtered because median of distances between start/end of read and this locus is below threshold |
| RMxNRepeatRegion | Site filtered because all or part of the variant allele is a repeat of the reference |
| str_contraction | Site filtered due to suspected PCR error where the alt allele is one repeat unit less than the reference |
| too_few_supporting_reads | Site filtered because there are too few supporting reads in the tumor sample |
| weak_evidence | Somatic variant score does not meet threshold (SQ < 2) |
| systematic_noise | Site filtered based on evidence of systematic noise in normals. Candidate has low AQ score: AQ < 20 for variants with COSMIC count ≥ 50; AQ < 60 for all other sites |
| excluded_regions | Site overlaps with VC excluded regions BED² |

² This is a static list of regions compiled by Illumina. Email Illumina Technical Support for more information.
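When reviewing why calls were rejected, it can help to tally the FILTER tags across a VCF. The sketch below assumes only the standard VCF layout, where FILTER is the seventh tab-separated column and multiple tags are separated by semicolons; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def tally_filters(vcf_lines: Iterable[str]) -> Counter:
    """Count how often each FILTER tag (including PASS) appears."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header and column-name lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        for tag in fields[6].split(";"):
            counts[tag] += 1
    return counts
```

A record that fails several filters contributes to each of its tags, so the totals can exceed the number of records.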

Small Variant Genome VCF

File name: {SAMPLE_ID}_hard-filtered.gvcf.gz

The small variant genome VCF file includes the variant call status for all positions in all targeted intervals.

Small Variant Annotated JSON

File name: {SAMPLE_ID}_SmallVariants_Annotated.json.gz

The small variants annotated file provides variant annotation information for all non-reference positions in the VCF, including non-PASS variants. The variant consequence definitions are available on the Sequence Ontology website.

All pass variant calls are annotated using the Illumina Annotation Engine (IAE), also known as Nirvana, with the following information (using the RefSeq transcript):

- HGNC Gene
- Transcript
- Exon
- Consequence
- c.HGVS
- p.HGVS
- COSMIC
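The fields listed above can be pulled out of the annotated JSON with a short script. The key names used here (`positions`, `variants`, `transcripts`, `hgnc`, `transcript`, `consequence`, `hgvsc`, `hgvsp`) are assumptions based on the publicly documented Nirvana JSON layout and should be checked against an actual output file. The sketch loads the whole file into memory, which may not suit very large files.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_transcript_annotations(json_gz_path: str) -> Iterator[Dict[str, object]]:
    """Yield one flat record per annotated transcript (key names are ASSUMED)."""
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    for position in data.get("positions", []):
        for variant in position.get("variants", []):
            for transcript in variant.get("transcripts", []):
                yield {
                    "chromosome": position.get("chromosome"),
                    "position": position.get("position"),
                    "gene": transcript.get("hgnc"),
                    "transcript": transcript.get("transcript"),
                    "consequence": transcript.get("consequence", []),
                    "hgvsc": transcript.get("hgvsc"),
                    "hgvsp": transcript.get("hgvsp"),
                }
```

Missing keys yield `None` (or an empty list for the consequence) rather than raising, so records with partial annotation are still returned.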

