Small Variants

Small Variant Calling and Filtering

The DRAGEN TSO 500 ctDNA Analysis Software supports calling SNVs, indels, MNVs, and delins from cfDNA samples by using mapped and aligned DNA reads from a plasma sample as input.

Variants are detected via both column-wise pileup analysis and local de novo assembly of haplotypes. The de novo haplotypes allow the detection of much larger insertions and deletions than is possible through column-wise pileup analysis alone. Insertions and deletions called by the DRAGEN TSO 500 ctDNA Analysis Software have no size limit, but the level of performance testing differs by length. See the Performance Testing page for more details.

To call variants via local de novo assembly of haplotypes in active regions, haplotypes are first generated with a de Bruijn graph. The likelihood of a read supporting a haplotype is calculated using a Paired Hidden Markov Model. The Somatic Score (SQ) is calculated as the joint posterior probability that a variant is present in the sample. For each variant candidate, background noise at the same site is taken into account using a systematic noise file. A p-value is calculated from the observed variant depth, the total depth, and the systematic noise using a binomial distribution, and is then converted to a variant Quality Score (AQ).
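The AQ calculation can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this document: the p-value is taken as the one-sided upper tail of the binomial distribution, the conversion to AQ is a Phred-style -10*log10(p), and the function names are hypothetical. The actual DRAGEN computation may differ.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def noise_quality(alt_depth: int, total_depth: int, noise_rate: float) -> float:
    """Phred-scaled upper-tail p-value: -10*log10 P(X >= alt_depth).

    Assumption: X ~ Binomial(total_depth, noise_rate), where noise_rate is
    the per-site value taken from a systematic noise file.
    """
    if alt_depth > total_depth:
        raise ValueError("alt_depth cannot exceed total_depth")
    if not 0.0 < noise_rate < 1.0:
        raise ValueError("noise_rate must be strictly between 0 and 1")
    if alt_depth <= 0:
        return 0.0  # P(X >= 0) is 1, so the score is 0
    # Sum the tail in log space so deep coverage does not underflow.
    logs = [log_binom_pmf(k, total_depth, noise_rate)
            for k in range(alt_depth, total_depth + 1)]
    peak = max(logs)
    log_tail = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return max(0.0, -10.0 * log_tail / math.log(10.0))


if __name__ == "__main__":
    # 12 variant reads out of 2500 at a site with a 0.1% noise rate.
    print(round(noise_quality(12, 2500, 0.001), 1))
```

Working in log space keeps the score finite at the high depths typical of cfDNA libraries, where the raw p-value could underflow to zero.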

A variant is called if SQ >= 2 and AQ meets a site-dependent threshold: AQ >= 20 for variants present in the Catalogue of Somatic Mutations in Cancer (COSMIC) with a count > 50 (hotspots), or AQ >= 60 for all remaining sites (nonhotspots).
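The calling rule can be written as a short Python sketch. The thresholds come from the text above. The function and constant names are hypothetical and are not part of the DRAGEN software.

```python
HOTSPOT_COSMIC_COUNT = 50   # COSMIC count above which a site is a hotspot
MIN_SQ = 2.0                # minimum Somatic Score for any call
MIN_AQ_HOTSPOT = 20.0       # minimum AQ at hotspot sites
MIN_AQ_NONHOTSPOT = 60.0    # minimum AQ at all other sites


def is_hotspot(cosmic_count: int) -> bool:
    """A site is a hotspot when its COSMIC count is strictly above 50."""
    return cosmic_count > HOTSPOT_COSMIC_COUNT


def is_called(sq: float, aq: float, cosmic_count: int = 0) -> bool:
    """Apply the SQ threshold and the site-dependent AQ threshold."""
    min_aq = MIN_AQ_HOTSPOT if is_hotspot(cosmic_count) else MIN_AQ_NONHOTSPOT
    return sq >= MIN_SQ and aq >= min_aq
```

For example, a candidate with SQ 2 and AQ 20 is called at a site with a COSMIC count of 51 but not at a site with a count of 50, because the hotspot condition is a strict inequality.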

In addition, DRAGEN uses the de novo assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotypes. Any such co-phased variants that are within a window of 15 bp are then reassembled into complex variants (MNVs and delins).
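The grouping of co-phased variants can be sketched in Python as follows. This is only an illustration under stated assumptions: the `PhasedVariant` type and its `haplotype` identifier are hypothetical, and the 15 bp window is interpreted here as the distance between start positions measured from the first variant in a group. The document does not specify how DRAGEN measures the window or reassembles the alleles, so the sketch stops at grouping.

```python
from typing import List, NamedTuple


class PhasedVariant(NamedTuple):
    pos: int        # 1-based start position
    ref: str
    alt: str
    haplotype: int  # hypothetical identifier of the supporting haplotype


WINDOW_BP = 15  # window size stated in the text


def group_cophased(variants: List[PhasedVariant],
                   window: int = WINDOW_BP) -> List[List[PhasedVariant]]:
    """Group variants that share a haplotype and start within `window` bp
    of the first variant in the group. Single-variant groups are dropped
    because they have nothing to merge with."""
    groups: List[List[PhasedVariant]] = []
    for v in sorted(variants, key=lambda x: (x.haplotype, x.pos)):
        last = groups[-1] if groups else None
        if (last is not None
                and last[0].haplotype == v.haplotype
                and v.pos - last[0].pos < window):
            last.append(v)
        else:
            groups.append([v])
    return [g for g in groups if len(g) > 1]
```

Each returned group is a candidate for reassembly into a single MNV or delin record.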

The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.

DRAGEN small variant calling includes the following steps:

  1. Detects regions with sufficient read coverage (callable regions).

  2. Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).

  3. Assembles de novo graph haplotypes from reads (haplotype assembly).

  4. Extracts possible somatic or germline calls (events) from column-wise pileup analysis.

  5. Calibrates read base qualities to account for sample-specific noise.

  6. Computes read likelihoods for each read/haplotype pair.

  7. Performs variant calling by summing the genotype probabilities across all read/haplotype pairs.

  8. Performs additional filtering to improve variant calling accuracy (see Filter Status).

Systematic Noise File

The DRAGEN TSO 500 ctDNA Analysis Software uses a systematic noise file to improve variant calling accuracy. The file indicates the statistical probability of noise at specific positions in the genome. Illumina constructed the noise file using 60 normal cfDNA libraries. Regions where noise is common (for example, difficult-to-map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.

The systematic noise file accounts for site-specific noise by estimating the average allele frequency over multiple normal samples.
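The idea of averaging allele frequency across normal samples can be sketched in Python. The input layout, the function name, and the choice to count a sample with no observation at a site as an allele frequency of 0 are all assumptions made for illustration. The format of the Illumina noise file and the exact estimator are not described in this document.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Site = Tuple[str, int, str]      # (chromosome, position, alt allele)
Counts = Tuple[int, int]         # (alt depth, total depth)


def build_noise_table(normal_samples: List[Dict[Site, Counts]]) -> Dict[Site, float]:
    """Return the mean alt allele frequency per site over all normal samples.

    Assumption: a sample that does not report a site contributes an
    allele frequency of 0 to that site's average.
    """
    n_samples = len(normal_samples)
    af_sums: Dict[Site, float] = defaultdict(float)
    for sample in normal_samples:
        for site, (alt_depth, total_depth) in sample.items():
            if total_depth > 0:  # skip sites without coverage
                af_sums[site] += alt_depth / total_depth
    return {site: total / n_samples for site, total in af_sums.items()}
```

A site that shows a low-level alternate allele in many normal libraries ends up with a higher average, which is the kind of position the caller would penalize.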

Germline, Somatic, and Clonal Hematopoiesis (CH) Tagging

The Tumor Mutational Burden (TMB) module of the DRAGEN TSO 500 ctDNA Analysis Software predicts whether a small variant is of germline or somatic origin, as well as whether the variant is associated with Clonal Hematopoiesis (CH). The results are output in the TMB Trace TSV and Small Variant VCF files.

Please review the TMB algorithm page for more details.

Outputs

The DRAGEN TSO 500 ctDNA Analysis Software produces several files with small variant calling results, including:

  • Combined Variant Output File, {SampleID}_CombinedVariantOutput.tsv

  • Small Variant VCF, {SampleID}_hard-filtered.vcf

  • Small Variant Genome VCF, {SAMPLE_ID}_hard-filtered.gvcf.gz

  • Small Variant Annotated JSON, {SAMPLE_ID}_SmallVariants_Annotated.json.gz

Combined Variant Output File

File name: {SampleID}_CombinedVariantOutput.tsv

All variants with the FILTER field marked as PASS in the Small Variant Genome VCF are present in the Combined Variant Output file.

  • Gene information is only present for variants belonging to canonical transcripts that are within the Gene Allow List–Small Variants.

  • Transcript information is only present for variants belonging to canonical transcripts that are within the Gene Allow List–Small Variants.

The Combined Variant Output file reports small variants with blank fields in the following situations:

  • The variant has been matched to a canonical RefSeq transcript on an overlapping gene not targeted by TruSight Oncology 500 ctDNA.

  • The variant is located in a region designated iSNP, indel, or Flanking in the TST500_Manifest.bed file located in the Resources folder.

Small Variant VCF

File name: {SampleID}_hard-filtered.vcf

The Small Variant VCF file outputs all small variant calling results.

MNVs and Phased Variants

The Small Variant VCF file contains both phased variants and all other small variants. The header sections from both the phased variant (complex) VCF and the small variant VCF are included in this merged VCF. Variants that are found as both phased variants and small variants are displayed only as phased variants.

Germline Status

The Small Variant VCF file contains predicted germline, somatic, and clonal hematopoiesis (CH) variants that can be further filtered using GermlineStatus in the INFO field. See this section for more details.
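Filtering on GermlineStatus can be sketched in Python with plain text parsing of the standard tab-delimited VCF columns. The function names are hypothetical, and the status value passed in the usage example is a placeholder: this document does not list the values that GermlineStatus can take, so check the VCF header before relying on any specific string.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO column into a dict. Flag entries map to True."""
    fields: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def select_records(vcf_lines: Iterable[str],
                   germline_status: Optional[str] = None,
                   pass_only: bool = True) -> Iterator[List[str]]:
    """Yield the columns of records that match the requested criteria."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        if pass_only and cols[6] != "PASS":  # column 7 is FILTER
            continue
        if germline_status is not None:
            # Column 8 is INFO.
            if parse_info(cols[7]).get("GermlineStatus") != germline_status:
                continue
        yield cols
```

A possible usage, with "Somatic" as a placeholder value:

```python
with open("Sample1_hard-filtered.vcf") as handle:  # hypothetical file name
    for cols in select_records(handle, germline_status="Somatic"):
        print(cols[0], cols[1], cols[3], cols[4])
```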

Filter Status

Variants can be filtered using the tags assigned in the FILTER field, as described in the following table.

| ALT | FILTER | Note |
| --- | --- | --- |
| . | PASS | WT. |
| ., A, C, G, etc.¹ | low_depth | Reference positions and non-passing variants with coverage below 1000X. For variant calls, low_depth is not applied when a position has a PASS filter. |
| A, C, G, etc.¹ | PASS | PASS variants. |
| A, C, G, etc.¹ | weak_evidence | Filtered variant candidate with low SQ score (< 2). |
| A, C, G, etc.¹ | excluded_regions² | Position with high background noise. Not available for variant detection. |
| A, C, G, etc.¹ | systematic_noise | Filtered variant candidate with low AQ score (< 20 for hotspots, < 60 for nonhotspots). |
| A, C, G, etc.¹ | mapping_quality | Filtered variant candidate with low median mapping quality (< 30). |
| A, C, G, etc.¹ | read_position | Filtered variant candidate showed bias clustered at fragment ends. |
| A, C, G, etc.¹ | multiallelic | Filtered if there are two or more ALT alleles at this location. |
| A, C, G, etc.¹ | low_frac_info_reads | Filtered if the fraction of informative reads is low (< 0.5). |

¹ Etc. refers to other variant types not mentioned in the table.

² This is a static list of regions compiled by Illumina. Email Illumina Technical Support for more information.

Small Variant Genome VCF

File name: {SAMPLE_ID}_hard-filtered.gvcf.gz

The small variant genome VCF file includes the variant call status for all positions in all targeted intervals.

Small Variant Annotated JSON

File name: {SAMPLE_ID}_SmallVariants_Annotated.json.gz

The small variants annotated file provides variant annotation information for all non-reference positions in the VCF, which includes non-pass variants. The variant consequence definition is available on the Sequence Ontology website.

All pass variant calls are annotated using the Illumina Annotation Engine (IAE), also known as Nirvana, with the following information (using the RefSeq transcript):

  • HGNC Gene

    • Transcript

    • Exon

    • Consequence

    • c.HGVS

    • p.HGVS

  • COSMIC
