# Small Variants

## Small Variant Calling and Filtering

The DRAGEN TSO 500 ctDNA Analysis Software supports calling SNVs, indels, MNVs, and delins from cfDNA samples by using mapped and aligned DNA reads from a plasma sample as input.

Variants are detected through both column-wise pileup analysis and local *de novo* assembly of haplotypes. The *de novo* haplotypes allow the detection of much larger insertions and deletions than is possible through column-wise pileup analysis alone. Insertions and deletions called by the DRAGEN TSO 500 ctDNA Analysis Software have no size limitation, but the level of performance testing differs by length. See the [Performance Testing page](https://help.tso500software.illumina.com/performance-testing) for more details.

To call variants through local *de novo* assembly of haplotypes in active regions, haplotypes are first generated with a de Bruijn graph. The likelihood of a read supporting a haplotype is calculated using a pair hidden Markov model (pair-HMM). The **Somatic Score (SQ)** is calculated as the joint posterior probability that a variant is present in the sample. For each variant candidate, background noise at the same site is taken into account using a [systematic noise file](#systematic-noise-file). A p-value is calculated from the observed variant depth, the total depth, and the systematic noise using a binomial distribution, and is then converted to a variant **Quality Score (AQ)**.
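The AQ calculation described above can be illustrated with a minimal sketch. It assumes a one-sided binomial tail test (the probability of seeing at least the observed variant depth from noise alone) and a Phred-style conversion, `-10 * log10(p)`. Neither the tail definition nor the conversion is specified in this guide, and the function names are hypothetical, so treat this as an illustration of the idea and not as the DRAGEN implementation.

```python
import math


def log_binomial_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def noise_log_pvalue(alt_depth, total_depth, noise_rate):
    """Natural log of P(X >= alt_depth) if every alt read were noise.

    Assumption: a one-sided upper-tail test. Terms are summed in log space
    so that deep coverage does not overflow or underflow.
    """
    if not 0 <= alt_depth <= total_depth:
        raise ValueError("alt_depth must be between 0 and total_depth")
    logs = [log_binomial_pmf(k, total_depth, noise_rate)
            for k in range(alt_depth, total_depth + 1)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def phred_quality(alt_depth, total_depth, noise_rate):
    """Phred-style score from the p-value (assumed conversion)."""
    log_p = noise_log_pvalue(alt_depth, total_depth, noise_rate)
    return max(0.0, -10.0 * log_p / math.log(10))
```

Under these assumptions, a site with a higher systematic noise rate needs more supporting reads to reach the same score, which matches the penalizing behavior described for noisy regions.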

Variants are called if SQ >= 2 and AQ >= 20 for variants present in the Catalogue of Somatic Mutations in Cancer (COSMIC) with count > 50 (hotspots), or if SQ >= 2 and AQ >= 60 for all remaining sites (nonhotspots).
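The calling thresholds can be expressed as a small decision function. This is a sketch of the rule as stated in the paragraph above. The function and constant names are hypothetical, and the hotspot test uses `count > 50` as written there.

```python
MIN_SQ = 2                  # minimum Somatic Score for any call
MIN_AQ_HOTSPOT = 20         # AQ threshold for COSMIC hotspots
MIN_AQ_NONHOTSPOT = 60      # AQ threshold for all other sites
HOTSPOT_COSMIC_COUNT = 50   # COSMIC count above which a site is a hotspot


def passes_score_thresholds(sq, aq, cosmic_count):
    """Return True if a candidate meets the SQ and AQ thresholds."""
    is_hotspot = cosmic_count > HOTSPOT_COSMIC_COUNT
    min_aq = MIN_AQ_HOTSPOT if is_hotspot else MIN_AQ_NONHOTSPOT
    return sq >= MIN_SQ and aq >= min_aq
```

For example, a candidate with AQ 25 would meet the thresholds at a hotspot but not at a nonhotspot site, provided its SQ is at least 2.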

In addition, DRAGEN uses the *de novo* assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotype. Any such co-phased variants that fall within a 15 bp window are then reassembled into complex variants (MNVs and delins).
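To make the windowing idea concrete, the sketch below groups variants that share a haplotype and lie close together. It is illustrative only. The tuple layout, the `haplotype_id` label, and the interpretation of the window as "at most 15 bp between consecutive variant start positions" are all assumptions, and the actual reassembly into MNV or delins alleles is not shown.

```python
from collections import defaultdict

WINDOW_BP = 15  # window size stated in the text


def group_cophased(variants, window=WINDOW_BP):
    """Group variants on the same haplotype whose starts are within `window`.

    `variants` is an iterable of (pos, ref, alt, haplotype_id) tuples
    (hypothetical layout). Returns a list of groups; a group with more
    than one member is a candidate complex variant.
    """
    by_haplotype = defaultdict(list)
    for variant in variants:
        by_haplotype[variant[3]].append(variant)

    groups = []
    for members in by_haplotype.values():
        members.sort(key=lambda v: v[0])
        current = [members[0]]
        for variant in members[1:]:
            if variant[0] - current[-1][0] <= window:
                current.append(variant)   # close enough: same group
            else:
                groups.append(current)    # gap too large: start a new group
                current = [variant]
        groups.append(current)
    return groups
```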

The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.

DRAGEN small variant calling includes the following steps:

1. Detects regions with sufficient read coverage (callable regions).
2. Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).
3. Assembles *de novo* graph haplotypes from reads (haplotype assembly).
4. Extracts possible somatic or germline calls (events) from column-wise pileup analysis.
5. Calibrates read base qualities to account for sample-specific noise.
6. Computes read likelihoods for each read/haplotype pair.
7. Performs variant calling by summing the genotype probabilities across all read/haplotype pairs.
8. Performs additional filtering to improve variant calling accuracy (see [Filter Status](#filter-status)).

## Systematic Noise File

The DRAGEN TSO 500 ctDNA Analysis Software uses a systematic noise file to improve variant calling accuracy. The file indicates the statistical probability of noise at specific positions in the genome. Illumina has constructed sequencer-specific noise files using 40 to 60 normal cfDNA libraries. Regions where noise is common (for example, difficult-to-map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.

<figure><img src="https://3845108255-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F7XRWgkRPkhoHXVslBqXD%2Fuploads%2Fgit-blob-7d05b5d25664021743ea783b2f863f483c71151c%2Fimage%20(10).png?alt=media" alt=""><figcaption><p>Systematic noise file accounts for site specific noise by estimating average allele frequency<br>over multiple normal samples</p></figcaption></figure>
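As a rough illustration of the figure, a per-site noise value can be estimated by averaging the alternate allele frequency observed at that site across normal samples. This is a simplified sketch with hypothetical names and inputs. The actual noise file construction may use a different estimator.

```python
def site_noise(observations):
    """Mean alt allele frequency at one site across normal samples.

    `observations` is a list of (alt_depth, total_depth) pairs, one per
    normal sample (hypothetical input layout). Samples with no coverage
    at the site are skipped.
    """
    frequencies = [alt / depth for alt, depth in observations if depth > 0]
    if not frequencies:
        return 0.0
    return sum(frequencies) / len(frequencies)
```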

## Germline, Somatic and Clonal Hematopoiesis (CH) tagging

The Tumor Mutational Burden (TMB) module of the DRAGEN TSO 500 ctDNA Analysis Software predicts whether a small variant is of germline or somatic origin, as well as whether the variant is associated with Clonal Hematopoiesis (CH). The results are output in the TMB Trace TSV and [Small Variant VCF files](#small-variant-vcf).

Please review the [TMB algorithm page](https://help.tso500software.illumina.com/dragen-tso-500-ctdna-guides/dragen-tso-500-ctdna-v2.6/tmb#id-3.-germline-variant-identification) for more details.

{% hint style="danger" %}
Variant statuses (somatic, germline, clonal hematopoiesis (CH) variant) are predictions intended for TMB calculation. Use caution if using them separately as their performance has not been tested outside of the TMB algorithm.
{% endhint %}

## Outputs

The DRAGEN TSO 500 ctDNA Analysis Software produces several files with small variant calling results, including:

* Combined Variant Output File, `{SampleID}_CombinedVariantOutput.tsv`
* Small Variant VCF, `{SampleID}_hard-filtered.vcf`
* Small Variant Genome VCF, `{SAMPLE_ID}_hard-filtered.gvcf.gz`
* Small Variant Annotated JSON, `{SAMPLE_ID}_SmallVariants_Annotated.json.gz`

### Combined Variant Output File

File name: `{SampleID}_CombinedVariantOutput.tsv`

All variants with the FILTER field marked as PASS in the Small Variant Genome VCF are present in the Combined Variant Output file.

* Gene information is only present for variants belonging to canonical transcripts that are within the Gene Allow List–Small Variants.
* Transcript information is only present for variants belonging to canonical transcripts that are within the Gene Allow List–Small Variants.

The Combined Variant Output file reports small variants with blank fields in the following situations:

* The variant has been matched to a canonical RefSeq transcript on an overlapping gene not targeted by TruSight Oncology 500 ctDNA.
* The variant is located in a region designated iSNP, indel, or Flanking in the `TST500_Manifest.bed` file located in the Resources folder.

### Small Variant VCF

File name: `{SampleID}_hard-filtered.vcf`

The Small Variant VCF file outputs all small variant calling results.

#### MNVs and Phased Variants

The Small Variant VCF file contains both phased variants and all other small variants. The header sections from both the phased variant (complex) VCF and the small variant VCF are included in this merged VCF. Variants that are called both as phased variants and as small variants are only displayed as phased variants.

#### Germline Status

The Small Variant VCF file contains predicted germline, somatic, and clonal hematopoiesis (CH) variants, which can be filtered further using *GermlineStatus* in the INFO field. See [this section](#germline-somatic-and-clonal-hematopoiesis-ch-tagging) for more details.
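One way to read the *GermlineStatus* tag is to parse the INFO column of each VCF record, as in the sketch below. It relies only on the standard tab-separated VCF layout, in which INFO is the eighth column and holds semicolon-separated `key=value` entries. The helper names are hypothetical, and the values the tag can take are not listed in this section, so no specific value is assumed.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, separator, value = item.partition("=")
        fields[key] = value if separator else True
    return fields


def germline_status(vcf_line):
    """Return the GermlineStatus value of a VCF record, or None if absent."""
    columns = vcf_line.rstrip("\n").split("\t")
    return parse_info(columns[7]).get("GermlineStatus")
```

Records can then be kept or discarded by comparing the returned value with the status of interest.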

#### Filter Status

Variants can be filtered using the tags assigned in the FILTER field, as described in the following table.

<table><thead><tr><th width="250.1796875">Filter</th><th width="547.9453125">Description</th></tr></thead><tbody><tr><td>base_quality</td><td>Site filtered because median base quality of alt reads at this locus does not meet threshold</td></tr><tr><td>filtered_reads</td><td>Site filtered because too large a fraction of reads have been filtered out</td></tr><tr><td>fragment_length</td><td>Site filtered because absolute difference between the median fragment length of alt reads and median fragment length of ref reads at this locus exceeds threshold</td></tr><tr><td>low_depth</td><td>Site filtered because the read depth is too low (&#x3C;1000)</td></tr><tr><td>low_frac_info_reads</td><td>Site filtered because the fraction of informative reads is below threshold (&#x3C;0.5)</td></tr><tr><td>low_normal_depth</td><td>Site filtered because the normal sample read depth is too low</td></tr><tr><td>long_indel</td><td>Site filtered because the indel length is too long</td></tr><tr><td>mapping_quality</td><td>Site filtered because median mapping quality of alt reads at this locus does not meet threshold (&#x3C;30)</td></tr><tr><td>multiallelic</td><td>Site filtered because more than two alt alleles pass tumor LOD</td></tr><tr><td>non_homref_normal</td><td>Site filtered because the normal sample genotype is not homozygous reference</td></tr><tr><td>no_reliable_supporting_read</td><td>Site filtered because no reliable supporting somatic read exists</td></tr><tr><td>panel_of_normals</td><td>Seen in at least one sample in the panel of normals vcf</td></tr><tr><td>read_position</td><td>Site filtered because median of distances between start/end of read and this locus is below threshold</td></tr><tr><td>RMxNRepeatRegion</td><td>Site filtered because all or part of the variant allele is a repeat of the reference</td></tr><tr><td>str_contraction</td><td>Site filtered due to suspected PCR error where the alt allele is one repeat unit less than the reference</td></tr><tr><td>too_few_supporting_reads</td><td>Site filtered because there are too few supporting reads in the tumor sample</td></tr><tr><td>weak_evidence</td><td>Somatic variant score does not meet threshold (SQ &#x3C; 2)</td></tr><tr><td>systematic_noise</td><td>Site filtered based on evidence of systematic noise in normals. Candidate has low AQ score:<br>AQ &#x3C; 20 for variants with COSMIC count ≥ 50<br>AQ &#x3C; 60 for all other sites</td></tr><tr><td>excluded_regions</td><td>Site overlaps with vc excluded regions bed<sup>2</sup></td></tr></tbody></table>

<sup>2</sup> This is a static list of regions compiled by Illumina. Email Illumina Technical Support for more information.
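To see which filters affect a sample most, the FILTER column can be tallied across records. The sketch below assumes only the standard VCF layout, where FILTER is the seventh tab-separated column and multiple filter tags on one record are separated by semicolons. The function name is hypothetical.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Count how often each FILTER tag (including PASS) appears."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        filter_field = line.rstrip("\n").split("\t")[6]
        counts.update(filter_field.split(";"))
    return counts
```

Passing an open file handle for an uncompressed `{SampleID}_hard-filtered.vcf` works, because the function only iterates over lines.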

### Small Variant Genome VCF

File name: `{SAMPLE_ID}_hard-filtered.gvcf.gz`

The small variant genome VCF file includes the variant call status for all positions in all targeted intervals.

### Small Variant Annotated JSON

File name: `{SAMPLE_ID}_SmallVariants_Annotated.json.gz`

The Small Variant Annotated JSON file provides variant annotation information for all non-reference positions in the VCF, including non-PASS variants. The variant consequence definition is available on the [Sequence Ontology website](http://www.sequenceontology.org/).

All PASS variant calls are annotated using the Illumina Annotation Engine (IAE), also known as Nirvana, with the following information (using the RefSeq transcript):

* HGNC Gene
  * Transcript
  * Exon
  * Consequence
  * c.HGVS
  * p.HGVS
* COSMIC
