The basic variant calling workflow

This section lists the most basic commands needed to perform variant calling. It is therefore targeted at people who:

  • are not interested in using OVarFlow as a full-fledged workflow for variant calling, but are interested in the commands involved,

  • want to get a basic understanding of what’s going on under the hood of this workflow but get confused by the syntax of Snakemake and the Snakefile used,

  • are novice users of GATK, as the GATK Best Practices can be confusing in the beginning.

This section lists the major shell commands, in their most basic form, that are needed to perform variant calling with GATK. Essentially this will give you a very general overview of variant calling using GATK 4. It is highly recommended to also have at least a look at the section covering “GATK particularities”. That section covers some of GATK’s very subtle usage issues, which can have considerable effects on runtime and resource usage; those issues won’t be covered here.

To make use of the description provided here, a basic understanding of the Unix shell bash (or similar) is required.

Overview of the workflow

The sections below outline the details of the exact variant calling workflow. This section gives a schematic and rather rough summary of the process. One might distinguish three phases:

  • Data pre-processing

    1. fastqc: quality control of reads (optional but recommended)

    2. bwa mem: mapping to the reference genome

    3. gatk SortSam: sorting of the reads

    4. gatk MarkDuplicates: marking of duplicated reads

  • Preparation of variants

    1. gatk HaplotypeCaller: actual variant calling

    2. gatk GatherVcfs: pooling of intervals (optional)

    3. gatk CombineGVCFs: pooling of called individuals

    4. gatk GenotypeGVCFs: genotyping of the called variants

    5. gatk SelectVariants: separating of SNPs and indels

    6. gatk VariantFiltration: hard filtering of SNPs and indels

    7. gatk SortVcf: merging of SNPs and indels

    8. gatk SelectVariants: removal of filtered variants

  • Variant annotation

    1. snpEff: variant annotation

Preparing the workflow

For the variant calling workflow outlined below, several files and databases have to be prepared first. This mostly concerns the preprocessing of reference genomes and annotations. It is advisable to perform those steps before the actual data evaluation, as the workflow might fail if one database cannot be created. This is especially annoying if it happens in the last step, which requires a database for snpEff (used for the functional annotation of detected variants).

Compression of the reference

To save some disk space, reference genomes should usually be compressed. This is often accomplished via gzip. The drawback of this program is that gzip-compressed files cannot be indexed easily. Therefore the reference genome should be compressed with bgzip, which is very similar to gzip but supports indexed access.

gunzip --to-stdout <ref_genome.gz> | bgzip > <ref_genome_recompressed.gz>

Indexing the reference genome

samtools faidx <ref_genome_recompressed.gz>

Creating a bwa index

bwa index <ref_genome_recompressed.gz>

A dictionary for GATK

gatk CreateSequenceDictionary -R <ref_genome_recompressed.gz>

A database for snpEff

From personal experience it is advisable to use an annotation from RefSeq in GFF format. Processing of those files with snpEff has been unproblematic so far. GTF-formatted files, as available from Ensembl, might not be parsable by snpEff. When in doubt, try creating the snpEff database first.

For snpEff to read the annotation, it must not be compressed and has to be named genes.gff. Furthermore, the reference has to be copied to the same directory and named sequences.fa.gz. Compression of the reference sequence is fine. The directory to which the files are copied should be named after the reference genome you’re using (_name_), but without any file-specific extension.

gunzip --to-stdout <ref_anno.gz> > <snpEffDB/_name_/genes.gff>
cp <ref_sequence.gz> <snpEffDB/_name_/sequences.fa.gz>
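
If the target directory does not exist yet, it can be created first, e.g.:

mkdir -p snpEffDB/<_name_>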

snpEff can create global databases, which will reside in your home directory by default. It is advisable to create an annotation database in your local project directory.

snpEff -Xmx12g build -dataDir ${PWD}/snpEffDB \
  -configOption <_name_>.genome=<_name_> \
  -gff3 -v <_name_>

Usage of snpEff can be daunting in the beginning. Fortunately the online documentation is quite comprehensive.

The actual variant calling

For this example procedure it is assumed that, for each individual to be analyzed, all reads are contained in only one fastq file containing the forward reads (R1) and one file containing the reverse reads (R2). If there are several files for forward or reverse reads, those have to be merged in advance, as shown below.
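
Merging can, for instance, be achieved by simple concatenation, since gzip streams may be appended to one another (the lane-wise file names here are purely illustrative):

cat <reads_R1_lane1.fastq.gz> <reads_R1_lane2.fastq.gz> > <forward_reads.fastq.gz>
cat <reads_R2_lane1.fastq.gz> <reads_R2_lane2.fastq.gz> > <reverse_reads.fastq.gz>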

Quality control via FastQC

fastqc -o <output dir> -f fastq <input file.fastq.gz>

Mapping of fastq files

When working with GATK, bwa mem is probably the most widely used mapper. Of course, the index database (idxbase) has to be created in the first place (see above). Directly piping into samtools will produce compressed bam files. Writing stderr to separate files will preserve any potential error or log messages.

bwa mem -M -t <number_of_threads> -R <read_group_tag> <idxbase> \
  <forward_reads.fastq> <reverse_reads.fastq> 2> <bwa_log_message> | \
  samtools view -b /dev/fd/0 -o <output_mapping.bam> 2> <samtools_log_message>
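
As an illustration, the read group tag might look like the following; the ID, sample and library names are placeholders to be adapted (see the bwa and GATK documentation for the required fields):

-R '@RG\tID:<sample_id>\tSM:<sample_id>\tLB:<library_id>\tPL:ILLUMINA'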

Sorting of mapped bam files

Here, as in the subsequent steps, the output mapping of the previous command acts as the input of the current step.

gatk SortSam -I <input_mapping.bam> -SO coordinate -O <output_mapping.bam>

Marking of duplicated reads

gatk MarkDuplicates -I <input_mapping.bam> -O <output_mapping.bam> \
  -M <metrics.txt>

Creating an index of marked mappings

samtools index <output_mapping.bam>

Variant calling with HaplotypeCaller

Optionally, OVarFlow is capable of calling variants on several intervals per individual. By executing the HaplotypeCaller on intervals, a high degree of parallelization is achieved and analysis times are reduced considerably.

When a *.gz suffix is applied to the output genomic variant call format file (GVCF; file ending .g.vcf), the resulting file will automatically be compressed.

gatk HaplotypeCaller -ERC GVCF -I <input_mapping.bam> -R <reference_genome> \
  -O <interval_xy.g.vcf.gz>
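
When calling on intervals, the genomic region can, for instance, be restricted via the -L option (the interval file given here is hypothetical):

gatk HaplotypeCaller -ERC GVCF -I <input_mapping.bam> -R <reference_genome> \
  -L <interval_xy.list> -O <interval_xy.g.vcf.gz>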

Gathering of intervals per individual

If variant calling was performed in parallel on intervals, the resulting interval files have to be combined.

gatk GatherVcfs -O <individual_xy.g.vcf.gz> -I <interval_1.g.vcf.gz> \
  -I <interval_2.g.vcf.gz> ... -I <interval_n.g.vcf.gz>

This step could also be performed by CombineGVCFs, but GatherVcfs is considerably quicker. A downside of GatherVcfs is that it can only handle intervals that preserve the initial order of contigs. This issue is automatically taken care of in the workflow.

Combining the individual variant files

For every processed individual a single file is created. Those files have to be combined.

gatk CombineGVCFs -O <combined_variants.g.vcf.gz> -R <reference_genome> \
  -V <individual_1.g.vcf.gz> -V <individual_2.g.vcf.gz> ... -V <individual_n.g.vcf.gz>

Processing the Genotype

Even though the HaplotypeCaller writes genotype information into the initial GVCF file, this information is lost when the individual GVCF files are merged. Performing joint genotyping over several individuals improves the genotyping accuracy and restores the genotype information.

gatk GenotypeGVCFs -R <reference_genome> -V <input.g.vcf.gz> -O <output.vcf.gz>

Quality filtering of variants

HaplotypeCaller will also create some false positive variant calls. GATK offers two approaches to reduce the number of false positives.

  • Variant quality score recalibration (VQSR): This approach needs a so-called “truth set” of already known variants for the respective organism. This truth set is used in a machine learning approach to learn the profile of likely real variants. This method is most feasible with well-studied organisms exhibiting highly reliable variant datasets.

  • Hard filtering: In hard filtering, fixed thresholds are applied to the quality parameters of each called variant. Variants not meeting those quality thresholds will be discarded.

A third approach shall also be mentioned. In a kind of iterative process, hard filtering is first used to create an initial set of variants. This initial variant set is then used in a second step to perform VQSR. This approach has the potential drawback of introducing a bias within the hard filtering, which is then learned and applied in the VQSR. Therefore a single hard filtering step is used in this workflow.

Separating SNPs and indels

Different thresholds have to be applied to SNPs and indels, so they have to be separated first.

gatk SelectVariants -V <input.vcf.gz> -select-type SNP -O <output_snps.vcf.gz>

gatk SelectVariants -V <input.vcf.gz> -select-type INDEL -select-type MIXED \
  -O <output_indels.vcf.gz>

It is important to combine the options -select-type INDEL and -select-type MIXED, as otherwise positions showing both types of variants would be lost.

Hard filtering

In hard filtering various filters are applied. It is important not to chain the individual filters via the logical or operator (||) (see the GATK documentation). Otherwise a variant would pass the entire chained filter as soon as a single condition cannot be evaluated (e.g. because an annotation is missing), and it would no longer be apparent which criterion was violated.

gatk VariantFiltration -V <input_snps.vcf.gz> \
  -filter 'QD < 2.0' --filter-name 'QD2' \
  -filter 'QUAL < 30.0' --filter-name 'QUAL30' \
  -filter 'SOR > 3.0' --filter-name 'SOR3' \
  -filter 'FS > 60.0' --filter-name 'FS60' \
  -filter 'MQ < 40.0' --filter-name 'MQ40' \
  -filter 'MQRankSum < -12.5' --filter-name 'MQRankSum-12.5' \
  -filter 'ReadPosRankSum < -8.0' --filter-name 'ReadPosRankSum-8' \
  -O <output_snps.vcf.gz>

gatk VariantFiltration -V <input_indels.vcf.gz> \
  -filter 'QD < 2.0' --filter-name 'QD2' \
  -filter 'QUAL < 30.0' --filter-name 'QUAL30' \
  -filter 'FS > 200.0' --filter-name 'FS200' \
  -filter 'ReadPosRankSum < -20.0' --filter-name 'ReadPosRankSum-20' \
  -O <output_indels.vcf.gz>

Merging SNPs and indels

After the filtering step SNPs and indels can be reunified.

gatk SortVcf -I <input_snps.vcf.gz> -I <input_indels.vcf.gz> -O <sorted_variants.vcf.gz>

Remove filtered variants

Hard filtering only tags variants not meeting the filtering criteria; they still have to be removed from the dataset. This step essentially creates the finished dataset of called variants.

gatk SelectVariants -V <sorted_variants.vcf.gz> -O <filtered_variants.vcf.gz> \
  --exclude-filtered true

Annotating the variants

A further tool might be used to create a functional annotation of your variants. Based upon a valid genome annotation, functional annotation determines, for instance, whether a variant lies within the coding region of a gene. The effect of a variant is also computed, indicating whether it causes a synonymous, non-synonymous or nonsense mutation. The program snpEff is one option among many.

snpEff -Xmx12g <_name_> <filtered_variants.vcf.gz> \
  -dataDir <path/to/snpEffDB> \
  -configOption <_name_>.genome=<_name_> \
  -stats <snpEff_summary.html> | \
  bgzip > <annotated_variants.vcf.gz>

The option -Xmx12g increases the memory available to the Java virtual machine. This might be needed for larger genomes. Otherwise it is optional and can be changed or omitted.

DAG of the workflow

Snakemake creates a directed acyclic graph (DAG) of the workflow during the so-called DAG phase. This DAG can be visualized using the dot command, as shown below. To give an example of the workflow, two input datasets (each with forward and reverse reads) were analyzed using three HaplotypeCaller intervals. The resulting DAG of this workflow is shown in the following figure. Each rounded box represents the execution of a single Snakemake rule (node of the graph); the arrows show the succession of the rules (edges of the graph).
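
Such a visualization might, for example, be produced as follows (the output file name is arbitrary; to be run in the directory containing the Snakefile):

snakemake --dag | dot -Tsvg > dag.svg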

Workflow DAG of OVarFlow.