The basic variant calling workflow
This section lists the most basic commands that are needed to perform variant calling. Therefore it is targeted at those people that are:
not interested in the usage of OVarFlow as a full-fledged workflow for variant calling, but are interested in the commands involved,
want to get a basic understanding of what’s going on under the hood of this workflow but get confused by the syntax of Snakemake and the Snakefile used,
novice users of GATK as GATK best practises can be confusing in the beginning.
This section lists the major shell commands in their most basic form, that are needed to perform variant calling with GATK. Essentially this will give you a very general overview of variant calling using GATK 4. It is highly recommended to also have at least a look at the section covering “GATK particularities”. This section tries to cover some of GATK’s very subtle usage issues which can have considerable effects on runtime and resource usage. Those issues won’t be covered here.
To make use of the description provided here, a basic understanding of the Unix shell bash (or similar) is required.
Overview of the workflow
In the sections below details of the exact variant calling workflow are outlined. This section is supposed to give a schematic and rather rough summary of the process. One might distinguish three phases:
Data pre-processing
fastqc
: quality control of reads (optional but recommended)bwa mem
: mapping to the reference genomegatk SortSam
: sorting of the readsgatk MarkDuplicates
: as the name implies
Preparation of variants
gatk HaplotypeCaller
: actual variant callinggatk GatherVcfs
: pooling of intervals (optional)gatk CombineGVCFs
: pooling of called individualsgatk GenotypeGVCFs
: genotyping of the called variantsgatk SelectVariants
: separating of SNPs and indelsgatk VariantFiltration
: hard filtering of SNPs and indelsgatk SortVcf
: merging of SNPs and indelsgatk SelectVariants
: removal of filtered variants
Variant annotation
snpEff
: variant annotation
Preparing the workflow
For the variant calling workflow outlined below, several files and databases have to be prepared in the first place. This is mostly related to the preprocessing of reference genomes and annotations. It is advisable to perform those steps before the actual data evaluation, as the workflow might fail if one database cannot be created. This is especially annoying if it happens in the last step, where a database for the usage of snpEff
(which is used for the functional annotation of detected variants) is required.
Compression of the reference
To save some disc space reference genomes should usually be compressed. This is often accomplished via gzip
. The drawback of this program is, that compression cannot be indexed easily. Therefore the reference genome should be compressed with bgzip
, which is very similar to gzip
.
1gunzip --to-stdout <ref_genome.gz> | bgzip > <ref_genome_recompressed.gz>
Indexing the reference genome
1samtools faidx <ref_genome_recompressed.gz>
Creating a bwa index
1bwa index <ref_genome_recompressed.gz>
A dictonary for GATK
1gatk CreateSequenceDictionary -R <ref_genome_recompressed.gz>
A database for snpEff
From personal experience it is advisable to use an annotation from the RefSeq in gff format. Processing of those files with snpEff
has been unproblematic so far. Gtf formatted files as available from Ensembl might not be parsable by snpEff
. When in doubt try creating the snpEff database first.
For snpEff
to read the annotation, it shouldn’t be compressed and be named genes.gff. Furthermore the reference should be copied to the same directory and named sequences.fa.gz. Compression of the reference sequence is fine. The directory to which the files are copied should be named after the reference genome you’re using (_name_
), but without any file specific extension.
1gunzip --to-stdout <ref_anno.gz> > <snpEffDB/_name_/genes.gff>
2cp <ref_sequence.gz> <snpEffDB/_name_/sequence.fa.gz>
snpEff
can create global databases, which will reside in your home directory by default. It is advisable to create an annotation database in your local project directory.
1snpEff -Xmx12g build -dataDir ${PWD}/snpEffDB \↵
2 -configOption <_name_>.genome=<_name_> \↵
3 -gff3 -v <_name_>
Usage of snpEff
can be daunting in the beginning. Fortunately the online documentation is quite comprehensive.
The actual variant calling
For this example procedure it is assumed, that for each individual to be analyzed, all reads are contained in only one fastq file containing the forward reads (R1) and one file containing the reverse reads (R2). If there are several files for forward and reverse reads, those have to be merged in advance.
Quality control via FastQC
1fastqc -o <output dir> -f fastq <input file.fastq.gz>
Mapping of fastq files
When using GATK bwa mem
is probably the most widely used mapper. Of course an index database has to be created in the first place (idxbase
). Directly piping into samtools will produce compressed bam files. Writing stderr to separate files will preserve any potential error or log messages.
1bwa mem -M -t <number_of_threads> -R <read_group_tag> <idxbase> \↵
2 <forward_reads.fastq> <reverse_reads.fastq> 2> <bwa_log_message> | \↵
3 samtools view -v /dev/fd/0 -o <output_mapping.bam> 2> <samtools_log_message>
Sorting of mapped bam files
Here as well as in subsequent steps the “output mapping” of the last command will act as input of this step.
1gatk SortSam -I <input_mapping.bam> -SO coordinate -O <output_mapping.bam>
Marking of duplicated reads
1gatk MarkDuplicates -I <input_mapping.bam> -O <output_mapping.bam> \↵
2 -M <metrics.txt>
Creating an index of marked mappings
1samtools index <output_mapping.bam>
Variant calling with HaplotypeCaller
Optionally OVarFlow is capable of calling variants on several intervals per individual. By executing the HaplotypeCaller on intervals a high degree of parallelization is achieved and analysis times are reduced considerably.
When applying a *.gz suffix to the output genomics variant call format file (GVCF; ending: g.vcf) the resulting file will automatically be compressed.
1gatk HaplotypeCaller -ERC GVCF -I <input_mapping.bam> -R <reference_genome> \↵
2 -O <interval_xy.g.vcf.gz>
Gathering of intervals per individual
If variant calling was performed in parallel on intervals, the resulting intervals have to be combined.
1gatk GatherVcfs -O <individual_xy.g.vcf.gz> -I <interval_1.g.vcf.gz> \↵
2 -I <interval_2.g.vcf.gz> ... -I <interval_n.g.vcf.gz>
This step could also be performed by CombineGVCFs
, but GatherVcfs
is considerably quicker. A downside of GatherVcfs
is, that it can only handle intervals that preserve the initial order of contigs. This issue is automatically taken care of in the workflow.
Combining the individual variant files
For every processed individual a single file is created. Those files have to be combined.
1gatk CombineGVCFs -O <combinde_variants.g.vcf.gz> -R <reference_genome> \↵
2 -V <individual_1.g.vcf.g> -V <individual_2.g.vcf.gz> ... -V <individual_n.g.vcf.gz>
Processing the Genotype
Even though the HaplotypeCaller writes genotype information into the initial GVCF file, this information is lost when merging the individual GVCF files. By performing joint genotyping over several individuals the genotyping accuracy is improved and the genotype information is restored.
1gatk GenotypeGVCFs -R <reference_genome> -V <input.g.vcf.gz> -O <output.vcf.gz>
Quality filtering of variants
HaplotypeCaller will also create some false positive variant calls. GATK offers two approaches to reduce the amount of false positives.
Variant quality score recalibration (VQSR): This approach needs a so-called “truth set” of already known variants for the respective organism. This truth set is used in a machine learning approach, to learn the profile of likely real variants. This method is most feasible with well studied organisms, exhibiting highly reliable variant datasets.
Hard filtering: In hard filtering solid thresholds are applied onto the quality parameters of each called variant. Variants not meeting those quality thresholds will be discarded.
A third approach shall also be mentioned. In a kind of an iterative process first hard filtering is used to create an initial data set of variants. This initial variant set is then used in a second step to perform VQSR. This approach has the potential drawback of introducing a bias within the hard filtering which is then learned and applied in the VQSR. Therefore a single hard filtering step was used in this workflow.
Separating SNPs and indels
Different thresholds have to be applied for SNPs and indels. They have to be separated in the first step.
1gatk SelectVariants -V <input.vcf.gz> -select-type SNP -O <output_snps.vcf.gz>
1gatk SelectVariants -V <input.vcf.gz> -select-type INDEL -select-type MIXED \↵
2 -O <output_indels.vcf.gz>
It is imported to combine the options -select-type INDEL
and -select-type MIXED
as otherwise positions showing both types of variants will be lost.
Hard filtering
In hard filtering various filters are applied. It is important not to chain the single filters via the logical or operator (||) (see GATK). In this case the entire filter would pass as soon as a single filter condition is not fulfilled.
1gatk VariantFiltration -V <input_snps.vcf.gz> \↵
2 -filter 'QD <2.0' --filter-name 'QD2' \↵
3 -filter 'QUAL < 30.0' --filter-name 'QUAL30' \↵
4 -filter 'SOR > 3.0' --filter-name 'SOR3' \↵
5 -filter 'FS > 60.0' --filter-name 'FS60' \↵
6 -filter 'MQ < 40.0' --filter-name 'MQ40' \↵
7 -filter 'MQRankSum < -12.5' --filter-name 'MQRankSum-12.5' \↵
8 -filter 'ReadPosRankSum < -8.0' --filter-name 'ReadPosRankSum-8' \↵
9 -O <output_snps.vcf.gz>
1gatk VariantFiltration -V <input_indels.vcf.gz> \↵
2 -filter 'QD < 2.0' --filter-name 'QD2' \↵
3 -filter 'QUAL < 30.0' --filter-name 'QUAL30' \↵
4 -filter 'FS > 200.0' --filter-name 'FS200' \↵
5 -filter 'ReadPosRankSum < -20.0' --filter-name 'ReadPosRankSum-20' \↵
6 -O <output_indels.vcf.gz>
Merging SNPs and indels
After the filtering step SNPs and indels can be reunified.
1gatk SortVcf -I <input_snps.vcf.gz> -I <input_indels.vcf.gz> -O <sorted_variants.vcf.gz>
Remove filtered variants
Hard filtering will only tag variants not meeting the filtering criteria. Still they have to be removed from the dataset. This step basically creates your finished data set containing the called variants.
1gatk SelectVariants -V <sorted_variants.vcf.gz> -O <filtered_variants.vcf.gz> \↵
2 --exclude-filtered true
Annotating the variants
A further tool might be used to create a functional annotation of your variants. Based upon a valid genome annotation, functional annotation of the variants will determine if the respective variant is for instance within a coding region of a gene. Also the effect of a variant will be computed, telling if a synonymous, non synonymous or nonsense mutation is given through the respective variant. The program snpEff
is one option between many.
1snpEff -Xmx12g <_name_> <filtered_variants.vcf.gz> \↵
2 -dataDir <path/to/snpEffDB> \↵
3 -configOption <_name_>.genome=<_name_> \↵
4 -stats <snpEff_summary.html> | \↵
5 bgzip > <annotated_variants.vcf.gz>
The option -Xmx12g
will increase the memory available to the Java virtual machine. This might be needed for larger genomes. Otherwise it’s optional and can be changed or bypassed.
DAG of the workflow
Snakemake creates a directed acyclic graph (DAG) of the workflow, within the so-called DAG phase. This DAG can be visualized using the dot
command. To give an example of the workflow, two input datasets (each with forward and reverse reads) were analyzed using three HaplotypeCaller intervals. The resulting DAG of this workflow is shown in the following figure. Each rounded box represents the execution of a single Snakemake rule (knot of the graph) the arrows show the succession of the rules (edges of the graph).