The BQSR workflow

The workflow previously described already supplies a fully annotated set of variants. To further refine the called variants, the GATK team recommends performing base quality score recalibration (BQSR). BQSR was therefore implemented in a second workflow, which can optionally be run in succession to the first workflow to further improve the called variants. The most obvious downside is that executing a second workflow increases the time until the final results are available, nearly doubling it. A second point should also be considered: the called variants are used to refine the given data set itself, and this self-improvement might potentially introduce a certain bias. Still, this procedure is strongly recommended by the GATK team and can therefore be considered sound.

Setup & preparations

The BQSR workflow requires some additional applications. GATK AnalyzeCovariates in particular relies heavily on R and several R packages. If those packages are missing, error logs will be written under logs/15_analyze_BQSR/<sample>_AnalyzeCovariates.log, like the following:

Stderr: Error in library(gplots) : there is no package called ‘gplots’

Therefore a specialized Conda environment has to be created for the BQSR workflow. This Conda environment includes all applications used in the normal workflow, plus R and the required R packages. If the BQSR workflow is going to be run anyway, this Conda environment can also be used for the normal workflow. It is created like this (the YAML file is found in the repository):

conda env create --prefix /path/to/project_dir/BQSR_env --file BQSR_dependencies_mini.yml

Finally the new Conda environment has to be activated:

conda activate /path/to/project_dir/BQSR_env
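As an optional sanity check, the presence of the R packages required by GATK AnalyzeCovariates can be tested directly within the activated environment. This is just a minimal sketch; gplots is the package named in the error message above, further required packages may differ between GATK versions:

# try loading gplots; a missing package would reproduce
# the AnalyzeCovariates error shown above
Rscript -e 'library(gplots)' && echo "R packages OK"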

Now the actual workflow files have to be copied from the repository into the project directory (an example copy command follows the file list):

/path/to/project_dir/variant_calling/SnakefileBQSR
/path/to/project_dir/variant_calling/configBQSR.yaml # optional
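A minimal sketch of the copy step, assuming the repository was cloned to /path/to/repository (adjust both paths to the actual setup):

# copy the BQSR Snakefile and, optionally, its configuration template
cp /path/to/repository/SnakefileBQSR /path/to/project_dir/variant_calling/
cp /path/to/repository/configBQSR.yaml /path/to/project_dir/variant_calling/   # optional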

Workflow usage

The BQSR workflow builds upon the normal workflow. As a result, the normal variant calling has to be performed first, and the following files have to be created through this workflow (a sketch for checking their presence follows the list):

/path/to/project_dir/variant_calling/03_mark_duplicates/<file_names>.bam
/path/to/project_dir/variant_calling/03_mark_duplicates/<file_names>.bam.bai
/path/to/project_dir/variant_calling/11_filtered_removed_VCF/variants_filtered.vcf.gz
/path/to/project_dir/variant_calling/interval_lists/<interval_xy>.list
/path/to/project_dir/variant_calling/processed_reference/<file_name>.fa.gz
/path/to/project_dir/variant_calling/processed_reference/<file_name>.fa.gz.fai
/path/to/project_dir/variant_calling/processed_reference/<file_name>.fa.gz.gzi
/path/to/project_dir/variant_calling/processed_reference/<file_name>.dict
/path/to/project_dir/variant_calling/snpEffDB/<directory_name>/genes.gff
/path/to/project_dir/variant_calling/snpEffDB/<directory_name>/sequences.fa.gz
/path/to/project_dir/variant_calling/snpEffDB/<directory_name>/snpEffectPredictor.bin
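The following optional sketch checks a few of the listed prerequisites; the placeholders have to be adjusted to the actual project:

cd /path/to/project_dir/variant_calling/
# report any prerequisite that cannot be found
for f in 03_mark_duplicates/*.bam \
         11_filtered_removed_VCF/variants_filtered.vcf.gz \
         processed_reference/*.dict \
         snpEffDB/*/snpEffectPredictor.bin; do
    ls "$f" > /dev/null 2>&1 || echo "missing: $f"
done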

An initial dry run can be performed to test for any missing files, spelling mistakes and the like. In any case the Snakefile for the BQSR workflow must be specified explicitly (-s option), otherwise the normal workflow might be re-executed:

cd /path/to/project_dir/variant_calling/
snakemake -np -s SnakefileBQSR

Finally the BQSR workflow can be executed. To achieve parallelization and thereby shorter runtimes, the number of threads to use (--cores) can be specified (depending on the given infrastructure):

snakemake -p --cores <threads> -s SnakefileBQSR
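Since the combined runtime of both workflows is considerable, it may be convenient to detach the run from the terminal. A sketch using standard shell tools (the log file name is an arbitrary example):

# run in the background, keep running after logout, log to a file
nohup snakemake -p --cores <threads> -s SnakefileBQSR > BQSR_run.log 2>&1 &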

The workflow will create the following directories:

13_start_BQSR
14_apply_BQSR
15_analyze_BQSR
16_haplotypeCaller_2
17_gathered_samples_2
18_combined_calls_2
19_genotypeGVCFs_2
20_split_SNPs_Indels_2
21_hard_filtering_2
22_merged_filtered_VCF_2
23_filtered_removed_VCF_2
24_annotated_variants_2
logs/<various_sub_directories>
benchmarks/<various_sub_directories>

Optimized workflow execution

The workflow consists of different phases, which can be parallelized to varying degrees. Rules that are not on the critical path can be postponed into phases that cannot be parallelized as much, where they fill otherwise idle cores. This helps in optimizing the overall runtime and can be achieved with the --prioritize switch, which assigns the highest priority to a given target rule and its direct dependencies:

snakemake -p --cores <threads> -s SnakefileBQSR --prioritize genotypeGVCFs_2

As mentioned above, the BQSR workflow depends on the previous execution of the normal workflow. It is possible to run both workflows in direct succession; in this case the BQSR Conda environment has to be activated for both. A single command can then be used:

snakemake -p --cores <threads>; \
snakemake -p --cores <threads> -s SnakefileBQSR

There must be no unescaped line break between the two commands (hence the trailing backslash). It is also possible to write both commands on a single line.
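A variant of this chaining, using && instead of ;, ensures that the BQSR workflow only starts if the normal workflow finished successfully:

snakemake -p --cores <threads> && snakemake -p --cores <threads> -s SnakefileBQSR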

Workflow configuration

The workflow automatically detects the files that were generated during the first workflow and further processes those. Therefore the samples_and_read_groups.csv file is neither needed nor used by the BQSR workflow.

There is one optional configuration file, named configBQSR.yaml. A template of this file can be found in the repository. It is intended to configure the internal behavior of the workflow, mainly Java heap size and the number of garbage collection threads. This file is the equivalent of the config.yaml used in the first workflow.

# yaml file listing the optionally available configuration
# options for OVarFlow BQSR
# neither the options nor the yaml file itself have to be present
# here the default options of OVarFlow are listed

ParallelGCThreads:
     BaseRecalibrator  : 2
     ApplyBQSR         : 2
     AnalyzeCovariates : 2
     HaplotypeCaller   : 2
     GatherIntervals   : 2
     CombineGVCFs      : 2
     GATKdefault       : 4

heapSize:
     BaseRecalibrator  : 2
     ApplyBQSR         : 2
     AnalyzeCovariates : 2
     HaplotypeCaller   : 2
     GatherIntervals   : 2
     CombineGVCFs      : 2
     GATKdefault       : 12

Miscellaneous:
     GATKtmpDir        : "GATK_tmp_dir"
     HCnpHMMthreads    : 4
     DebuggingYAML     : False
     DebuggingSAMPLE   : False

The above template also documents the default settings used in the BQSR workflow. This workflow uses some additional GATK applications, which can be optimized through this configuration file. The values after the colon can be adjusted as follows:

heapSize

Range of accepted values: 1 - 40

Concerning Java heap size, the same general rules apply as described in the Configuration & adaptation section under The yaml file.

ParallelGCThreads

Range of accepted values: 1 - 20

Concerning Java garbage collection threads, the same general rules apply as described in the Configuration & adaptation section under The yaml file.

Miscellaneous

Range of accepted values:

GATKtmpDir: /tmp[/subdirectory] or ./<name>[/subdirectory]
HCnpHMMthreads: 1 - 4
DebuggingYAML: False or True
DebuggingSAMPLE: False or True

In this section the directory used by GATK to store temporary data can be adjusted. The default is to use a directory GATK_tmp_dir within the project directory.

The number of native pair HMM threads used by GATK HaplotypeCaller can also be adjusted. A value of 1 can increase parallelization, meaning more HaplotypeCaller processes can run in parallel, while a value of 4 gives the quickest execution of each individual HaplotypeCaller process.

Debugging output of the settings made in this YAML file can be enabled via DebuggingYAML, basically echoing the settings made.

Finally, debugging output of the input data processed during the analysis can be enabled via DebuggingSAMPLE.

As before, the YAML file can be reduced to only those values that shall be changed; see the Configuration & adaptation section.
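For illustration, a shortened configBQSR.yaml might look like this (the chosen values are arbitrary examples within the accepted ranges, not recommendations):

# only deviations from the defaults have to be listed
heapSize:
     HaplotypeCaller   : 4

Miscellaneous:
     HCnpHMMthreads    : 1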

Container usage

As with the “normal workflow”, a container is available for the BQSR workflow. This container includes everything that is needed for the entire variant calling: the normal workflow, the BQSR workflow and all the required software. Of course this comes at the cost of a larger container. Users who only intend to perform variant calling without BQSR can stick to the smaller container. In any case, using a container frees the user from installing Conda and creating a Conda environment. All of this comes with the container.

As before, there are plenty of ways to use the container. First of all, Docker or Singularity can be used as the container technology. Then there are various ways to execute the workflow with the respective container technology. The following description focuses on Singularity, as it is more straightforward than Docker. Still, there are several ways to use Singularity containers. The list below ranges from highly automated to more manual usage; the latter allows more control over the workflows, if needed.

Automatic start of the workflow

The entire variant calling including BQSR can be started with a single command. First of all, however, a project directory and a CSV configuration file have to be prepared. Those steps are described in detail under Conda & Snakemake usage => Preparing OVarFlow / The CSV configuration file. The workflow is then started via:

singularity run --bind /path/to/project_dir/:/input/ OV_BQSR.sif

If all preparations were done correctly, various log messages will appear, starting with:

No THREADS variable set. Using default settings.

Using the following number of threads:
  ->_xy_<-
Starting OVarFlow now:

Building DAG of jobs...
...

In case something is still missing, for instance if the annotation file is named incorrectly in the CSV configuration file (here: SampleSeq.gff instead of SampleAnn.gff), an error message like the following might be shown:

Building DAG of jobs...
MissingInputException in line 851 of /snakemake/Snakefile:
Missing input files for rule create_snpEff_db:
REFERENCE_INPUT_DIR/SampleSeq.gff
IndexError in line 12 of /snakemake/SnakefileBQSR:
list index out of range
  File "/snakemake/SnakefileBQSR", line 12, in <module>

By default, the number of used threads equals the number of available cores (or threads) minus 4. Setting an environment variable before starting Singularity modifies the number of used threads:

export THREADS=<desired_number_of_threads>
singularity run --bind /path/to/project_dir/:/input/ OV_BQSR.sif
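For example, to use all available cores instead of the default (this relies on the standard nproc utility on the host; whether fully loading the machine is sensible depends on the infrastructure):

# use every available core for the workflow
export THREADS=$(nproc)
singularity run --bind /path/to/project_dir/:/input/ OV_BQSR.sif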

Manual start of the workflows

The contents of the container can also be made available within a shell. This way the Snakefiles of the two workflows are directly accessible:

user@host:~$ singularity shell --bind /path/to/project_dir/:/input OV_BQSR.sif
Singularity> snakemake -v
5.26.1
Singularity> cd /input
Singularity> snakemake -p --cores <n> -s /snakemake/Snakefile
Singularity> snakemake -p --cores <n> -s /snakemake/SnakefileBQSR