The extended BQSR workflow

The previous section listed the most basic form of variant calling, that will deliver first results. Those results can already be sufficient. On the other hand, the GATK team recommends to also perform base quality score recalibration (BQSR). Therefore a second workflow was created, that can optionally be performed in succession to the first workflow. This workflow will use the results generated in the first, basic workflow to perform BQSR and thereby hopefully deliver further improved variant calls.

Just as before in the basic workflow, this section lists the major shell commands in their basic form to perform BQSR and improved variant calling. The actual commands implemented in the workflow will include additional options to obtain improved performance of the respective GATK application. Those optimizations are not covered here.

The BQSR workflow further processes some files generated in the basic variant calling workflow. The generation of those files has to be looked up in the previous workflow.

Overview of the workflow

This section gives a brief overview of the exact GATK tools that are used and the succession of their usage. Beginning from step 16, the workflow is basically a repetition of the variant calling as performed in the basic workflow.

Base quality optimization
1. gatk BaseRecalibrator: generation of recalibration table for BQSR
2. gatk ApplyBQSR: actual base quality score recalibration of reads
3. gatk BaseRecalibrator: recalibration table after BQSR
  gatk AnalyzeCovariates: comparison of recalibrated bases
Variant calling
1. gatk HaplotypeCaller: actual variant calling
2. gatk GatherVcfs: pooling of intervals (optional)
3. gatk CombineGVCFs: pooling of called individuals
4. gatk GenotypeGVCFs: genotyping of called variants
5. gatk SelectVariants: separating of SNPs and indels
6. gatk VariantFiltration: hard filtering of SNPs and indels
7. gatk SortVcf: merging of SNPs and indels
8. gatk SelectVariants: removal of filtered variants
Variant annotation
1. snpEff: variant annotation

Base quality optimization

As the name implies base quality score recalibration (BQSR) is a processing step of the reads to optimize the given quality scores. During sequencing the base callers can introduce systematic errors, when judging the base quality (phred score). This step is supposed to improve those quality scores and therefore differentiation between real variants and just wrongly called bases. Further details are listed by the GATK team.

Initial analysis of base quality scores

gatk BaseRecalibrator -R <reference_genome> -I <mapping_of_marked_duplicates.bam> \↵
  --known-sites <called_variants.vcf.gz> -O <first_recalibration.table>

Change the given quality scores

gatk ApplyBQSR -R <reference_genome> -I <mapping_of_marked_duplicates.bam> \↵
  -bqsr <first_recalibration.table> -O <optimized_read_mapping.bam>

Analysis of recalibration effects

Second analysis of base quality scores, to judge the effect of the quality score recalibration.

gatk BaseRecalibrator -R <reference_genome> -I <optimized_read_mapping.bam> \↵
  --known-sites <called_variants.vcf.gz> -O <second_recalibration.table>

After the analysis of the improved mappings, the results before and after quality score optimization can be compared.

gatk AnalyzeCovariates -before <first_recalibration.table> \↵
  -after <second_recalibration.table> -plots <analysis_results.pdf>

Improved variant calling

The following steps 16 till 24 are basically identical to the variant calling steps as performed in the basic workflow, beginning with the HaplotypeCaller and following steps. After variant calling, annotation of the called variants is performed via SnpEff. A detailed description of the actual commands can be found in the basic workflow description.

DAG of the BQSR workflow

Again, the succession of rules (a directed acyclic graph - DAG) that are applied by Snakemake to evaluate the input data can be visualized. In the given example graph two data sets are evaluated. For each of the data sets variant calling through HaplotypeCaller is performed on three different intervals. Therefore in the given example, six HaplotypeCaller processes might be executed in parallel, given that sufficient hardware resources are available.

It has to be noted, that the shown graph cannot stand by itself. Previous data evaluation through the basic workflow has to be performed, as the generated results are further processed in the BQSR workflow.