The extended BQSR workflow
The previous section listed the most basic form of variant calling, that will deliver first results. Those results can already be sufficient. On the other hand, the GATK team recommends to also perform base quality score recalibration (BQSR). Therefore a second workflow was created, that can optionally be performed in succession to the first workflow. This workflow will use the results generated in the first, basic workflow to perform BQSR and thereby hopefully deliver further improved variant calls.
Just as before in the basic workflow, this section lists the major shell commands in their basic form to perform BQSR and improved variant calling. The actual commands implemented in the workflow will include additional options to obtain improved performance of the respective GATK application. Those optimizations are not covered here.
The BQSR workflow further processes some files generated in the basic variant calling workflow. The generation of those files has to be looked up in the previous workflow.
Overview of the workflow
This section gives a brief overview of the exact GATK tools that are used and the succession of their usage. Beginning from step 16, the workflow is basically a repetition of the variant calling as performed in the basic workflow.
Base quality optimization
gatk BaseRecalibrator
: generation of recalibration table for BQSRgatk ApplyBQSR
: actual base quality score recalibration of readsgatk BaseRecalibrator
: recalibration table after BQSR
gatk AnalyzeCovariates
: comparison of recalibrated bases
Variant calling
gatk HaplotypeCaller
: actual variant callinggatk GatherVcfs
: pooling of intervals (optional)gatk CombineGVCFs
: pooling of called individualsgatk GenotypeGVCFs
: genotyping of called variantsgatk SelectVariants
: separating of SNPs and indelsgatk VariantFiltration
: hard filtering of SNPs and indelsgatk SortVcf
: merging of SNPs and indelsgatk SelectVariants
: removal of filtered variants
Variant annotation
snpEff
: variant annotation
Base quality optimization
As the name implies base quality score recalibration (BQSR) is a processing step of the reads to optimize the given quality scores. During sequencing the base callers can introduce systematic errors, when judging the base quality (phred score). This step is supposed to improve those quality scores and therefore differentiation between real variants and just wrongly called bases. Further details are listed by the GATK team.
Initial analysis of base quality scores
1gatk BaseRecalibrator -R <reference_genome> -I <mapping_of_marked_duplicates.bam> \↵
2 --known-sites <called_variants.vcf.gz> -O <first_recalibration.table>
Change the given quality scores
1gatk ApplyBQSR -R <reference_genome> -I <mapping_of_marked_duplicates.bam> \↵
2 -bqsr <first_recalibration.table> -O <optimized_read_mapping.bam>
Analysis of recalibration effects
Second analysis of base quality scores, to judge the effect of the quality score recalibration.
1gatk BaseRecalibrator -R <reference_genome> -I <optimized_read_mapping.bam> \↵
2 --known-sites <called_variants.vcf.gz> -O <second_recalibration.table>
After the analysis of the improved mappings, the results before and after quality score optimization can be compared.
1gatk AnalyzeCovariates -before <first_recalibration.table> \↵
2 -after <second_recalibration.table> -plots <analysis_results.pdf>
Improved variant calling
The following steps 16 till 24 are basically identical to the variant calling steps as performed in the basic workflow, beginning with the HaplotypeCaller and following steps. After variant calling, annotation of the called variants is performed via SnpEff. A detailed description of the actual commands can be found in the basic workflow description.
DAG of the BQSR workflow
Again, the succession of rules (a directed acyclic graph - DAG) that are applied by Snakemake to evaluate the input data can be visualized. In the given example graph two data sets are evaluated. For each of the data sets variant calling through HaplotypeCaller is performed on three different intervals. Therefore in the given example, six HaplotypeCaller processes might be executed in parallel, given that sufficient hardware resources are available.
It has to be noted, that the shown graph cannot stand by itself. Previous data evaluation through the basic workflow has to be performed, as the generated results are further processed in the BQSR workflow.