GATK Pitfalls

Variant calling is no trivial task, involving many different applications. Not only is the usage and interaction of those applications often far from obvious, but also especially some GATK tools possess some very intricate pitfalls. Dealing with those for the first time can be very time consuming. During the development of OVarFlow many of those pitfalls were unraveled and are now taken care of by the workflow. Still the most annoying and time consuming issues shall be explained here.

gatk SortSam --TMP_DIR

SortSam creates a lot of temporary files of the form /tmp/<user_name>/`sortingcollection.nnnnnnnnnnnnnnnnnnn.tmp (with n being any number). Depending on the size of the input data this can easily add up to dozens of gigabytes, thereby quickly consuming space under /tmp. Depending on the partition scheme further problems can result from this. Also in case of abortion of SortSam, those temporary files are not automatically deleted. In those cases manual deletion or user lockout might be required.

To circumvent this problem OVarFlow stores temporary files under /path/to/project_dir/GATK_TMP_DIR/. This is archived by SortSam’s --TMP_DIR <directory> option. As the project directory has to provide reasonable amounts of storage, large temporary files shouldn’t cause any problems here.

gatk MarkDuplicates --TMP_DIR

Just like SortSam MarkDuplicates creates a lot of temporary files. MarkDuplicates will create a directory of its own, like /tmp/<user_name>/CSPI.nnnnnnnnnnnnnnnnnnn.tmp/ (with n being any number). Inside this directories a lot of different files are created.

Again the option --TMP_DIR <directory> was utilized, to redirect the creation of temporary files to /path/to/project_dir/GATK_TMP_DIR/.

gatk MarkDuplicates -ASO | --ASSUME_SORT_ORDER

MarkDuplicates offers an option to specify the given sort order of the input data, including unsorted. Surprisingly there is an error message when using this option:

picard.PicardException: This program requires input that are either coordinate or query sorted (according to the header, or at least ASSUME_SORT_ORDER and the content.) Found ASSUME_SORT_ORDER=unsorted and header sortorder=unsorted

Therefore the previous step of using GATK SortSam is obligatory. As a side note, even though no in-depth investigation was performed, sorting with samtools also was causing issues. So sticking to SortSam is recommended.

gatk MarkDuplicates -MAX_FILE_HANDLES

A final issue with MarkDuplicates is that it opens a plethora of file handles. Depending on the setup of the respective operating system, this can cause very subtle difficulties. The maximum number of allowed open file descriptors may be too small for MarkDuplicates. The current limits may be displayed via ulimit -Hn (hard limit) and ulimit -Sn (soft limit). Often 4096 is a given limit. In OVarFlow the option -MAX_FILE_HANDLES was set to 300.

gatk HaplotypeCaller and AVX instruction

GATK was heavily optimized to make use of the given instruction set of the respective CPU. Therefore run times can be heavily influenced by the presence of CPU instructions (e.g. SSE, SSE2, AVX, AVX512). This was also mentioned in the Video of the GATK4: Live Launch Invent (approx. at 23 min). Some performance optimizations were also achieved by a cooperation with Intel engineers:

This resulted in performance optimizations with improvements for PairHMM, used in Haplotype caller, for Intel® Xeon® processors with Intel® Advanced Vector Extensions 512 (Intel® AVX 512) and Intel FPGAs.

From personal experience (no systematic assessment), the absence of AVX will cause approx. five times longer runtimes of HaplotypeCaller. As this is a matter of days or even weeks, OVarFlow will refuse to run on a CPU without AVX.

gatk HaplotypeCaller parallelization

In previous versions of GATK HaplotypeCaller an option for parallelization was available (-nct and -nt). This is no longer the case with GATK 4. To achieve multithreading with GATK 4, the GATK team recommends the use of (Apache) Spark. Unfortunately at the time of writing (June 2020, GATK 4.1.7) the Spark version of HaplotypeCaller is still BETA, with the official waring:

Use the non-spark HaplotypeCaller if you care about the results.

To circumvent this problem, OVarFlow a so called scatter gather approach is used by OVarFlow. This allows for the automatic parallel evaluation of a user defined number of intervals per analyzed individual.

Java / GATK -XX:ParallelGCThreads=<2-4>

The Java virtual machine (JVM) will automatically set the number of garbage collection (GC) threads, depending on the number of available CPU cores or threads, respectively. The performance of some GATK tools is heavily impacted by the number of GC threads. Performance is clearly declining at higher thread counts. The best performance is seen at 2-4 Java GC threads. A detailed analysis can be found within the Java GC benchmarking section of this documentation.

gatk GatherVcfs and suffix of input data

The HaplotypeCaller generates a Genomic VCF file, abbreviated as GVCF. Sometimes the file suffix .gvcf is used for such files (or .gvcf.gz for the compressed data). GatherVcfs will not accept a GVCF file if the suffix is .gvcf, it explicitly has to be .g.vcf (or .g.vcf.gz). The same file with the wrong ending will produce a quite uninformative Java stack trace:

java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1
        at java.io.BufferedReader$1.hasNext(BufferedReader.java:574)
        at java.util.Iterator.forEachRemaining(Iterator.java:115)
        ...

Also the order of the contigs in the individual GVCF files has to be identical to the order of the contigs as defined in the VCF header section. Fortunately the error message is more informative here:

ERROR   2020-11-09 14:03:20     GatherVcfs      There was a problem with gathering the INPUT.java.lang.IllegalArgument
Exception: First record in file /path/to/file/interval_2.g.vcf.gz is not after first record in previous file /path/to/file/interval_1.g.vcf.gz

To avoid the above problems CombineGVCFs can be used instead of GatherVCFs, but the merging of interval files will take considerably longer and costs more resources.

Java / GATK -Xmx<value>g

Just as default Java GC thread numbers depend on the amount of given cores, the default heap allocation by the JVM depends upon the amount of given system memory. The default values of a specific machine can be determined via java -XshowSettings:vm. This is of importance as memory usage of some GATK tools is dependent upon the Max. Heap Size value. Higher values will result in higher memory consumption, thereby potentially limiting the amount of parallel processes. In the worst case the system might even run out of memory, ultimately leading to termination of processes.

GATK --tmp-dir

Many GATK applications will create files like this /tmp/tmp_read_resource_nnnnnnnnnnnnnnnnnnn.config, where n denotes any number. Those files won’t be deleted upon program termination. It appears like tools derived from Picard applications won’t create such files but “native” GATK applications like HaplotypeCaller, CombineGVCFs, SelectVariants and so forth. The creation of such files can also be redirected to a different directory, but the syntax of this option for “native” GATK tools is a little different --tmp-dir <directory>, with lower case letters as opposed to --TMP_DIR for Picard tools.

fastqc -Xmx1g

Although not part of GATK, the tool fastqc has pitfalls of its own. Interestingly for small fastq files (approx. 50 MB gzip fastq test file) the memory allocation pool has to be set explicitly. Otherwise the Java VM aborts with the following error:

...$ fastqc -o testFASTQC -f fastq FASTQ_INPUT_DIR/GGA081_R1.fastq.gz
Started analysis of GGA081_R1.fastq.gz
Approx 5% complete for GGA081_R1.fastq.gz
Approx 10% complete for GGA081_R1.fastq.gz
Approx 15% complete for GGA081_R1.fastq.gz
Approx 20% complete for GGA081_R1.fastq.gz
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
        at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:33)
        at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:281)
        at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
        at java.lang.Thread.run(Thread.java:745)

Therefore before invoking fastqc it is advisable to set an environment variable for the Java VM like this: export _JAVA_OPTIONS='-Xmx1g -XX:ParallelGCThreads=4. It is also advisable to reduce the number of Java GC threads.

snpEff

OVarFlow performs functional annotation of the identified variants using snpEff. To perform variant annotation this tool makes use of a database of the annotated genome. Even though this is not a bug but a feature, one should be aware of the fact, that snpEff normally stores this database some within the users home directory. As those databases can consume considerable amounts of space within the users home. Therefore OVarFlow changes this default behavior and stores the snpEff database within the project directory. Keeping all project specific data together in one place is advisable anyways.