GATK Pitfalls
Variant calling is no trivial task, involving many different applications. Not only is the usage and interaction of those applications often far from obvious, but also especially some GATK tools possess some very intricate pitfalls. Dealing with those for the first time can be very time consuming. During the development of OVarFlow many of those pitfalls were unraveled and are now taken care of by the workflow. Still the most annoying and time consuming issues shall be explained here.
- gatk SortSam
--TMP_DIR
SortSam creates a lot of temporary files of the form
/tmp/<user_name>/`sortingcollection.nnnnnnnnnnnnnnnnnnn.tmp
(with n being any number). Depending on the size of the input data this can easily add up to dozens of gigabytes, thereby quickly consuming space under/tmp
. Depending on the partition scheme further problems can result from this. Also in case of abortion of SortSam, those temporary files are not automatically deleted. In those cases manual deletion or user lockout might be required.To circumvent this problem OVarFlow stores temporary files under
/path/to/project_dir/GATK_TMP_DIR/
. This is archived by SortSam’s--TMP_DIR <directory>
option. As the project directory has to provide reasonable amounts of storage, large temporary files shouldn’t cause any problems here.- gatk MarkDuplicates
--TMP_DIR
Just like SortSam MarkDuplicates creates a lot of temporary files. MarkDuplicates will create a directory of its own, like
/tmp/<user_name>/CSPI.nnnnnnnnnnnnnnnnnnn.tmp/
(with n being any number). Inside this directories a lot of different files are created.Again the option
--TMP_DIR <directory>
was utilized, to redirect the creation of temporary files to/path/to/project_dir/GATK_TMP_DIR/
.- gatk MarkDuplicates
-ASO | --ASSUME_SORT_ORDER
MarkDuplicates offers an option to specify the given sort order of the input data, including
unsorted
. Surprisingly there is an error message when using this option:picard.PicardException: This program requires input that are either coordinate or query sorted (according to the header, or at least ASSUME_SORT_ORDER and the content.) Found ASSUME_SORT_ORDER=unsorted and header sortorder=unsorted
Therefore the previous step of using GATK SortSam is obligatory. As a side note, even though no in-depth investigation was performed, sorting with
samtools
also was causing issues. So sticking to SortSam is recommended.- gatk MarkDuplicates
-MAX_FILE_HANDLES
A final issue with MarkDuplicates is that it opens a plethora of file handles. Depending on the setup of the respective operating system, this can cause very subtle difficulties. The maximum number of allowed open file descriptors may be too small for MarkDuplicates. The current limits may be displayed via
ulimit -Hn
(hard limit) andulimit -Sn
(soft limit). Often 4096 is a given limit. In OVarFlow the option-MAX_FILE_HANDLES
was set to 300.- gatk HaplotypeCaller and AVX instruction
GATK was heavily optimized to make use of the given instruction set of the respective CPU. Therefore run times can be heavily influenced by the presence of CPU instructions (e.g. SSE, SSE2, AVX, AVX512). This was also mentioned in the Video of the GATK4: Live Launch Invent (approx. at 23 min). Some performance optimizations were also achieved by a cooperation with Intel engineers:
This resulted in performance optimizations with improvements for PairHMM, used in Haplotype caller, for Intel® Xeon® processors with Intel® Advanced Vector Extensions 512 (Intel® AVX 512) and Intel FPGAs.
From personal experience (no systematic assessment), the absence of AVX will cause approx. five times longer runtimes of HaplotypeCaller. As this is a matter of days or even weeks, OVarFlow will refuse to run on a CPU without AVX.
- gatk HaplotypeCaller parallelization
In previous versions of GATK HaplotypeCaller an option for parallelization was available (
-nct
and-nt
). This is no longer the case with GATK 4. To achieve multithreading with GATK 4, the GATK team recommends the use of (Apache) Spark. Unfortunately at the time of writing (June 2020, GATK 4.1.7) the Spark version of HaplotypeCaller is still BETA, with the official waring:Use the non-spark HaplotypeCaller if you care about the results.
To circumvent this problem, OVarFlow a so called scatter gather approach is used by OVarFlow. This allows for the automatic parallel evaluation of a user defined number of intervals per analyzed individual.
- Java / GATK
-XX:ParallelGCThreads=<2-4>
The Java virtual machine (JVM) will automatically set the number of garbage collection (GC) threads, depending on the number of available CPU cores or threads, respectively. The performance of some GATK tools is heavily impacted by the number of GC threads. Performance is clearly declining at higher thread counts. The best performance is seen at 2-4 Java GC threads. A detailed analysis can be found within the Java GC benchmarking section of this documentation.
- gatk GatherVcfs and suffix of input data
The HaplotypeCaller generates a Genomic VCF file, abbreviated as GVCF. Sometimes the file suffix
.gvcf
is used for such files (or.gvcf.gz
for the compressed data). GatherVcfs will not accept a GVCF file if the suffix is.gvcf
, it explicitly has to be.g.vcf
(or.g.vcf.gz
). The same file with the wrong ending will produce a quite uninformative Java stack trace:java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1 at java.io.BufferedReader$1.hasNext(BufferedReader.java:574) at java.util.Iterator.forEachRemaining(Iterator.java:115) ...
Also the order of the contigs in the individual GVCF files has to be identical to the order of the contigs as defined in the VCF header section. Fortunately the error message is more informative here:
ERROR 2020-11-09 14:03:20 GatherVcfs There was a problem with gathering the INPUT.java.lang.IllegalArgument Exception: First record in file /path/to/file/interval_2.g.vcf.gz is not after first record in previous file /path/to/file/interval_1.g.vcf.gz
To avoid the above problems CombineGVCFs can be used instead of GatherVCFs, but the merging of interval files will take considerably longer and costs more resources.
- Java / GATK
-Xmx<value>g
Just as default Java GC thread numbers depend on the amount of given cores, the default heap allocation by the JVM depends upon the amount of given system memory. The default values of a specific machine can be determined via
java -XshowSettings:vm
. This is of importance as memory usage of some GATK tools is dependent upon the Max. Heap Size value. Higher values will result in higher memory consumption, thereby potentially limiting the amount of parallel processes. In the worst case the system might even run out of memory, ultimately leading to termination of processes.- GATK
--tmp-dir
Many GATK applications will create files like this
/tmp/tmp_read_resource_nnnnnnnnnnnnnnnnnnn.config
, wheren
denotes any number. Those files won’t be deleted upon program termination. It appears like tools derived from Picard applications won’t create such files but “native” GATK applications like HaplotypeCaller, CombineGVCFs, SelectVariants and so forth. The creation of such files can also be redirected to a different directory, but the syntax of this option for “native” GATK tools is a little different--tmp-dir <directory>
, with lower case letters as opposed to--TMP_DIR
for Picard tools.- fastqc
-Xmx1g
Although not part of GATK, the tool
fastqc
has pitfalls of its own. Interestingly for small fastq files (approx. 50 MB gzip fastq test file) the memory allocation pool has to be set explicitly. Otherwise the Java VM aborts with the following error:...$ fastqc -o testFASTQC -f fastq FASTQ_INPUT_DIR/GGA081_R1.fastq.gz Started analysis of GGA081_R1.fastq.gz Approx 5% complete for GGA081_R1.fastq.gz Approx 10% complete for GGA081_R1.fastq.gz Approx 15% complete for GGA081_R1.fastq.gz Approx 20% complete for GGA081_R1.fastq.gz Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:33) at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:281) at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88) at java.lang.Thread.run(Thread.java:745)
Therefore before invoking
fastqc
it is advisable to set an environment variable for the Java VM like this:export _JAVA_OPTIONS='-Xmx1g -XX:ParallelGCThreads=4
. It is also advisable to reduce the number of Java GC threads.- snpEff
OVarFlow performs functional annotation of the identified variants using
snpEff
. To perform variant annotation this tool makes use of a database of the annotated genome. Even though this is not a bug but a feature, one should be aware of the fact, that snpEff normally stores this database some within the users home directory. As those databases can consume considerable amounts of space within the users home. Therefore OVarFlow changes this default behavior and stores the snpEff database within the project directory. Keeping all project specific data together in one place is advisable anyways.