Java Heap Space (-Xmx)

Global settings of the Java virtual machine (JVM) can have a major impact on the performance of the respective GATK tool. Java garbage collection (GC) is only one such aspect; the Java heap size also strongly influences the memory consumption of the JVM. Two options control the heap size, as listed by java -X:

...
-Xms<size>        set initial Java heap size
-Xmx<size>        set maximum Java heap size
...

When the JVM is started without these options, defaults apply. The default for -Xms is of minor importance here, but the default for -Xmx depends on the amount of memory installed in the respective machine. The value for a given machine can be determined via java -XshowSettings:vm 2>&1 | head. On several different machines the following values were obtained:

Memory    Max. Heap Size (Estimated)
16 GB     3.48 GB
64 GB     13.98 GB
256 GB    26.67 GB
512 GB    26.67 GB
1 TB      26.67 GB
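The values in the table above can be collected on any machine with a small helper. The following is only a sketch: max_heap_from_settings is a hypothetical function name, and it assumes that the JVM prints a line of the form "Max. Heap Size (Estimated): <value>", as suggested by the table header.

```shell
#!/bin/sh
# Extract the "Max. Heap Size (Estimated)" value from the output of
# `java -XshowSettings:vm`, read from standard input.
# max_heap_from_settings is a hypothetical helper name.
max_heap_from_settings() {
    sed -n 's/^[[:space:]]*Max\. Heap Size (Estimated):[[:space:]]*//p'
}

# Usage (requires a Java installation; the settings are printed on stderr):
#   java -XshowSettings:vm -version 2>&1 | max_heap_from_settings
```

Reading from standard input keeps the helper independent of the Java installation, so it can also be applied to saved output from other machines.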

As with the number of Java GC threads, the default behavior depends on the parameters of the respective machine. Heap size can have considerable effects on runtime and, more obviously, on memory usage. Therefore those GATK tools that work in parallel on several files were also monitored for various predefined heap sizes (1, 2, 4, 6, 8, 12, 16, 24, 32 and 48 GB). Besides affecting performance, a heap size that is too small leads to a lack of memory and can result in a java.lang.OutOfMemoryError.
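The series of heap sizes listed above could be iterated over as in the following sketch. It is a dry run that only prints the command prefix for each setting; java_options and print_heap_series are hypothetical helper names, and the fixed value of 2 GC threads is taken from the benchmark commands in the following sections.

```shell
#!/bin/sh
# Build the value passed to `gatk --java-options`.
# $1 = maximum heap size in GB, $2 = number of parallel GC threads
java_options() {
    printf '%s\n' "-Xmx${1}G -XX:ParallelGCThreads=${2}"
}

# Print one command prefix per tested heap size (dry run, nothing is executed).
# $1 = name of the GATK tool
print_heap_series() {
    for xmx in 1 2 4 6 8 12 16 24 32 48; do
        printf 'gatk --java-options "%s" %s\n' "$(java_options "$xmx" 2)" "$1"
    done
}

print_heap_series SortSam
```

In an actual measurement, the printed prefix would be completed with the tool arguments and wrapped in /usr/bin/time, as in the commands shown for the individual tools.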

Effect on GATK SortSam

FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" SortSam \
   -I 01_mapping/${FILE}.bam \
   -SO coordinate \
   -O ${DIR}/02_sort_gatk_${FILE}.bam \
   --TMP_DIR ./GATK_tmp_dir/ 2> ${DIR}/02_sort_gatk_${FILE}.log

[Figure: Effect of Java Heap Size on GATK SortSam]

CPU usage is negatively affected by low heap sizes and reaches a stable minimum at approx. 12 GB. Memory usage generally rises with higher Xmx settings, apart from a drop at 8 GB. The gray line in the RSS plot indicates parity between measured RSS and the configured Xmx value (RSS = Xmx). CPU and memory usage cannot both be minimized at the same time, but Xmx settings of 8 or 12 GB offer a reasonable compromise between the two. For OVarFlow, the heap size of SortSam was set to 10 GB.
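The comparison of measured RSS against the configured Xmx value can be sketched as follows. The helpers max_rss_kb, kb_to_gb and rss_vs_xmx are hypothetical names; the sketch assumes a log file written by GNU time with the -v option, which reports a line "Maximum resident set size (kbytes): <n>", and it treats 1 GB as 1024 * 1024 kB.

```shell
#!/bin/sh
# Print the peak RSS in kB recorded by `/usr/bin/time -v` in log file $1.
# With --append the log may hold several runs; the last one is reported.
max_rss_kb() {
    awk -F': ' '/Maximum resident set size/ { v = $2 } END { print v }' "$1"
}

# Convert kB to GB (assumption: 1 GB = 1024 * 1024 kB), two decimals.
kb_to_gb() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024) }'
}

# Report whether the peak RSS lies above the parity line RSS = Xmx.
# $1 = log file, $2 = Xmx in GB
rss_vs_xmx() {
    rss_gb=$(kb_to_gb "$(max_rss_kb "$1")")
    awk -v rss="$rss_gb" -v xmx="$2" 'BEGIN {
        verdict = (rss > xmx) ? "above parity" : "at or below parity"
        printf "RSS %.2f GB, Xmx %d GB: %s\n", rss, xmx, verdict
    }'
}
```

A call such as rss_vs_xmx "${LOG_FILE}" 10 would then classify a single measurement relative to the gray parity line.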

On a side note: setting identical values for Xms and Xmx did not result in higher memory usage. Even with a higher Xms value, the heap is initially backed only by the zero page; memory is counted as RSS only once it is actually accessed and written to.

Effect on GATK MarkDuplicates

FILE=SRR3041413; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" MarkDuplicates \
   -I 02_sort_gatk/${FILE}.bam \
   -O ${DIR}/03_mark_duplicates_${FILE}.bam \
   -M ${DIR}/03_mark_duplicates_${FILE}.txt \
   -MAX_FILE_HANDLES 300 \
   --TMP_DIR ./GATK_tmp_dir/ 2> ${DIR}/03_mark_duplicates_${FILE}.log

[Figure: Effect of Java Heap Size on GATK MarkDuplicates]

For Xmx settings from 1 to 24 GB, CPU usage is not noticeably affected; only 32 and 48 GB were moderately more demanding. Memory usage, on the other hand, rises nearly linearly with the Xmx setting. Lower heap sizes are therefore clearly preferable for MarkDuplicates, and the heap size was set to 2 GB.

Effect on GATK HaplotypeCaller

FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" HaplotypeCaller \
   -ERC GVCF -I 03_mark_duplicates/${FILE}.bam \
   -R processed_reference/GCF_000002315.6_GRCg6a_genomic.fa.gz \
   -O ${DIR}/${FILE}_tmp.gvcf.gz \
   -L "NC_006093.5" 2> ${DIR}/${FILE}_tmp.log

[Figure: Effect of Java Heap Size on GATK HaplotypeCaller]

CPU usage of HaplotypeCaller is not affected by different Java heap sizes. Again there is a nearly linear relation between Xmx settings and actual memory usage, but from 4 GB onwards memory usage stays well below the allowed heap size. HaplotypeCaller was set to use a Java heap size of 2 GB.

Effect on GATK GatherVcfs

FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" GatherVcfs \
   -O ${DIR}/05_gathered_samples_${FILE}.gvcf.gz \
   -I 04_haplotypeCaller/${FILE}/interval_1.g.vcf.gz \
   -I 04_haplotypeCaller/${FILE}/interval_2.g.vcf.gz \
   -I 04_haplotypeCaller/${FILE}/interval_3.g.vcf.gz \
   -I 04_haplotypeCaller/${FILE}/interval_4.g.vcf.gz \
   --TMP_DIR ./GATK_tmp_dir 2> ${DIR}/05_gathered_samples_${FILE}.log

[Figure: Effect of Java Heap Size on GATK GatherVcfs]

GatherVcfs is not significantly influenced by Java heap size settings. Only the wall time of the first measurement is considerably higher. This is due to page caching: the processed data are kept in memory after they are first accessed, so later runs read them from the cache. Overall resource usage is also very moderate, a significant advantage over CombineGVCFs, which was previously employed for this step. To allow for some tolerance, the heap size was set to 2 GB.
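To compare the wall time of the first run with later runs, the elapsed time can be read from the log written by /usr/bin/time -v. The sketch below assumes the GNU time output line "Elapsed (wall clock) time (h:mm:ss or m:ss): <value>"; wall_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Print the wall clock time in seconds for every run recorded in the
# `/usr/bin/time -v` log file $1 (one line per run, in log order).
wall_seconds() {
    awk -F': ' '/Elapsed \(wall clock\) time/ {
        # The value is either h:mm:ss or m:ss.ss; fold the fields into seconds.
        n = split($NF, t, ":")
        s = 0
        for (i = 1; i <= n; i++) s = s * 60 + t[i]
        printf "%.2f\n", s
    }' "$1"
}
```

If the first printed value is clearly larger than the following ones while all other settings are unchanged, this is consistent with the page cache effect described above.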

Deprecated: Effect on GATK CombineGVCFs

CombineGVCFs was replaced by GatherVcfs.

FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" CombineGVCFs \
   -O ${DIR}/05_gathered_samples_${FILE}.gvcf.gz \
   -R processed_reference/GCF_000002315.6_GRCg6a_genomic.fa.gz \
   -V 04_haplotypeCaller/${FILE}/interval_2.gvcf.gz \
   -V 04_haplotypeCaller/${FILE}/interval_4.gvcf.gz \
   -V 04_haplotypeCaller/${FILE}/interval_1.gvcf.gz \
   -V 04_haplotypeCaller/${FILE}/interval_3.gvcf.gz 2> ${DIR}/05_gathered_samples_${FILE}.log

[Figure: Effect of Java Heap Size on GATK CombineGVCFs]

If the allowed heap size has a clear effect on the CPU usage of CombineGVCFs, it is mostly hidden by statistical variance. RSS values, on the other hand, rise from 1 to 12 GB, where a maximum is reached. The Java heap size of CombineGVCFs was set to 2 GB.

OVarFlow and Java heap size

Overall, CPU usage is barely affected by different heap sizes. SortSam is the only exception, where low heap sizes significantly increase runtime. As expected, lower heap size settings (Xmx) are favorable for saving memory (RSS). Still, some interesting drops in memory usage could be observed for certain Xmx values.

To maximize performance while minimizing the resource usage of OVarFlow, the following values for the heap size (-Xmx<n>G) were set within the Snakefile:

  • GATK SortSam: 10 GB
  • GATK MarkDuplicates: 2 GB
  • GATK HaplotypeCaller: 2 GB
  • GATK GatherVcfs: 2 GB
  • GATK CombineGVCFs: 2 GB
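For use outside of the Snakefile, the values listed above could be kept in a small lookup function. This is only an illustrative sketch; heap_for_tool is a hypothetical helper and not part of OVarFlow.

```shell
#!/bin/sh
# Print the heap size in GB chosen for GATK tool $1, using the values
# listed above. Unknown tool names produce an error and a non-zero status.
heap_for_tool() {
    case "$1" in
        SortSam)
            echo 10 ;;
        MarkDuplicates|HaplotypeCaller|GatherVcfs|CombineGVCFs)
            echo 2 ;;
        *)
            echo "unknown tool: $1" >&2
            return 1 ;;
    esac
}

# Example: build the Java options for SortSam.
echo "-Xmx$(heap_for_tool SortSam)G"
```

Keeping the values in one place makes it easier to adjust them after repeating the measurements on other hardware.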

By manually specifying a Java heap size, memory usage of the GATK tools could clearly be improved over the default values that apply to a machine with 64 GB main memory.