Java Heap Space (-Xmx)
Global settings of the Java virtual machine (JVM) can have a major impact on the performance of the respective GATK tool. Java garbage collection (GC) is only one aspect in this regard. The Java heap size settings also strongly influence the memory consumption of the JVM. Two options control the heap size, as shown by java -X:
...
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
...
By default, -Xms is not set explicitly when the JVM starts (and it is the less important of the two options), whereas the default value of -Xmx depends on the amount of memory the respective machine has to offer. The values for a given machine can be determined via java -XshowSettings:vm 2>&1 | head. The following values were obtained on various machines:
Memory | Max. Heap Size (Estimated)
16 Gb  | 3.48 Gb
64 Gb  | 13.98 Gb
256 Gb | 26.67 Gb
512 Gb | 26.67 Gb
1 Tb   | 26.67 Gb
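To compare a machine against the table above, the default maximum heap size can also be read in a form that is easier to parse. The following is a minimal sketch, assuming a HotSpot-based JVM whose -XX:+PrintFlagsFinal output lists MaxHeapSize in bytes. The function names bytes_to_gib and default_max_heap_gib are hypothetical helpers, not part of OVarFlow or GATK.

```shell
#!/usr/bin/env bash

# Convert a byte count to GiB with two decimals (plain arithmetic via awk).
bytes_to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.2f\n", bytes / (1024 * 1024 * 1024) }'
}

# Print the default maximum heap size of the local JVM in GiB.
# Assumption: a HotSpot JVM, where -XX:+PrintFlagsFinal prints a line
# of the form "<type> MaxHeapSize = <bytes> ...".
default_max_heap_gib() {
    local bytes
    bytes=$(java -XX:+PrintFlagsFinal -version 2>/dev/null |
            awk '$2 == "MaxHeapSize" { print $4 }')
    [ -n "$bytes" ] && bytes_to_gib "$bytes"
}

# Only query the JVM when java is actually installed.
if command -v java >/dev/null 2>&1; then
    echo "Default max heap: $(default_max_heap_gib) GiB"
fi
```

The value printed this way should correspond to the "Max. Heap Size (Estimated)" line of java -XshowSettings:vm, although the rounding may differ.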
Just as with the number of Java GC threads, this is a situation where the default behavior depends on the parameters of the respective machine. Heap size can have considerable effects on runtime and, obviously, even more so on memory usage. Therefore, those GATK tools that work in parallel on several files were also monitored for various predefined heap sizes (1, 2, 4, 6, 8, 12, 16, 24, 32 and 48 Gb). Besides affecting performance, a heap size that is too small leads to a lack of memory and can result in a java.lang.OutOfMemoryError.
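The series of heap sizes listed above lends itself to a simple loop. The sketch below only illustrates how such a series could be driven and is not the benchmark script that was actually used. The helper names java_options and ran_out_of_memory are hypothetical, and the loop performs a dry run that prints the JVM options instead of calling gatk.

```shell
#!/usr/bin/env bash

# Heap sizes (in Gb) that were tested according to the text above.
HEAP_SIZES="1 2 4 6 8 12 16 24 32 48"

# Build the JVM option string for one run.
# $1 = heap size in Gb, $2 = number of parallel GC threads.
java_options() {
    printf -- '-Xmx%sG -XX:ParallelGCThreads=%s\n' "$1" "$2"
}

# Return 0 if the given log file reports that the JVM ran out of memory.
ran_out_of_memory() {
    grep -q 'java.lang.OutOfMemoryError' "$1"
}

# Dry run: print the options for every heap size instead of calling gatk.
for XMX in $HEAP_SIZES; do
    echo "Xmx=${XMX}G -> $(java_options "$XMX" 2)"
done
```

In a real run, the echo line would be replaced by the respective gatk call, and ran_out_of_memory could be applied to the resulting log file to flag heap sizes that were too small.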
Effect on GATK SortSam
FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" SortSam \
  -I 01_mapping/${FILE}.bam \
  -SO coordinate \
  -O ${DIR}/02_sort_gatk_${FILE}.bam \
  --TMP_DIR ./GATK_tmp_dir/ 2> ${DIR}/02_sort_gatk_${FILE}.log
CPU usage is negatively affected by low heap sizes and reaches a stable minimum at approx. 12 Gb. Memory usage generally rises with higher Xmx values, apart from a drop at 8 Gb. The gray line in the RSS plot indicates parity between the measured RSS and the Xmx setting (RSS = Xmx). CPU and memory usage evidently cannot both be minimized at the same time. Still, Xmx settings of 8 or 12 Gb offer a good compromise between the two. In OVarFlow, SortSam was therefore set to 10 Gb.
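The comparison of measured RSS against the Xmx setting can be reproduced from the log written by /usr/bin/time -v, which contains a line "Maximum resident set size (kbytes)". The following is a minimal sketch, assuming GNU time and treating the document's "Gb" as GiB. The function names max_rss_gb and rss_vs_xmx are hypothetical.

```shell
#!/usr/bin/env bash

# Print the peak RSS (in GiB, two decimals) of every run recorded in a
# GNU time -v log. With --append, one line is printed per recorded run.
max_rss_gb() {
    awk -F': ' '/Maximum resident set size \(kbytes\)/ {
        printf "%.2f\n", $2 / (1024 * 1024)
    }' "$1"
}

# Report on which side of the parity line (RSS = Xmx) a measurement lies.
# $1 = peak RSS in GiB, $2 = Xmx value in GiB.
rss_vs_xmx() {
    awk -v rss="$1" -v xmx="$2" 'BEGIN {
        print ((rss + 0 > xmx + 0) ? "above" : "at or below")
    }'
}

# Example use, only if LOG_FILE points to an existing log.
if [ -n "${LOG_FILE:-}" ] && [ -f "$LOG_FILE" ]; then
    max_rss_gb "$LOG_FILE"
fi
```

Values "above" the parity line are possible, since RSS covers the whole process and not only the Java heap.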
On a side note: setting identical values for Xms and Xmx did not result in higher memory usage. Even with a higher Xms value, the memory is initially only mapped to the zero page. Memory is only counted as RSS once it is actually accessed and written to.
Effect on GATK MarkDuplicates
FILE=SRR3041413; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" MarkDuplicates \
  -I 02_sort_gatk/${FILE}.bam \
  -O ${DIR}/03_mark_duplicates_${FILE}.bam \
  -M ${DIR}/03_mark_duplicates_${FILE}.txt \
  -MAX_FILE_HANDLES 300 \
  --TMP_DIR ./GATK_tmp_dir/ 2> ${DIR}/03_mark_duplicates_${FILE}.log
For Xmx settings from 1 to 24 Gb, CPU usage is not noticeably affected. Only 32 and 48 Gb were moderately more demanding. Memory usage, on the other hand, rises nearly linearly with the Xmx setting. Lower heap sizes are therefore clearly preferable for MarkDuplicates, and the heap size was set to 2 Gb.
Effect on GATK HaplotypeCaller
FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" HaplotypeCaller \
  -ERC GVCF -I 03_mark_duplicates/${FILE}.bam \
  -R processed_reference/GCF_000002315.6_GRCg6a_genomic.fa.gz \
  -O ${DIR}/${FILE}_tmp.gvcf.gz \
  -L "NC_006093.5" 2> ${DIR}/${FILE}_tmp.log
CPU usage of HaplotypeCaller is not affected by different Java heap sizes. Again, there is a nearly linear relation between the Xmx setting and actual memory usage, but from 4 Gb onwards memory usage stays well below the allowed heap size. HaplotypeCaller was set to use 2 Gb of memory for the Java heap.
Effect on GATK GatherVcfs
FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" GatherVcfs \
  -O ${DIR}/05_gathered_samples_${FILE}.gvcf.gz \
  -I 04_haplotypeCaller/${FILE}/interval_1.g.vcf.gz \
  -I 04_haplotypeCaller/${FILE}/interval_2.g.vcf.gz \
  -I 04_haplotypeCaller/${FILE}/interval_3.g.vcf.gz \
  -I 04_haplotypeCaller/${FILE}/interval_4.g.vcf.gz \
  --TMP_DIR ./GATK_tmp_dir 2> ${DIR}/05_gathered_samples_${FILE}.log
GatherVcfs is not significantly influenced by Java heap size settings. Only the wall time of the first measurement is considerably higher. This is due to page caching: the processed data are kept in memory after they are first accessed, so later runs read them faster. Overall resource usage is also very moderate, which is a significant advantage over CombineGVCFs, which was previously employed for this step. To allow for some resource tolerance, the heap size was set to 2 Gb.
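The page cache effect on the first measurement could be evened out by reading the input files once before the timed runs start. The sketch below shows one possible way to do this and was not part of the measurements described here. The function name warm_page_cache is hypothetical, and it is assumed that enough free memory is available for the operating system to keep the files cached.

```shell
#!/usr/bin/env bash

# Read every given file once and discard the data, so that later runs
# can find the contents in the operating system's page cache.
# Unreadable or missing files are silently skipped.
warm_page_cache() {
    local f
    for f in "$@"; do
        [ -r "$f" ] && cat "$f" > /dev/null
    done
    return 0
}
```

A call such as warm_page_cache 04_haplotypeCaller/${FILE}/*.g.vcf.gz before the first heap size is measured would apply this to the input files of the GatherVcfs command above.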
Deprecated: Effect on GATK CombineGVCFs
CombineGVCFs was replaced by GatherVcfs.
FILE=SRR3041137; GC=2
/usr/bin/time -o ${LOG_FILE} --append -v \
gatk --java-options "-Xmx${XMX}G -XX:ParallelGCThreads=${GC}" CombineGVCFs \
  -O ${DIR}/05_gathered_samples_${FILE}.gvcf.gz \
  -R processed_reference/GCF_000002315.6_GRCg6a_genomic.fa.gz \
  -V 04_haplotypeCaller/${FILE}/interval_2.gvcf.gz \
  -V 04_haplotypeCaller/${FILE}/interval_4.gvcf.gz \
  -V 04_haplotypeCaller/${FILE}/interval_1.gvcf.gz \
  -V 04_haplotypeCaller/${FILE}/interval_3.gvcf.gz 2> ${DIR}/05_gathered_samples_${FILE}.log
If the allowed heap size has a clear effect on the CPU usage of CombineGVCFs, it is mostly hidden by statistical variance. RSS values, on the other hand, rise from 1 to 12 Gb, where a maximum is reached. The Java heap size of CombineGVCFs was set to 2 Gb.
OVarFlow and Java heap size
Overall, CPU usage is barely affected by different heap sizes. SortSam is the only exception, where low heap sizes significantly increase runtime. As expected, lower heap size settings (Xmx) save memory (RSS). Still, some notable drops in memory usage were observed at certain Xmx values.
To maximize performance while minimizing resource usage of OVarFlow, the following values for the heap size (-Xmx<n>G) were set within the Snakefile:
GATK SortSam: 10 Gb
GATK MarkDuplicates: 2 Gb
GATK HaplotypeCaller: 2 Gb
GATK GatherVcfs: 2 Gb
GATK CombineGVCFs: 2 Gb
By manually specifying the Java heap size, the memory usage of the GATK tools was clearly improved over the default values that apply to a machine with 64 Gb of main memory.
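For use outside of the workflow, the per-tool values listed above can be captured in a small lookup. The following is an illustrative sketch in shell and does not reproduce the contents of the OVarFlow Snakefile. The function name heap_for_tool is hypothetical.

```shell
#!/usr/bin/env bash

# Print the heap size (in Gb) chosen for a GATK tool, using the values
# listed in this section. Unknown tools produce an error and status 1.
heap_for_tool() {
    case "$1" in
        SortSam)
            echo 10 ;;
        MarkDuplicates|HaplotypeCaller|GatherVcfs|CombineGVCFs)
            echo 2 ;;
        *)
            echo "unknown tool: $1" >&2
            return 1 ;;
    esac
}

# Example: build the --java-options value for SortSam.
echo "-Xmx$(heap_for_tool SortSam)G"
```

The printed string can be passed to gatk via --java-options, in the same way as in the commands shown earlier in this section.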