Configuration & adaptation

Every variant calling task and every computer system has its own specifics. Therefore, variant calling via OVarFlow can be adjusted by two configuration files. The design decision for two different files was made consciously, as both serve different purposes. One file provides the sample data, the other configures the workflow and its resource usage. Prototypes of both files are to be found within the OVarFlow repository.

The CSV file

The file samples_and_read_groups.csv is mandatory. As the name implies it provides all data that are related to the respective sample. In doing so, this file is also interesting for the biological side or cooperation partner, respectively. Therefore, the comma separated values (CSV) format was chosen, as it can easily be displayed and edited by spreadsheet applications.

The prototype of the file looks as follows:

Reference Sequence:,Huhn_2Mio.fna.gz
Reference Annotation:,Huhn_2Mio.gff.gz

Min sequence length:,2000

old gvcf to include:,fake_1.gvcf.gz,fake_2.gvcf.gz

forward reads,reverse reads,ID,PL – plattform technology,CN – sequencing center,LB – library name,SM – uniq sample name
GGA081_R1.fastq.gz,GGA081_R2.fastq.gz,id_GGA081,illumina,UBern,lib_GGA081,GGA081
GGA112_R1.fastq.gz,GGA112_R2.fastq.gz,id_GGA112,illumina,UBern,lib_GGA112,GGA112

With this file lines 1 to 9 have to be present. The number of lines starting from line 10 is arbitrary. It only depends on the number of samples that shall be processed. But don’t include empty lines. Overall the file is used to provide the following information and has to be modified accordingly:

Reference sequence: Position: line 1, field 2

Allowed endings: .fa.gz, .fna.gz, .fasta.gz
Reference annotation: Position: line 2, field 2

Allowed endings: .gff.gz

Other formats might work, but were not tested.
Min sequence length: Position: line 4, field 2

Short contigs within the reference genome can be excluded from the analysis. Here the cutoff value for the minimum sequence length is defined. This value can be set to 1 to include any contig, no matter of its length. A value has to be given in any case.
Old gvcf files: Position: line 7, field 2 to n

Allowed endings: .gvcf.gz

OVarFlow can pickup gvcf files containing the variants of single individuals. Caution: Those variants have to be called on the same reference genome as used in the current analysis! Additional gvcf files have to be listed in succession, separated by commas. The fields 2 to n may be left blank, if no given variants shall be included.
Sample information: Position: line 9 to n, fields 1 to 7

Here the sample information of the given analysis has to be listed. This includes the name of the fastq.gz files and read group data. All fields are required. There is a mandatory naming scheme for the filename suffix: _R1.fastq.gz for forward reads and _R2.fastq.gz for reverse reads. At least line 9 (a single sample) has to be present, with an arbitrary number of succeeding lines. No empty lines shall be present.

The yaml file

The file config.yaml is fully optional. Most of the time OVarFlow will work without this file. Still it provides the ability to change some internal settings of OVarFlow. OVarFlow has been highly optimized (see section: Benchmarking & Optimizations). However, not every possible analysis can be foreseen. There might be combinations of genomes and sequencing data that require different settings for the Java virtual machine. Such modifications of OVarFlow are possible through this configuration file.

Considering the purpose of this file it is obvious that it is intended for the technical or bioinformatics user of the analysis. Therefore, such information were separated from the more biological data of the CSV file.

The prototype of the file looks as follows:

# yaml file listing optionally available configuration
# options for OVarFlow
# no option nor the yaml file itself must be present
# here the default options of OVarFlow are listed

heapSize:
    SortSam         : 10
    MarkDuplicates  : 2
    HaplotypeCaller : 2
    GatherIntervals : 2
    GATKdefault     : 12

ParallelGCThreads:
    SortSam         : 2
    MarkDuplicates  : 2
    HaplotypeCaller : 2
    GatherVcfs      : 2
    CombineGVCFs    : 2
    GATKdefault     : 4

Miscellaneous:
    BwaThreads      : 6
    BwaGbMemory     : 4
    GatkHCintervals : 4
    HCnpHMMthreads  : 4
    GATKtmpDir      : "./GATK_tmp_dir/"
    MaxFileHandles  : 300
    MemoryOverhead  : 1

Debugging:
    CSV             : False
    YAML            : False

The above template also documents the default settings that are hard-coded within the OVarFlows Snakefile. Therefore, no changes would be applied using this file. Of course, all the numeric values can be changed. The purpose of the individual blocks are as follows:

heapSize

Range of accepted values: 1 - 40

Here the Java heap size is defined, that will be used by the individual GATK applications. The values are provided in Gb. A value of e.g. 4 will be equivalent to setting the Java option -Xmx4G. The given names are equivalent to the respective GATK application. Only GatherIntervals is equivalent to GatherVcfs (in earlier versions CombineGVCFs) in cases were the HaplotypeCaller was acting on individual intervals and not the whole genome. All remaining GATK applications, that are not executed in parallel, will use a default value of 12 Gb.

The default values were chosen to minimize memory footprint while still being quite generic. Of course, not every possible combination of a reference genome and sequencing data can be tested. In case that the provided amount of memory for a specific analysis is not sufficient Java will throw a java.lang.OutOfMemoryError error. Under these circumstances increase the heap size of the affected application for the specific analysis.

ParallelGCThreads

Range of accepted values: 1 - 20

The number of ParallelGCThreads has been optimized to be resource efficient while still allowing for quick data evaluation. Currently it is rather unlikely that any changes are required. Still, just to be on the save side, changes are possible. Finally changes to future GATK tools cannot be foreseen, that might require a modification. Changes are equivalent to modifying the JVM option -XX:ParallelGCThreads=<n>.

Miscellaneous

Range of accepted values:

bwa threads: 1 - 40

bwa memory: 1 - 20 (gigabyte)

HaplotypeCaller intervals: 1 - 100

HaplotypeCaller native pair hmm threads: 1 - 6

tmp directory: /tmp[/subdirectory] or ./<name>[/subdirectory]

File handles MarkDuplicates: 10 - 40000

Java memory overhead 0 - 10 (gigabyte)

Settings of the quantity of bwa mem thread and HaplotypeCaller intervals influence the parallelization of the respective application. Of course, reasonable values depend on the given hardware resources, which are ultimately limiting. No perfect generic value can be set, that is suitable for every user. The setting for bwa mem adjusts the number of parallel mapping threads. The number of HaplotypeCaller intervals splits the reference genome into intervals that are processed in parallel.

For the mapping tasks that use bwa mem, a resource request of 4 gb memory should be fine. However, some combinations of sequencing data (fastq) and reference genome (fa) might require more than 4 gb of memory. In particular in cluster usage, where resource requests are strictly controlled, exceeding these requests might result in application termination. Thus, in such a situation, memory limits have to be increased.

The HaplotpyeCaller possesses an option to adjust the number of so-called native pairHMM threads. The higher this number, the higher the CPU usage of the HaplotypeCaller (also see benchmarking section). To execute more HaplotypeCallers in parallel a value of 1 should be chosen. In this case each individual HaplotypeCaller will run a little bit slower. Otherwise 4 is the optimal value (which is the default).

The directory where GATK stores temporary data can be configured. Allowed values are /tmp and subdirectories therein as well as any directory within the current working directory (./<name>). Directory names have to consist of alphanumeric characters, ., _, - and /.

GATK MarkDuplicates tends to open an enormous amount of files. The number of open file descriptors can be beyond the limits of some systems (see ulimit -Hn and ulimit -Sn). A fixed value of 300 file descriptors is the default of OVarFlow but can be adjusted to fit custom needs.

The workflow has been designed so that memory usage can be scheduled via the --resources mem_gb=<n> command line option, just like CPU usage can be schedule through the --cores <n> directive. For Java applications, memory scheduling takes into account the respective heap size plus an additional overhead for non-heap memory. This overhead is set to 1 gigabyte by default. This setting is particularly useful when memory resources are requested in cluster usage.

Debugging

Range of accepted values: True or False (no quotation marks)

This option enables the debugging output, showing the settings provided by the csv (sample file) and the yaml file (configuration file). Actually any value that translates to a boolean Python value is possible, but True is the only reasonable choice.

Not only is the usage of the yaml file optional, also not every setting is required. If default settings for most applications shall be preserved those settings don’t need to be provided. The following example would also be a valid config.yaml file:

heapSize:
    GatherIntervals : 5

Miscellaneous:
    BwaThreads      : 8
    GatkHCintervals : 8

Memory recommendations

As noted above, some combinations of given reference genome and sequencing data may require different settings in the config.yaml file. Genome size, genome fragmentation, and sequencing depth affect memory requirements. Most thorough testing has been performed using a chicken genome (GRCg6a) and sequencing data up to 34-fold average coverage. Based on those experiences, default values were selected that were also found to be appropriate for various mammalian datasets. However, some mammalian data might require more memory, in particular if the risk of a failed workflow is to be avoided at all costs (thanks to Snakemake, a failed workflow can be restarted at any time). There are two situations where low memory requests can cause the workflow to fail:

A java.lang.OutOfMemoryError if the heap size was chosen to low. MarkDuplicates and the HaplotypeCaller might be affected by this.
An oom-kill event can occur in cluster usage where resource request are strictly controlled. An error message like the following is indicative for such a situation: Some of your processes may have been killed by the cgroup out-of-memory handler. Such an error happens when the “Java memory overhead” or if memory request for bwa is insufficient.

A minimal configuration, which should be fairly safe, looks like this:

heapSize:
    MarkDuplicates  : 4
    HaplotypeCaller : 3

Miscellaneous:
    BwaGbMemory     : 8

When choosing such values, keep in mind that unnecessarily high values block resources, especially when using compute clusters. To learn about reasonable values for the given use case, one can perform an initial run with save settings and observe the actual consumed resident set size based on the recorded benchmarks in the benchmarks directory. A sorted list of the consumed memory, for instance during mapping, can be obtained with the following shell commands:

cd benchmarks/01_mapping
for f in *bm
do
    tail -1 $f | cut -f3
done | sort -n | less

The largest values can then be used as an indicator of the upper bounds of the memory requirements of the given data set.