Resource requirements

Variant calling is a computationally demanding task. Three main hardware factors can become a bottleneck in the data analysis: (1) the number and speed of central processing units (CPUs), or more precisely cores, (2) the amount of main memory and (3) disk space. None of these components can be considered in isolation. For example, when the number of CPUs is increased, memory size might become the new bottleneck. The exact hardware requirements depend on the size of the respective project and on the acceptable waiting time for the calculations to finish. The requirements are influenced by several factors:

  • the genome size of the organism under study,

  • the frequency of genomic variants,

  • the sequencing depth and

  • the number of individuals to be analyzed.
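To illustrate why CPU count and memory cannot be sized independently, the following minimal sketch estimates how many jobs fit on a machine at once. The function name, the parameters and all numbers are hypothetical and chosen for illustration only; they are not measurements from OVarFlow.

```python
def max_parallel_jobs(cores, memory_gb, threads_per_job, memory_per_job_gb):
    """Return how many jobs can run at once and which resource limits them.

    All arguments are illustrative assumptions, not OVarFlow settings.
    """
    by_cpu = cores // threads_per_job                 # jobs the cores allow
    by_memory = int(memory_gb // memory_per_job_gb)   # jobs the memory allows
    limit = "cpu" if by_cpu <= by_memory else "memory"
    return min(by_cpu, by_memory), limit


# Hypothetical machine with 8 cores and 64 GB: the cores are the limit.
print(max_parallel_jobs(cores=8, memory_gb=64, threads_per_job=2, memory_per_job_gb=8))
# Same memory, but 32 cores: memory is now the limit.
print(max_parallel_jobs(cores=32, memory_gb=64, threads_per_job=2, memory_per_job_gb=8))
```

With these assumed numbers, quadrupling the core count only doubles the number of concurrent jobs, because memory becomes the limiting resource.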

OVarFlow has been designed to adapt to various project sizes and to scale with the provided computational resources. Whenever possible, tasks are executed in parallel. In particular, the GATK HaplotypeCaller, which is probably the biggest single bottleneck, has been optimized for parallelization: the more intervals the user specifies, the higher the degree of parallelization of the HaplotypeCaller.
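The relationship between the number of intervals and the achievable parallelism can be sketched as follows. This is only an illustration, assuming that whole contigs are distributed greedily over a user-chosen number of groups; the helper `group_contigs` and the contig lengths are hypothetical, and the way OVarFlow actually generates its intervals may differ.

```python
import heapq


def group_contigs(contig_lengths, n_intervals):
    """Greedily distribute contigs over n_intervals groups of similar total length.

    Hypothetical helper for illustration; not OVarFlow's own implementation.
    """
    # Heap of (total_length, group_index); the smallest group is always on top.
    heap = [(0, i) for i in range(n_intervals)]
    groups = [[] for _ in range(n_intervals)]
    # Place the longest contigs first so the groups end up balanced.
    for name, length in sorted(contig_lengths.items(), key=lambda kv: -kv[1]):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return groups


# Made-up contig lengths (arbitrary units).
contigs = {"chr1": 100, "chr2": 80, "chr3": 60, "chr4": 40, "chr5": 20}
for n in (1, 2, 4):
    print(n, group_contigs(contigs, n))
```

Each group can be handed to an independent HaplotypeCaller process, so the number of groups sets an upper bound on how many such processes can run side by side, provided enough cores and memory are available.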

The total resource requirements depend on the resource usage of the individual applications used in OVarFlow. Therefore, the resource usage of these applications has been investigated closely, with a focus on CPU and memory usage. The following sections document the results obtained and the optimizations applied.