Hardware recommendations

Resource requirements for variant calling with OVarFlow depend on the size of the respective project. The main factors are the genome size of the organism and, of course, the number of data sets or individuals within the study. Nonetheless, some general recommendations can be made. A first impression can also be obtained from the benchmarking of the “entire workflow” (especially the last image of that section).

Concerning the hardware, three key components have to be considered:

CPU

Of course, high single thread performance is always helpful to accelerate calculations. However, major parts of variant calling can be parallelized, so that several data sets (and even intervals) are processed in parallel. Therefore a high number of CPU cores is considerably more helpful. Ultimately, a shortage of processing power only results in longer waiting times until results are available.

As a final note: the CPU must support the AVX (Advanced Vector Extensions) instruction set extension, as this drastically accelerates the calculations performed by HaplotypeCaller.
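
Whether a given machine offers AVX can be checked, for instance, by inspecting /proc/cpuinfo. The following Python sketch is purely illustrative and assumes a Linux system:

    # Minimal sketch (assuming a Linux system) to check for AVX support by
    # inspecting /proc/cpuinfo; other operating systems need a different approach.
    def cpu_supports_avx(cpuinfo_path="/proc/cpuinfo"):
        """Return True if the CPU flags listed in /proc/cpuinfo contain 'avx'."""
        with open(cpuinfo_path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return "avx" in line.split()
        return False

    if __name__ == "__main__":
        if cpu_supports_avx():
            print("AVX available: HaplotypeCaller can use its accelerated routines.")
        else:
            print("No AVX detected: expect HaplotypeCaller to be drastically slower.")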

Main memory (RAM)

OVarFlow has been designed and tested to be quite memory efficient. Still, no definitive figure for the memory requirements can be given. In the end they depend on the number of samples, the number of HaplotypeCaller intervals and the processing steps that run in parallel. A lack of memory cannot be compensated (swapping is not advisable here) and will result in invocation of the out of memory killer (OOM killer), thereby terminating running processes. The solution to such a situation is to run fewer parallel Snakemake jobs or to provide more main memory.
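
As a rough rule of thumb, the number of jobs that may safely run in parallel can be estimated from the available main memory and the peak memory per job. The numbers in the following sketch are illustrative assumptions only and have to be replaced by values measured for the actual data:

    # Rough estimate of how many Snakemake jobs can run in parallel without
    # triggering the OOM killer; all numbers are illustrative assumptions.
    TOTAL_RAM_GB = 128     # main memory of the machine
    RESERVED_GB = 8        # head room for the operating system and caches
    MEM_PER_JOB_GB = 4     # assumed peak memory of a single job (measure this!)

    parallel_jobs = (TOTAL_RAM_GB - RESERVED_GB) // MEM_PER_JOB_GB
    print(f"Run at most {parallel_jobs} parallel jobs, "
          f"e.g. via 'snakemake --jobs {parallel_jobs}'.")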

Data storage

Variant calling depends on large amounts of sequencing data that are processed in multiple steps. This easily produces several terabytes of intermediate and final data. For instance, 526 GB of compressed sequencing data (32 fastq.gz files) resulted in 2.8 TB of output data when processed with OVarFlow. Besides storage space, file system latency and throughput are of concern. Slow storage, be it local disk storage or network storage, can slow down the whole analysis as well.
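
Based on this single observation (an output to input ratio of roughly 2.8 TB / 526 GB ≈ 5.3), the storage demand of a new project can be projected very roughly. The ratio is an empirical value from one project and will vary with the data:

    # Very rough projection of OVarFlow's storage demand, based on the single
    # empirical data point given above (526 GB of fastq.gz in, 2.8 TB out).
    OBSERVED_RATIO = 2800 / 526   # output volume divided by compressed input volume (~5.3)

    def estimated_output_gb(input_fastq_gz_gb):
        """Project the expected output volume (GB) from the compressed input volume (GB)."""
        return input_fastq_gz_gb * OBSERVED_RATIO

    print(f"100 GB of fastq.gz input -> roughly {estimated_output_gb(100):.0f} GB of output")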

No single hardware component acts in isolation. Therefore the system has to be considered as a whole and individual bottlenecks have to be identified. Adding more CPU cores won’t help if memory is already scarce.

While high performance computing (HPC) infrastructure is best suited for variant calling, smaller projects can also be realised on a smaller hardware budget. Below a very rough estimation is given, based upon hardware available in 2020. These estimations come without any warranty and are based upon personal experience:

Desktop Computers

Most desktop computers are not suitable for variant calling, as they do not offer the required number of CPU cores. For smaller projects, meaning fewer than 10 individuals or smaller genomes (e.g. Drosophila melanogaster with an approx. 144 Mb genome), a desktop CPU like the AMD Ryzen (TM) 9 5950X or 9 3950X (16 cores / 32 threads) or comparable, combined with at least 64 GB of main memory (better 128 GB), might be suitable.

High End Desktop

In the last couple of years High End Desktop (HEDT) computers have approached the performance previously reserved for server computers. Especially with AMD’s line of Threadripper (TM) processors, up to 64 cores and 128 threads are available within a single CPU (as of the beginning of 2021). When combined with 256 GB of main memory, medium sized projects with a few dozen individuals might be feasible. Storage space will probably be a problem though.

Servers and Clusters

For large scale projects with hundreds of samples, a dedicated infrastructure is definitely required. The usage of a compute cluster, for instance based upon Son of Grid Engine, is the way to go.