Capabilities of OVarFlow

Variant calling, with all its distinct data evaluation steps, can be a daunting task. OVarFlow tries to wrap as much of this complexity as possible, thereby automating this intricate process to the greatest possible degree. The usage of GATK, with its hundreds of individual tools, is especially challenging for novice users. Not only the number of tools is challenging, but also their individual usage, which comes with some very subtle obstacles, e.g.:

  • High peak loads caused by Java garbage collection, depending on the number of available CPU cores.

  • Considerably extended computation times, depending on the instruction set offered by the CPU.

All of these obstacles have been taken into consideration and are handled by OVarFlow. OVarFlow does not only encapsulate the intrinsic complexity of variant calling and GATK, however. It has also been extended to include features that GATK does not directly possess.
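The garbage collection obstacle listed above can be illustrated with a small sketch. By default the JVM may scale its number of parallel garbage collection threads with the number of CPU cores, so capping it explicitly is one common way to avoid load peaks. The helper name `gatk_command` and the values for `gc_threads` and `max_heap_gb` are assumptions chosen for illustration. They are not taken from OVarFlow, whose actual settings may differ.

```python
import shlex

def gatk_command(tool, tool_args, gc_threads=2, max_heap_gb=4):
    """Build a GATK 4 command line with an explicit cap on Java GC threads.

    gc_threads and max_heap_gb are illustrative defaults (assumptions),
    not values prescribed by OVarFlow.
    """
    java_options = f"-Xmx{max_heap_gb}g -XX:ParallelGCThreads={gc_threads}"
    # The gatk wrapper script forwards --java-options to the JVM.
    return ["gatk", "--java-options", java_options, tool, *tool_args]

# Hypothetical file names, used only to show the resulting command line.
cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "reference.fa", "-I", "sample.bam", "-O", "sample.g.vcf.gz", "-ERC", "GVCF"],
)
print(shlex.join(cmd))
```

Building the command as a list keeps it directly usable with `subprocess.run`, while `shlex.join` gives a copy-and-paste form for the shell.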

Some highlights of OVarFlow

Massive parallelization

OVarFlow offers not only a high degree of parallelization, but also the ability to fine-tune the desired degree of parallelization. The parallelization available in version 3 of the GATK HaplotypeCaller was abandoned in GATK 4, leaving Apache Spark as the only option. With OVarFlow, the GATK 4 HaplotypeCaller can be operated in parallel on various genomic intervals, thereby accelerating the most time-consuming step of variant discovery.
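The idea of interval-based parallelization can be sketched as follows. This is a minimal illustration and not OVarFlow's actual splitting strategy, which is not described here. The function names, the fixed `chunk_size`, and the greedy balancing are all assumptions. The interval strings use the `contig:start-end` notation that GATK accepts for its `-L` argument.

```python
def split_into_intervals(contig_lengths, chunk_size):
    """Yield (contig, start, end) intervals, 1-based and inclusive,
    each covering at most chunk_size bases."""
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            yield contig, start, min(start + chunk_size - 1, length)

def group_intervals(intervals, n_groups):
    """Distribute intervals over n_groups lists of roughly equal total size.

    Greedy heuristic (an assumption for this sketch): the largest intervals
    are placed first, each into the currently least-loaded group.
    """
    groups = [[] for _ in range(n_groups)]
    loads = [0] * n_groups
    by_size = sorted(intervals, key=lambda iv: iv[2] - iv[1], reverse=True)
    for contig, start, end in by_size:
        i = loads.index(min(loads))
        groups[i].append(f"{contig}:{start}-{end}")
        loads[i] += end - start + 1
    return groups

# Hypothetical contig lengths, for illustration only.
contigs = {"chr1": 2500, "chr2": 1000}
groups = group_intervals(split_into_intervals(contigs, chunk_size=1000), n_groups=2)
for number, group in enumerate(groups, start=1):
    print(number, group)
```

Each group could then be handed to its own HaplotypeCaller process, with one `-L` argument per interval, and the per-group results merged afterwards.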

Inclusion of already available variant calls

Previously generated variant calls (VCF files) can easily be incorporated into new data evaluations. This makes it easy to add new individuals to running studies without the need to recalculate all samples.

Exclusion of small genomic contigs

Many genomes contain small contigs of, e.g., 1000 bp or even less. Often those tiny contigs are of no further interest. OVarFlow lets the user decide whether or not to include those tiny contigs in the analysis. Furthermore, the size threshold below which contigs are excluded can be chosen by the user.
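A size filter of this kind can be sketched using a FASTA index (`.fai`, as written by `samtools faidx`), whose first two tab-separated columns hold the contig name and its length. The function name `contigs_above_threshold`, the inclusive comparison, and the example index content are assumptions for illustration. How OVarFlow implements the threshold internally is not covered here.

```python
import io

def contigs_above_threshold(fai_handle, min_length):
    """Return the names of all contigs that are at least min_length bp long.

    fai_handle is any iterable of lines in FASTA index (.fai) format.
    Treating the threshold as inclusive is an assumption of this sketch.
    """
    kept = []
    for line in fai_handle:
        if not line.strip():
            continue  # skip empty lines
        name, length = line.split("\t")[:2]
        if int(length) >= min_length:
            kept.append(name)
    return kept

# Made-up index content, for illustration only.
example_fai = io.StringIO(
    "chr1\t50000\t6\t60\t61\n"
    "scaffold_17\t800\t50900\t60\t61\n"
    "scaffold_42\t2000\t51750\t60\t61\n"
)
print(contigs_above_threshold(example_fai, min_length=1000))
```

With a real genome, the same function could be called on an opened `.fai` file instead of the in-memory example.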

Functional variant annotation

Functional annotation of the detected variants is also automated by incorporating the annotation program SnpEff into the workflow.

Easy application installation

Variant calling depends upon a vast set of software. Through the use of Conda environments, installation of all required applications is essentially reduced to a single command. Alternatively, a single pre-built Docker container bundles all the required software packages.

The two phases of OVarFlow

OVarFlow is a variant calling workflow that possesses two separate phases.

The basic variant calling workflow

The first workflow is mandatory. It is designed to be as basic as possible and to be the shortest way to deliver annotated variants. Minimal prior knowledge is therefore required: only a reference genome and its annotation are needed.

The extended BQSR workflow

The second workflow is optional and builds upon the basic variant calling workflow. It uses previously called variants to perform base quality score recalibration (BQSR) and further improve the variant calling results.
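The principle of BQSR with previously called variants can be sketched as two GATK 4 invocations: `BaseRecalibrator` builds a recalibration table from the known variant sites, and `ApplyBQSR` applies that table to the reads. The helper name `bqsr_commands` and all file names are hypothetical. The exact commands and options used inside OVarFlow may differ.

```python
def bqsr_commands(reference, bam, known_sites, recal_table, output_bam):
    """Return the two GATK 4 command lines of one basic BQSR round."""
    base_recalibrator = [
        "gatk", "BaseRecalibrator",
        "-R", reference,
        "-I", bam,
        "--known-sites", known_sites,  # e.g. variants from the first workflow
        "-O", recal_table,
    ]
    apply_bqsr = [
        "gatk", "ApplyBQSR",
        "-R", reference,
        "-I", bam,
        "--bqsr-recal-file", recal_table,
        "-O", output_bam,
    ]
    return base_recalibrator, apply_bqsr

# Hypothetical file names, for illustration only.
for command in bqsr_commands(
    "reference.fa", "sample.bam", "first_round_variants.vcf.gz",
    "sample.recal.table", "sample.recal.bam",
):
    print(" ".join(command))
```

The recalibrated BAM file would then serve as input for a second round of variant calling.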

The primary goal of OVarFlow

Finally, the main goal of OVarFlow is documentation and reproducibility of variant calling, which is achieved through three components:

  • OVarFlow as a workflow itself.

  • Easy documentation of the program versions used, via Conda environments and yml files.

  • A CSV file to document the respective variant calling and all the input data used in it.