Welcome to OVarFlow’s documentation!

OVarFlow is an open source workflow for variant discovery of SNVs (single nucleotide variants) and indels (insertions and deletions). With today’s high-throughput sequencing technologies and continuously declining sequencing costs, variant discovery in whole-genome resequencing data is not only more affordable but also more in demand than ever. Hence a broader audience now needs easy and reliable variant calling. Consequently, OVarFlow was created with three major goals:

  • automation,

  • documentation and

  • reproducibility.

To achieve those goals, OVarFlow is built upon several technologies that are proven and widely used in bioinformatics: Snakemake as a workflow management system, Conda as an environment manager and software repository, and GATK as a variant discovery toolkit.

Target audience

Variant calling is not a novel task. GATK in particular provides not only the tools for variant discovery but also its well-known Best Practices Workflows. A downside of those guidelines is their focus on human sequencing data, humans probably being the best-studied model organism. For less well-studied organisms, workflows often have to diverge significantly. OVarFlow steps into this gap and provides variant discovery for non-model organisms as well, in the fields of:

  • biological basic research,

  • animal breeding and

  • plant breeding.

For the latter, only diploid organisms have been tested. Tetraploid organisms might require adaptation of the workflow (especially GATK HaplotypeCaller).

Premises to use OVarFlow

To use OVarFlow, some requirements must be fulfilled first. Obviously, whole-genome resequencing data of the respective organism must be available. Like the GATK Best Practices Workflows, OVarFlow is designed to be used with Illumina short read sequencing data. Furthermore, a reference genome sequence is required, as is a reference annotation if functional annotation is desired.

To analyze those data, some more technical requirements must also be considered. First of all, access to a Linux-based computing infrastructure of sufficient size is needed. What counts as sufficient depends, of course, on the size of your data set. Some hints are given in the “Resource requirements” and “Hardware recommendations” sections.

Finally, on the human side, some proficiency with the Unix/Linux command line is required. Apart from this, no prior knowledge of any of the technologies used (Snakemake, Conda or GATK) is expected, although it is helpful. This documentation tries to be as comprehensive as possible, introducing the technologies as needed and linking to further resources to get you started. For users with prior experience in the mentioned technologies, the “Quick reference for OVarFlow” might already be everything needed to get started. Those users might be able to set up variant calling in 30 minutes. The rest is handled by OVarFlow.

Motivation behind OVarFlow

Some of the motivation behind the creation of OVarFlow should already be obvious from the previous paragraphs. Still, some might wonder why all the effort is needed when the GATK Best Practices are not only described in detail but are also commonly referred to in the methods sections of various papers (e.g. PMID: 29395925, PMID: 24824529, PMID: 30952207). In practice, many methods sections mentioning GATK are rather superficial (e.g. PMID: 31246983, PMID: 31900978), sometimes not even naming the GATK subtools used in the analysis. The initiators of GATK noted this as well, writing:

6. What is not GATK Best Practices?

Lots of workflows that people call GATK Best Practices diverge significantly from our recommendations. […] However, any workflow that has been significantly adapted or customized, whether for performance reasons or to fit a use case that we do not explicitly cover, should not be called “GATK Best Practices”, which is a term that carries specific meaning.

Source: About the GATK Best Practices (date of accession: May 7th 2020).

Another problem is that the GATK Best Practices evolve over time, ultimately rendering global references to them (like http://www.broadinstitute.org/gatk/guide/best-practices) useless. Reproducibility of the exact data evaluation workflow is thereby lost. Irreproducible research even led to the coining of the phrase “replication crisis”, which describes an ongoing problem in science, one that even major science publishers like Nature (Special: Challenges in irreproducible research - 2018) are increasingly aware of.

Therefore the main motivation behind OVarFlow is to achieve exact documentation and reproducibility of data evaluation. It is the kind of openness that science should offer!

OVarFlow achieves this goal through four key components:

  • the OVarFlow Snakefile and workflow itself,

  • the documentation of Conda environments in a yml file,

  • documentation of the analyzed dataset in a CSV file and

  • documentation of non-default workflow settings in a yml file.

This results in a maximum of documentation and reproducibility of the data analysis. In addition, providing those four files eases the writing of any methods section. Users of OVarFlow are also encouraged not only to use OVarFlow but also to adapt it to their specific needs and then to republish their modified workflow.
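As a minimal sketch of how those four files could be pinned down for a methods section or a supplement, the following Python snippet records a SHA-256 checksum for each of them. It is not part of OVarFlow. The file names used here (Snakefile, environment.yml, samples.csv, config.yml) and the helper functions are assumptions made for illustration only; substitute the names actually used in your project.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_manifest(paths):
    """Map each file name to its checksum, or to "MISSING" if absent."""
    manifest = {}
    for path in map(Path, paths):
        manifest[path.name] = sha256_of(path) if path.is_file() else "MISSING"
    return manifest


if __name__ == "__main__":
    # Hypothetical file names, not prescribed by OVarFlow.
    files = ["Snakefile", "environment.yml", "samples.csv", "config.yml"]
    for name, checksum in provenance_manifest(files).items():
        print(f"{checksum}  {name}")
```

Publishing such checksums alongside the files themselves lets readers verify that they are looking at exactly the workflow, environment, sample sheet and settings that produced the reported results.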

With that being said, good luck with your variant discovery project. We hope that the following documentation will turn out to be useful in your work!

GATK pitfalls