A Primer into the technologies

For novice users, with only a basic understanding of bioinformatics, this section is supposed to serve as a very brief introduction into the technologies that are used in OVarFlow. In this sense the following paragraphs are more of a technical reference. As OVarFlow makes heavy use of the technologies outlined below, this will help novice users with the terminology used in further sections.

Python 3

OVarFlow makes is build using the Python 3 programming language. Python is a scripting language, meaning that the source code of the respective program is interpreted by a special run-time environment which doesn’t need to be compiled first. To be able to execute a Python program the Python interpreter (usually CPython - the reference implementation of the interpreter) as well as its accompanying modules have to be installed.

Snakemake

Snakemake is a workflow management system (WFMS). WFMS allow for the creation of defined sequences of tasks a computer can execute. Practically this means the automatization of the successive or even parallel execution of various programs. Thereby the main usage of Snakemake is the creation of automatic, reproducible data analysis workflows.

Snakemake itself is written in the Python programming language and the Python syntax, with some specific extensions, is used to write workflows in Snakemake. Therefore a basic understanding of Python is required to create a workflow using Snakemake. The file containing the workflow is called a Snakefile.

Conda & Bioconda

Many bioinformatics tasks revolve about the well orchestrated execution of a plethora of command-line tools. WFMS like Snakemake can automate such processes. On the other hand the individual software has to be obtained and installed on the executing system. Package management system simplify this task tremendously, often removing the need to compile software packages from source code.

Conda is an open source package management system, that find broad application in data sciences. Beyond that it is also an environment management system, that allows for the independent installation of several versions of a software, without causing dependency conflicts. Even though some bioinformatics tools can be found via Conda it is not specialized in bioinformatics use cases. This cap is bridged by Bioconda, which is a so called channel for Conda, targeted a bioinformatics tools. Basically the amount of available tools is increased by adding the Bioconda channel to the Conda package manager.

GATK & GATK Best Practices

GATK is the commonly used abbreviation for the Genome Analysis Toolkit. This collection of command-line tools is focused on the identification of genomics variants in high throughput sequencing data. It bundles more than 200 individual command-line tools. GATK is actively developed at the Broad Institute with the current major version being GATK 4 (in 2020). As opposed to previous versions GATK 4 is open-source and freely available under a BSD 3-clause license.

Some common use cases of GATK are described within the so-called “GATK Best Practices”. Those descriptions try to give an overview of tasks and workflows that are widespread in variant calling. As the tools are further developed, the “GATK Best Practices” are also subject to modifications.

Docker & Singularity

Like OVarFlow many modern software products are composed of many individual pieces, whose well orchestrated interaction is mandatory for the final product to work. Deployment of such complex software applications can be complex. OS-level virtualization, also referred to as container virtualization, tries to simplify software deployment by bundling all individual components of an application in a single container. Container virtualization finds wide usage among software developers and server administrators but is less focused on end users. Most container virtualization techniques have been designed around Linux and are supposed to be used with this operating system (still ports to Windows and macOS do exist).

Docker is probably the most well known container technology. Despite its wide use, it has some considerable drawbacks, when used in multiuser computer environments. Most problematic is the fact, that Docker containers need to be executed with system administrator privileges. This renders Docker unusable in many multiuser computer environments. This drawback of Docker is circumvented by Singularity. Singularity, just like Docker, packages individual software components into a single container. Furthermore it can make use of Docker containers, but does not require system administrator privileges. Therefore Singularity finds broad use in academic high performance computing, where multiple users need to access a single system.