Docker & Singularity usage

An alternative way to use OVarFlow is its Docker image and the containers created from it. The Docker image bundles all the software that is needed to execute the OVarFlow workflow, so there is no need to install or download the individual components of OVarFlow or to create a Conda environment.

This simplification of course comes at a cost. First of all, Docker requires system administrator privileges. Also, the bundled software components can't be updated, as would be possible within a Conda environment. Finally, cluster usage (e.g. SGE) is outside the scope of the OVarFlow container: container usage of OVarFlow has been designed for a single larger machine. The need for administrator privileges can be circumvented by the use of Singularity, an alternative container virtualization technology, instead of Docker.

Docker

The Docker images of OVarFlow, from which new containers can be created, are available on Docker Hub. Different versions of the image might be available, each with a distinct tag showing the build date of the image. The docker command can be used to download an image from Docker Hub and make it locally available:

    docker pull ovarflow/release:<tag>

Of course <tag> has to be replaced with the version you want to download.
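
For example, with a hypothetical tag of 2021-06-01 (check Docker Hub for the tags that actually exist):

    # the tag 2021-06-01 is just an illustration; pick a real tag from Docker Hub
    docker pull ovarflow/release:2021-06-01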

After downloading the image, make sure that it is indeed locally available:

    docker images
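
If the download succeeded, the image shows up in that list, roughly like the following (ID, date and size are purely illustrative):

    REPOSITORY         TAG     IMAGE ID       CREATED       SIZE
    ovarflow/release   <tag>   0123456789ab   2 weeks ago   2.5GB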

Now prepare a directory for the workflow, where your sequencing data are made available and where the results of the workflow will be stored. You can name the directory arbitrarily (here project_dir). Three more directories have to be created within this main directory:

    mkdir project_dir
    mkdir project_dir/FASTQ_INPUT_DIR
    mkdir project_dir/REFERENCE_INPUT_DIR
    mkdir project_dir/OLD_GVCF_FILES
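
If your shell supports brace expansion (bash and zsh do), the same directory layout can be created in one line:

    # -p creates parent directories as needed; the braces expand to the three subdirectories
    mkdir -p project_dir/{FASTQ_INPUT_DIR,REFERENCE_INPUT_DIR,OLD_GVCF_FILES}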

The three uppercase directories could also be created by OVarFlow itself, but in any case you have to fill them with the respective content manually, which is the next step to perform.

project_dir/FASTQ_INPUT_DIR

Has to contain your Illumina sequencing files in fastq format. For each individual to be analyzed you have to provide two files: one file (R1) containing the forward reads and one file (R2) containing the reverse reads. If you have multiple sequencing files per individual, merge all forward reads and all reverse reads beforehand (see the sketch below).
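
Since the gzip format allows simple concatenation, merging can be done with cat. A minimal sketch, assuming hypothetical lane-split file names:

    # file names are hypothetical; merge forward and reverse reads separately
    cat individual1_L001_R1.fastq.gz individual1_L002_R1.fastq.gz > individual1_R1.fastq.gz
    cat individual1_L001_R2.fastq.gz individual1_L002_R2.fastq.gz > individual1_R2.fastq.gz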

project_dir/REFERENCE_INPUT_DIR

Has to contain two files: a reference genome in fasta format and a reference annotation in gff format. Files obtained from RefSeq have been used successfully with OVarFlow.
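
For orientation, the directory might then look something like this (the file names are hypothetical; any RefSeq genome and its matching annotation will do):

    ls project_dir/REFERENCE_INPUT_DIR
    GCF_000001234.1_genomic.fna  GCF_000001234.1_genomic.gff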

project_dir/OLD_GVCF_FILES

May contain gvcf files from previous variant callings. This allows the inclusion of individuals that have already been analyzed, so the most time consuming steps (including mapping and variant detection with HaplotypeCaller) don't have to be recomputed. Of course those files must have been created with the same reference genome and annotation as used in the current analysis.

Finally, a csv file called samples_and_read_groups.csv has to be present in the project_dir. This file configures the workflow, telling OVarFlow which files to use, and thereby also serves documentation purposes. A sample of this file can be obtained from OVarFlow's GitLab repository. A detailed description of the file format can be found under Conda & Snakemake usage => The CSV configuration file.

Now that everything is prepared, you can create and execute a Docker container of OVarFlow:

    docker run -it -v /path/to/project_dir:/input ovarflow/release:<tag>

Of course the <tag> has to be replaced with the version you're using. The option -v bind mounts a volume within the container: the directory /path/to/project_dir is made available under the path /input inside the running container.

Now the OVarFlow workflow is already running and no further manual interaction should be required.
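
For long runs it can be convenient to start the container detached rather than interactively, and to follow its output through the Docker log; this relies on standard Docker options only:

    # -d detaches the container, --name gives it a fixed name for later reference
    docker run -d --name ovarflow_run -v /path/to/project_dir:/input ovarflow/release:<tag>
    # follow the workflow output; Ctrl-C only detaches from the log
    docker logs -f ovarflow_run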

Resource utilization

The Docker image has been designed to make good use of the available resources. The number of available CPU cores (or threads, to be more precise) is automatically detected. OVarFlow will then use "available cores - 4" within its Snakemake workflow. For instance, if 32 cores are available, OVarFlow will use 28 of them, via the internal Snakemake command snakemake -p --cores 28 --snakefile /snakemake/Snakefile. Of course OVarFlow allows for the modification of its resource utilization. In this case an additional option has to be passed to the OVarFlow container, forwarding an environment variable:

    docker run -it -e THREADS='<number>' -v /path/to/project_dir:/input ovarflow/release:<tag>

In any case the <number> passed to OVarFlow should not exceed the number of available threads; it is the user's responsibility to ensure this.
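
A simple way to derive a sensible value is from nproc, for instance mirroring OVarFlow's own margin of four threads; a sketch in bash:

    # use all but 4 of the host's threads (assumes the machine has more than 4)
    docker run -it -e THREADS="$(( $(nproc) - 4 ))" -v /path/to/project_dir:/input ovarflow/release:<tag>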

Obtaining the yml file

If you need to know the individual software versions that are used within OVarFlow's Docker container, you can extract that information from the container. To do so you must first open a shell within the container:

    docker run -it -v /path/to/project_dir:/input ovarflow/release:<tag> /bin/bash

Within the running container make the Conda environment available and extract the version information to a yml file:

    conda init bash
    bash
    conda activate OVarFlow
    conda env export > /input/conda_env_OVarFlow.yml
    exit; exit

The above commands perform the following actions: (1) initializes Conda. (2) makes the changes of the previous step available within a newly opened bash shell; as can be seen from the changed prompt ((base) root@...:/#), the Conda base environment is now active. (3) activates the OVarFlow Conda environment; the prompt changes again ((OVarFlow) root@...:/#). (4) exports the OVarFlow environment into a yml file, which is written to /path/to/project_dir outside of the Docker container. (5) logs you out of the two opened bash shells.
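
The same export can also be performed non-interactively in a single docker run invocation. The following is only a sketch: it assumes Conda resides under /opt/conda inside the container, which may differ in the actual image:

    # /opt/conda is an assumed installation path; adjust it if the image differs
    docker run -v /path/to/project_dir:/input ovarflow/release:<tag> /bin/bash -c \
      'source /opt/conda/etc/profile.d/conda.sh && conda activate OVarFlow && conda env export > /input/conda_env_OVarFlow.yml'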

Final note on Docker

Note that every time docker run is invoked, a new container is created from the OVarFlow Docker image. To get an overview of the containers that have already been created, execute docker ps -a. It might be reasonable to delete old containers from time to time with docker rm <container_name>.
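
For example:

    docker ps -a                  # list all containers, running and stopped
    docker rm <container_name>    # remove a single container
    docker container prune        # remove all stopped containers after confirmation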

Singularity

Singularity allows you to do the same tasks as Docker, but without the need for administrator privileges, which makes Singularity a popular choice in high performance scientific computing. Usage of Singularity containers differs slightly from that of Docker images and containers. First of all, create a sif file (Singularity image format) from the Docker image. The data will be retrieved from Docker Hub:

    singularity build OVarFlow_<tag>.sif docker://ovarflow/release:<tag>

This sif file contains the whole OVarFlow workflow including all software dependencies. Now prepare a project_dir as it was done with Docker (see above). The workflow can now be started via:

    singularity run --bind /path/to/project_dir:/input OVarFlow_<tag>.sif
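
Since Singularity runs as an ordinary user process, a long-running workflow can be pushed to the background with standard shell tools, for instance:

    # keep the workflow running after logout and collect its output in a log file
    nohup singularity run --bind /path/to/project_dir:/input OVarFlow_<tag>.sif > ovarflow.log 2>&1 &
    tail -f ovarflow.log    # follow the progress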

Just like with Docker, executing OVarFlow with Singularity will autodetect the number of cores (threads) available on the respective computer, and again the default is "available cores - 4". This setting can be changed by setting an environment variable called THREADS before running the Singularity container:

    export THREADS=<desired_number_of_threads>
    singularity run --bind /path/to/project_dir:/input OVarFlow_<tag>.sif
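
Alternatively, Singularity forwards host variables carrying the SINGULARITYENV_ prefix into the container (where the prefix is stripped), so the same can be written as a one-liner:

    # SINGULARITYENV_THREADS becomes THREADS inside the container
    SINGULARITYENV_THREADS=<desired_number_of_threads> singularity run --bind /path/to/project_dir:/input OVarFlow_<tag>.sif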

Manual start of OVarFlow

Singularity also makes the OVarFlow workflow accessible from the command line, as it easily allows running a shell within the container:

    singularity shell --bind /path/to/project_dir:/input OVarFlow_<tag>.sif

This command bind mounts (--bind) the project directory within the container under the path /input. The user's home directory is automatically available within the container, but the root folder (/) of the host operating system is overlaid by the root of the container. Therefore the bind mount is needed, as no directory outside of the user's home would be available otherwise.

In case a warning bash: warning: setlocale: LC_ALL: cannot change locale (en_US.utf8) appears, the message can be ignored; it won't interfere with the workflow.
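
If you prefer to silence the warning, switching to the portable C locale inside the container shell usually suffices:

    # fall back to the always-available C locale
    export LC_ALL=C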

After opening the shell, you might for instance want to perform a dry run of Snakemake:

    cd /input
    snakemake -np --snakefile /snakemake/Snakefile

Or start the actual workflow, as it would be done with a manual installation of OVarFlow:

    cd /input
    snakemake -p --cores <threads> --snakefile /snakemake/Snakefile

Starting the BQSR-workflow only requires a different Snakefile:

    cd /input
    snakemake -np --snakefile /snakemake/SnakefileBQSR
    snakemake -p --cores <threads> --snakefile /snakemake/SnakefileBQSR

Obtaining the yml file

The exact software versions used in the Singularity container can also be extracted into a yml file. First of all, open a shell within the Singularity container:

    singularity shell OVarFlow_<tag>.sif

The user's home directory is automatically mounted within the running Singularity container, so all data from the home directory are accessible; besides this, the whole content of the container is available. Therefore the OVarFlow Conda environment can be activated and exported. The commands are identical to the ones used with the Docker container:

    conda init bash
    bash
    conda activate OVarFlow
    conda env export > /path/to/project_dir/conda_env_OVarFlow.yml
    exit; exit