Example & Tutorial

This section is supposed to give a realistic example for the usage of OVarFlow. It will show how to set up and analyze a project. To do so whole genome resequencing data of a chicken (Gallus gallus) will serve as sample data. Chicken was chosen for several reasons:

generally it would be an organism within the scope of OVarFlow
the genome is of reasonable length with approx. 1 Gb, reducing the analysis time for the example
a reference genome (GRCg6a) and annotation of reasonable quality are available
Illumina paired end sequencing data are readily available (e.g. PRJNA291174)

The test data set

A script has been deposited in the OVarFlow repository to automatically download a suitable test data set (get_chicken_low_coverage.sh). This script will automatically create the two directories Fastq and Reference within the working directory and download four Illumina paired end data sets as well as a reference sequence and annotation.

Even with a smaller data set, variant analysis of an entire genome takes several days, which will give a reasonable idea of the duration of variant calling. In case that a rather quick example is desired, a tiny test data set can be created from the full data set. The script create_mini_data_set.sh serves this purpose (approx. runtime on a 24 core machine 1 1/4 h). This script has to be executed within the same directory as get_chicken_low_coverage.sh. The script requires samtools, bwa and bgzip. All of those programs are available in the OVarFlow Conda Environment.

OVarFlow execution

Now that a sample data set is available you can begin to set up data evaluation with OVarFlow. This example will make use of the full data set, as downloaded with get_chicken_low_coverage.sh. Also to keep the installation of all applications very easy, this example will make use of the Singularity container of OVarFlow, but the container will be used in a manual fashion. By opening a shell within the container, the Snakefile can be started manually.

Step 1: Obtaining OVarFlow

First of all a reasonably current version of Singularity has to be installed on the respective system (3.5.1 has been tested, most 3.x versions should do). With Singularity installed the OVarFlow container can be obtained from Docker Hub:

1singularity build OVarFlow_<tag>.sif docker://ovarflow/release:<tag>

The <tag> has to be substituted with the most current version of OVarFlow.

Step 2: Creating a project directory

Create a project directory (OVarFlow_Chicken) under a reasonable path on your system and cd into this directory:

mkdir -p /your/path/OVarFlow_Chicken
cd /your/path/OVarFlow_Chicken

Within this project directory you have to create several subdirectories:

mkdir FASTQ_INPUT_DIR
mkdir REFERENCE_INPUT_DIR
mkdir OLD_GVCF_FILES

The fastq files from the test data set - remember where you downloaded those files - have to be available from the FASTQ_INPUT_DIR directory. Therefore those files could be either copied (wastes disk space) or moved to the directory. The creation of links is an alternative, that we will be using here. The same applies for the reference genome and annotation:

for fastq in /path/to/test/data/Fastq/*fastq.gz
do
    ln -s "${fastq}" FASTQ_INPUT_DIR/"${fastq##*/}"
done

ln -s /path/to/test/data/Reference/GCF_000002315.6_GRCg6a_genomic.fna.gz REFERENCE_INPUT_DIR/GCF_000002315.6_GRCg6a_genomic.fna.gz
ln -s /path/to/test/data/Reference/GCF_000002315.6_GRCg6a_genomic.gff.gz REFERENCE_INPUT_DIR/GCF_000002315.6_GRCg6a_genomic.gff.gz

OVarFlow relies on a naming convention, to recognize paired end sequencing files. Forward reads need the suffix _R1.fastq.gz and reverse reads the suffix _R2.fastq.gz. SRR fastq files lack this specific suffix. Rename those files accordingly and make sure that the links are not broken:

cd FASTQ_INPUT_DIR
mv SRR2131198_1.fastq.gz SRR2131198_R1.fastq.gz
mv SRR2131198_2.fastq.gz SRR2131198_R2.fastq.gz
mv SRR2131199_1.fastq.gz SRR2131199_R1.fastq.gz
mv SRR2131199_2.fastq.gz SRR2131199_R2.fastq.gz
mv SRR2131201_1.fastq.gz SRR2131201_R1.fastq.gz
mv SRR2131201_2.fastq.gz SRR2131201_R2.fastq.gz
mv SRR2131202_1.fastq.gz SRR2131202_R1.fastq.gz
mv SRR2131202_2.fastq.gz SRR2131202_R2.fastq.gz
ls -l

Step 3: Adapt the CSV configuration file

With all needed files in place a CSV configuration file has to be created and placed into your data evaluation directory OVarFlow_Chicken. An example file to get you started is also available from the OVarFlow repository (samples_and_read_groups.csv). Download this file and copy it to your project directory:

1cp /path/to/file/samples_and_read_groups.csv /your/path/OVarFlow_Chicken/

This file has to be edited. You have to enter the fastq files that shall be processed, the reference that shall be used and some read group information. To do this, the CSV file can be opened in a text editor of your liking, or a spread sheet application like LibreOffice Calc. The latter is more convenient for inexperienced users, but make sure to save any changes to the file again in CSV format! For this project the modified file should look like this:

Reference Sequence:,GCF_000002315.6_GRCg6a_genomic.fna.gz
Reference Annotation:,GCF_000002315.6_GRCg6a_genomic.gff.gz

Min sequence length:,2000

old gvcf to include:,

forward reads,reverse reads,ID,PL - plattform technology,CN - sequencing center,LB - library name,SM - uniq sample name
SRR2131198_R1.fastq.gz,SRR2131198_R2.fastq.gz,id_98,illumina,ENA,lib_98,SRR98
SRR2131199_R1.fastq.gz,SRR2131199_R2.fastq.gz,id_99,illumina,ENA,lib_99,SRR99
SRR2131201_R1.fastq.gz,SRR2131201_R2.fastq.gz,id_201,illumina,ENA,lib_201,SRR201
SRR2131202_R1.fastq.gz,SRR2131202_R2.fastq.gz,id_202,illumina,ENA,lib_202,SRR202

The Min sequence length has been chosen arbitrarily for this example. In your personal projects choose a value that is reasonable to you. If you don’t want to exclude short contigs set the Min sequence length to 1.

The read group data are needed by some GATK tools. The SM - uniq sample name is probably most important to you. This name will appear within the final annotated results file. Choose a name that identifies the respective individual reasonably well and uniquely. Here a part of the SRR-number was chosen.

Step 4: Activate the OVarFlow Environment

Using the Singularity container saves us from the need to install Conda and setting up a Conda environment. Of course if you lack Singularity you can also manually setup a Conda environment. A shell can easily be opened within the Singularity container. Go to the directory where you issued singularity build thereby creating a .sif container in that directory (see “Step 1”):

singularity shell --bind /path/to/OVarFlow_Chicken:/input OVarFlow_<tag>.sif
bash
cd /input

The first command opens a shell within the container and also bind mounts the project directory within the container under the path /input. The second command opens another bash shell within the container. By doing so the prompt changes to the Conda base environment ((base) user@host:~$). The third command changes into to the project directory. All project files are now available within the container under the path /input.

Step 5: Start the OVarFlow workflow

First of all you should perform a dry run of the workflow, to see if every rule will be executed correctly:

snakemake -np -s /snakemake/Snakefile

If this command succeeds you can start the real data evaluation. The dry run option (-n) has to be removed and a reasonable number of threads to archive parallelization has to be given (--cores <number>):

snakemake -p --cores <number> -s /snakemake/Snakefile

Step 6: Lean back

If everything was set up correctly OVarFlow will now take care of the variant calling.

Resource usage

Within the above example a rather small project was processed. Still, when considering the provided input data, it should be possible to get a rough idea of the resource requirements even for larger projects.

The reference genome

GRCg6a total length (Mb): 1065.37

The input data

8 fastq files (for & rev): 45 Gb
low coverage: 10 - 15 fold
(30 fold is desirable in real projects)

The finished project directory

size of all output data: 217 Gb

01_mapping: 55 Gb
02_sort_gatk: 55 Gb
03_mark_duplicates: 55 Gb
04_haplotypeCaller: 15 Gb
05_gather_samples: 15 Gb
06_combined_calls: 14 Gb
07_genotypeGVCFs: 1022 Mb
08_split_SNPs_Indels: 1035 Mb
09_hard_filtering: 1037 Mb
10_merged_filtered_VCF: 1024 Mb
11_filtered_removed_VCF: 986 Mb
12_annotated_variants: 1334 Mb

Total runtime to compute

computations were performed on a 28 core machine with the following command: snakemake -p --cores 26 -s ./Snakefile

total runtime: 37 h 36 min

As variant calling on eukaryotic genomes is a computationally demanding task, runtimes of days are the norm and can even extend to weeks for larger projects. The actual time to finish for a real project will depend on several factors:

the genome size of the organism
the sequencing depth
the number of individuals to analyze (e.g. the number of fastq files)
the degree of variability of the organism (more variants mean longer runtimes)
the given computing resources

For the above example the following hardware was used:

Architecture:	x86_64
CPU(s):	28
Thread(s) per core:	1
Model name:	Intel Core Processor (Broadwell)
CPU MHz:	2593.906
Hypervisor vendor:	KVM
Virtualization type:	full
Flags:	… avx … (this is important)
Main memory:	64 Gb