Conda & Snakemake usage

Setting up a Conda environment and manually executing the Snakefile is the recommended way to use OVarFlow. This method gives you maximum control over the whole process. Also, if desired, you can determine exactly which versions of the respective programs are used.

What is needed

Before you can begin with variant calling using OVarFlow, some prerequisites have to be fulfilled. In particular, certain files have to be provided. Usually supplied externally:

  • a reference genome in fasta format (the higher the quality the better) and

  • a reference annotation in gff format (same as above).

Usually provided by you:

  • Illumina sequencing files in fastq format of various individuals and

  • (optionally) GVCF files of previous variant calling workflows on the same reference genome.

The optional incorporation of previously created GVCF files allows you to include variants you determined earlier, saving you from recomputing those variants.

Furthermore you will need to provide:

  • a specific directory structure containing the previously mentioned files,

  • a CSV file for configuration,

  • the Snakefile and its Python scripts and

  • a Conda environment containing the needed applications.

This section will teach you how to set everything up. It is assumed that you have a working Conda installation; if not, have a look at the “Setup & Preparations” section. In the end, the following directory structure will be created:

/path/to/project_dir/
/path/to/project_dir/conda_env/
/path/to/project_dir/variant_calling/
/path/to/project_dir/variant_calling/FASTQ_INPUT_DIR/
/path/to/project_dir/variant_calling/REFERENCE_INPUT_DIR/
/path/to/project_dir/variant_calling/OLD_GVCF_FILES/
/path/to/project_dir/variant_calling/Snakefile
/path/to/project_dir/variant_calling/scripts/average_coverage.awk
/path/to/project_dir/variant_calling/scripts/createIntervalLists.py
/path/to/project_dir/variant_calling/samples_and_read_groups.csv
/path/to/project_dir/variant_calling/config.yaml

Creating a Conda environment

Most of the time Conda environments reside in the home directory of the respective user, which is fine on single-user systems. In larger computational environments it is advisable to create project-specific Conda environments that reside next to your data.

conda create --prefix /path/to/project_dir/conda_env

The above command will create a so far empty Conda environment that is named conda_env and resides under the specified path. The next step is to install all applications required by OVarFlow. For this you can use a yml file listing all direct dependencies of OVarFlow.

conda env update --prefix /path/to/project_dir/conda_env \
                 --file OVarFlow_dependencies_mini.yml
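
The dependency file ships with OVarFlow's GitLab repository, but pinning exact program versions follows the usual Conda environment file format. The sketch below is purely illustrative — the channels, package names and version numbers are assumptions, not the actual contents of OVarFlow_dependencies_mini.yml:

channels:
  - conda-forge
  - bioconda
dependencies:
  # pinning a version is optional; omit "=<version>" to get the latest build
  - snakemake-minimal=6.8.0   # workflow engine
  - bwa=0.7.17                # read mapping
  - gatk4=4.2.0.0             # variant calling (HaplotypeCaller etc.)
  - fastqc=0.11.9             # read quality control
  - snpeff=5.0                # functional variant annotation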

Now you can activate the Conda environment, so that the installed applications are available in your $PATH variable.

conda activate /path/to/project_dir/conda_env

If the command was successful, the path of the Conda environment will be shown in parentheses at the beginning of your prompt.
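
For example (user and host names are of course placeholders):

(/path/to/project_dir/conda_env) user@host:~$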

Preparing OVarFlow

With the Conda environment active, you now have access to all needed applications. Let's create a directory for the actual variant calling:

mkdir /path/to/project_dir/variant_calling

Now the following files from OVarFlow’s GitLab repository have to be placed within this directory as follows:

/path/to/project_dir/variant_calling/Snakefile
/path/to/project_dir/variant_calling/scripts/createIntervalLists.py
/path/to/project_dir/variant_calling/scripts/average_coverage.awk
/path/to/project_dir/variant_calling/samples_and_read_groups.csv
/path/to/project_dir/variant_calling/config.yaml

The Python script (createIntervalLists.py) must be executable (file permission x):

cd /path/to/project_dir/variant_calling/scripts/
chmod ug+x createIntervalLists.py
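
You can verify the permission with ls -l; the output should show the x bits for user and group. Depending on the previous permissions, it could look like this (size and date will of course differ):

ls -l createIntervalLists.py
-rwxr-x--- 1 user group 4242 Jan  1 12:00 createIntervalLists.py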

Finally, three directories have to be created that will harbor the reference genome and annotation, the Illumina fastq sequencing files, and GVCF files from previous variant callings. These three directories can either be created manually, or the OVarFlow Snakefile will create them. Letting OVarFlow create the directories prevents any spelling mistakes:

cd /path/to/project_dir/variant_calling/
snakemake -np

The workflow will report the creation of those directories and exit on its own. Now three additional directories have been created:

/path/to/project_dir/variant_calling/FASTQ_INPUT_DIR/
/path/to/project_dir/variant_calling/REFERENCE_INPUT_DIR/
/path/to/project_dir/variant_calling/OLD_GVCF_FILES/

The respective files have to be placed into the corresponding directories. It has to be mentioned that OVarFlow has only been tested with a single pair of forward and reverse fastq files per individual. So in case your reads are derived from various lanes or sequencing runs, you should combine all forward reads into a single R1 file and all reverse reads into a single R2 file, as shown below.
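
Since the fastq files are gzip compressed and gzip streams can simply be joined, combining reads from several lanes is a plain concatenation. A sketch with hypothetical file names (note that the resulting files must end in _R1.fastq.gz and _R2.fastq.gz):

cat cow.purple_L001_R1.fastq.gz cow.purple_L002_R1.fastq.gz > FASTQ_INPUT_DIR/cow.purple_R1.fastq.gz
cat cow.purple_L001_R2.fastq.gz cow.purple_L002_R2.fastq.gz > FASTQ_INPUT_DIR/cow.purple_R2.fastq.gz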

The CSV configuration file

The last mandatory step before the actual variant calling is the preparation of a configuration file:

/path/to/project_dir/variant_calling/samples_and_read_groups.csv

A template of this file is also available within OVarFlow’s GitLab repository. Besides its purpose as a configuration file, it also serves as documentation of the data evaluation. This CSV (comma-separated values) file has to list the files that were placed in the previously created directories. Furthermore, it contains the read group information for each pair of sequencing files. Short contigs can also be excluded from the reference genome: lower-quality genomes usually are more fragmented and can contain thousands of small contigs besides some larger ones. If you're not interested in the small contigs, you can exclude them, thereby accelerating the data evaluation. An excerpt of the CSV file might look like this:

Reference Sequence:,cow_ref.fa.gz
Reference Annotation:,cow_ref.gff.gz
Min sequence length:,800
old gvcf to include:,previous_1.gvcf.gz,previous_2.gvcf.gz
forward reads,reverse reads,ID,more read group data
cow.purple_R1.fastq.gz,cow.purple_R2.fastq.gz,id_1,values …

The labels of the CSV file (e.g. Reference Sequence: or forward reads) are static and must be kept as they are; the other fields have to be filled in by the user of OVarFlow. Of course, if you don't have any GVCF files from previous data evaluations, those fields remain empty, but the heading itself still has to be present. If you don't want to exclude any short sequences, you can set the minimum sequence length to 1, meaning that every contig only has to contain at least 1 base. Due to space limitations, not all read group fields are shown in this excerpt.

To fill in your own values into the CSV file, you can use a text editor of your liking (UTF-8 encoding and Unix line breaks should be supported). Alternatively, you can use a spreadsheet program; LibreOffice Calc has been tested for this purpose and works just fine, offering the advantage of a more human-readable formatting.

What are read groups?

The previous section mentioned read groups. This section is for those who are not familiar with the term. The GATK website offers a detailed description. Basically, a read group is the set of reads that was produced in a sequencing experiment. In practical use of OVarFlow it's best to think of it as the forward and reverse reads of a single individual.

The read group data are metadata of the sequencing experiment. They will be used in the mapping of the Illumina reads and become incorporated within the bam files. This information will also be used as a heading in the final vcf file to identify the columns of the respective individuals. Therefore, an especially meaningful and unique name has to be chosen for the SM - unique sample name field. The required read group data fields are:

  • forward reads: those files have to end with _R1.fastq.gz

  • reverse reads: those files have to end with _R2.fastq.gz

  • ID: a unique ID of your liking (e.g. ID_sampleName)

  • PL - platform technology: OVarFlow has been tested and designed to be used with Illumina reads, so the value is Illumina

  • CN - sequencing center: an ID for the sequencing center that generated the reads

  • LB - library name: a unique ID of your liking (e.g. lib_sampleName)

  • SM - unique sample name: choose a short, unique name for the sample
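
Following the order of the fields above, a complete sample line in the CSV file might then look like this (all values are hypothetical; the exact column layout is given by the template file from the repository):

cow.purple_R1.fastq.gz,cow.purple_R2.fastq.gz,ID_cowPurple,Illumina,SeqCenterXY,lib_cowPurple,cowPurple

During mapping, these values end up as an @RG line in the bam header, e.g. @RG ID:ID_cowPurple PL:Illumina CN:SeqCenterXY LB:lib_cowPurple SM:cowPurple.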

The YAML configuration file

Optionally, further fine-tuning of the workflow is possible through a final configuration file:

/path/to/project_dir/variant_calling/config.yaml

Many applications used within the workflow are based upon Java. Resource usage of the Java virtual machine (JVM) is strongly influenced by the underlying hardware. Unfortunately, default values for the Java heap size and the number of parallel garbage collection threads are not always optimal. While reasonable default values are already defined within the Snakefile, the yaml file allows for modifications in case a certain data set requires unique settings.

The degree of parallelization of the workflow can also be modified. The default settings will let bwa use six threads for each mapping and HaplotypeCaller operate on four intervals in parallel. Depending on the structure of the reference genome, the exact number of intervals that are evaluated in parallel cannot be guaranteed, as a given genome is only split between full contigs. Individual contigs won't be split.

Generally this file doesn't need to be present, but it enables adjustments if special circumstances create a demand to do so.
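
For illustration only, such a configuration file could look like the sketch below. The option names are placeholders, not the real interface; the actual keys are defined in the Snakefile and the template config.yaml from the repository:

# placeholder keys -- consult the template config.yaml for the real names
JAVA_HEAP_SIZE: "4g"   # maximum JVM heap per Java-based process
GC_THREADS: 2          # parallel garbage collection threads of the JVM
BWA_THREADS: 6         # threads used by each bwa mapping
HC_INTERVALS: 4        # intervals processed in parallel by HaplotypeCaller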

Starting the workflow

Now that your Conda environment is active and all files are in place, it's time to start the actual variant calling workflow of OVarFlow. First of all, change into the variant_calling directory where the Snakefile resides:

cd /path/to/project_dir/variant_calling/

OVarFlow also allows for the functional annotation of the detected variants. This is done in the last step of the workflow, deploying snpEff as the tool for this task. As this is the last step of a long process, it is especially annoying if it fails. From personal experience, snpEff is not able to make use of every genome annotation; gff annotations available from RefSeq have proven to be reliable. Still, it is recommended to test whether snpEff can make use of the provided reference annotation: if it fails, it is better to fail early. First perform a dry run (snakemake -np) and then the actual creation of the snpEff database:

snakemake -np create_snpEff_db
snakemake -p create_snpEff_db

Actual creation of the database might take a considerable amount of time. As OVarFlow is based upon Snakemake, it will detect if a database is already available and won't recompute it during the workflow. In case anything failed, have a look at the log files:

less -SN /path/to/project_dir/variant_calling/logs/snpEffDB/Huhn_2Mio/report_stdout.log
less -SN /path/to/project_dir/variant_calling/logs/snpEffDB/Huhn_2Mio/report_stderr.log

Now that the annotation has proven to be usable, a dry run of the complete workflow is advisable. This won't perform any actual work, but will print all the steps that have to be executed during the variant calling workflow:

snakemake -np

Finally, the actual variant calling workflow can be started. Depending on your given hardware resources, you will probably want to parallelize the whole process. This can easily be done by providing a command line option to Snakemake. The --cores <number> switch will advise Snakemake to use the given number of threads/cores (alternatively, the number of parallel jobs can be specified via --jobs <number>). This allows OVarFlow to operate on several files or several genomic intervals in parallel, thereby accelerating the computation and shortening the required time.

snakemake -p --cores <number>

The rest is automatically handled by OVarFlow. Depending on the provided computational resources and the number as well as size of the sequencing files, computation might take several days or even some weeks. During the process, several new directories will be created, storing intermediate and final results:

00_FastQC
01_mapping
02_sort_gatk
03_mark_duplicates
04_haplotypeCaller
05_gathered_samples
06_combined_calls
07_genotypeGVCFs
08_split_SNPs_Indels
09_hard_filtering
10_merged_filtered_VCF
11_filtered_removed_VCF
12_annotated_variants
benchmarks
interval_lists
logs
processed_reference
snpEffDB

Alternative targets

In the above usage, the default target rule of the Snakemake workflow is applied. To be even more versatile, some alternative target rules are available that execute only a subset of the entire workflow. This may save computation time by skipping unnecessary calculations. It is still possible to rerun the entire workflow afterwards, as Snakemake will recognize results that have already been calculated and resume the workflow from there. The following alternative target rules are available:

noSnpEff

In case no functional annotation of the detected variants is desired, an alternative target called noSnpEff is available. By specifying this as the last option of the workflow invocation, everything will be executed as before, except for the functional annotation of the variants.

snakemake -p --cores <number> noSnpEff

Without executing snpEff, there is also no need to specify a reference annotation within the CSV configuration file. Still, the respective line has to be present in the file, but no value has to be stated, e.g.:

Reference Sequence:,SampleSeq.fa
Reference Annotation:,
...

variantsPerSample

Note: the target variantsPerSample is not available in OVarFlow 2.

This alternative target rule executes the workflow up to the variant calling of each individual sample. Thereby, every directory up to and including 05_gathered_samples will be created.

snakemake -p --cores <number> variantsPerSample

dedubBAM

Finally, the workflow can also be utilized to perform just the mapping, including sorting and marking of duplicated reads. In this process, every directory up to and including 03_mark_duplicates will be created. Statistics of the average coverage will also be calculated.

snakemake -p --cores <number> dedubBAM