Quick reference for OVarFlow

The extensive documentation of OVarFlow might seem daunting, illustrating the complexity of variant calling. Besides the inherent complexity of the task, the documentation tries to be as comprehensive as possible to assist novice users. Advanced users who already have a working Conda environment, on the other hand, can set up the variant calling workflow in probably less than half an hour. A task that might take days to weeks is then automated by OVarFlow. This quick reference is for those advanced users who want to quickly set up a new project.

  1. Create a project directory (project_dir):

    mkdir -p /path/to/project_dir/
    
  2. Create a Conda environment (conda_env) for your project (or use one that is already available for variant calling) and activate this environment:

    conda create --prefix /path/to/project_dir/conda_env
    conda env update --prefix /path/to/project_dir/conda_env \
                     --file OVarFlow_dependencies_mini.yml
    conda activate /path/to/project_dir/conda_env
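
    A quick sanity check of the freshly created environment, assuming that tools such as Snakemake, GATK and snpEff are among the pinned dependencies (adjust the pattern to whatever your yml file actually lists):

    # show the versions that Conda resolved for a few central tools
    conda list | grep -i -E 'snakemake|gatk|snpeff'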
    
  3. You need to create a directory structure and put some files from OVarFlow’s GitLab repository into place:

    /path/to/project_dir/
    /path/to/project_dir/conda_env/
    /path/to/project_dir/variant_calling/
    /path/to/project_dir/variant_calling/FASTQ_INPUT_DIR/
    /path/to/project_dir/variant_calling/REFERENCE_INPUT_DIR/
    /path/to/project_dir/variant_calling/OLD_GVCF_FILES/
    /path/to/project_dir/variant_calling/Snakefile
    /path/to/project_dir/variant_calling/scripts/average_coverage.awk
    /path/to/project_dir/variant_calling/scripts/createIntervalLists.py
    /path/to/project_dir/variant_calling/samples_and_read_groups.csv
    /path/to/project_dir/variant_calling/config.yaml # optional
    

    Some of the files can be created through the OVarFlow Snakefile, to avoid typos:

    cd /path/to/project_dir/variant_calling/
    snakemake -np
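
    If you prefer to create the input directories and copy the workflow files by hand, a minimal sketch could look like the following. The clone location /path/to/OVarFlow_clone and the source paths inside it are hypothetical placeholders; adjust them to the actual layout of the GitLab repository:

    # create the input directories expected by the workflow
    mkdir -p /path/to/project_dir/variant_calling/{FASTQ_INPUT_DIR,REFERENCE_INPUT_DIR,OLD_GVCF_FILES,scripts}
    # copy the workflow files from a local copy of the repository
    cp /path/to/OVarFlow_clone/Snakefile /path/to/project_dir/variant_calling/
    cp /path/to/OVarFlow_clone/scripts/average_coverage.awk \
       /path/to/OVarFlow_clone/scripts/createIntervalLists.py \
       /path/to/project_dir/variant_calling/scripts/
    cp /path/to/OVarFlow_clone/samples_and_read_groups.csv /path/to/project_dir/variant_calling/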
    
  4. Place your reference and sequencing files into the appropriate directories.
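
    For example, with a gzip-compressed reference genome and paired-end reads (all file names below are purely illustrative placeholders):

    # hypothetical input files; substitute your own
    cp /data/reference/my_species.fa.gz /path/to/project_dir/variant_calling/REFERENCE_INPUT_DIR/
    cp /data/reads/sample1_R1.fastq.gz /data/reads/sample1_R2.fastq.gz \
       /path/to/project_dir/variant_calling/FASTQ_INPUT_DIR/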

  5. The configuration file samples_and_read_groups.csv has to be adapted to your specific project. Modify that file accordingly. It will also serve as a reference for your settings.
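
    To get a readable view of the current entries before editing, a generic shell one-liner (not part of OVarFlow itself) can help:

    # render the CSV as aligned columns for easier reading
    column -t -s ',' /path/to/project_dir/variant_calling/samples_and_read_groups.csv | less -S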

  6. An additional, optional configuration file config.yaml allows fine-tuning of Java resource usage and of the degree of parallelization of the data evaluation.

  7. It is optional but advisable to first test whether the annotation can be processed by snpEff, to prevent a late-stage failure:

    snakemake -p --cores <number_of_desired_threads> create_snpEff_db
    
  8. You can start the variant calling now:

    cd /path/to/project_dir/variant_calling/
    snakemake -p --cores <number_of_desired_threads>
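
    Since a full run can keep the machine busy for a long time, you might want to detach it from your terminal session, e.g. with nohup (screen or tmux work just as well; the log file name is arbitrary):

    # run the workflow in the background and keep a log of its output
    nohup snakemake -p --cores <number_of_desired_threads> > ovarflow_run.log 2>&1 &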
    

That’s already everything needed to start your variant calling. Depending on the size of your data set and the available computing resources, the rest of the process might take even weeks, but OVarFlow will take care of it while you can continue working on other projects.

Finally, you might want to document the exact software versions that were used in the data evaluation. Just extract that information from your Conda environment:

conda activate /path/to/project_dir/conda_env
conda env export > conda_environment.yml

Adding the BQSR workflow

The above workflow already results in a set of annotated variants that can be sufficient for further analysis. To refine the called variants further, the GATK team recommends performing base quality score recalibration (BQSR). BQSR was therefore implemented in a second workflow, which can optionally be run after the first workflow to further improve the called variants.

  1. The BQSR workflow has to be run within the same directory where the previous workflow was executed. So cd into the project directory first:

    cd /path/to/project_dir/
    
  2. Two files have to be copied from the GitLab repository:

    /path/to/project_dir/variant_calling/SnakefileBQSR
    /path/to/project_dir/variant_calling/configBQSR.yaml (optional)
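
    Again a hypothetical sketch, assuming a local clone of the repository at /path/to/OVarFlow_clone (adjust the source paths to the actual repository layout):

    cp /path/to/OVarFlow_clone/SnakefileBQSR /path/to/project_dir/variant_calling/
    cp /path/to/OVarFlow_clone/configBQSR.yaml /path/to/project_dir/variant_calling/  # optional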
    
  3. The Conda environment that was previously used has to be activated again:

    conda activate /path/to/project_dir/conda_env
    
  4. The input data is automatically detected from the file structure generated by the previous workflow. This includes the following directories and files, which still have to be present:

    03_mark_duplicates/<file_names>.bam
    11_filtered_removed_VCF/variants_filtered.vcf.gz
    processed_reference/<file_name>.fa.gz
    snpEffDB/<directory_name>/<genes.gff, sequences.fa.gz, snpEffectPredictor.bin>
    

    A configuration file like the previously used samples_and_read_groups.csv is therefore neither needed nor used.
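
    A quick way to check that these inputs are still in place before starting (paths taken from the listing above; the wildcards simply match whatever file names your first run produced):

    cd /path/to/project_dir/variant_calling/
    ls 03_mark_duplicates/*.bam \
       11_filtered_removed_VCF/variants_filtered.vcf.gz \
       processed_reference/*.fa.gz \
       snpEffDB/*/snpEffectPredictor.bin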

  5. Fine-tuning of the workflow’s performance is enabled through the configuration file configBQSR.yaml. This file mainly concerns Java heap size and garbage collection threads, which can be optimized for a given computing environment.

  6. The BQSR workflow can now be started like this:

    cd /path/to/project_dir/variant_calling/
    snakemake -p --cores <number_of_desired_threads> -s SnakefileBQSR
    

Warning

Not every version of Snakemake works with OVarFlow. The workflow makes use of so-called checkpoints. Due to a bug introduced in Snakemake versions higher than 5.26.1, checkpoints no longer worked; this bug was fixed in Snakemake 5.31.0. Therefore, explicit software versions were defined in OVarFlow_dependencies_mini.yml. Where desired, the most current software versions can be obtained using the file OVarFlow_dependencies_mini_unversioned.yml.
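
To check which Snakemake version is installed in the currently active environment:

snakemake --version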