Advanced usage topics

The previous section was a broad introduction to the general usage of OVarFlow. The commands were executed on a single computer or server, and program execution was assumed to proceed flawlessly. However, that is the ideal use case. This section is dedicated to the execution of OVarFlow in more advanced computing environments and gives advice in case of failure. Furthermore, advice is given on how to achieve reproducibility even on different computational hardware.

Cluster usage - SGE

Variant calling is computationally demanding. Common desktop computers are by no means suitable for this task; at the very least, high end desktop (HEDT) CPUs are required, offering two dozen cores or even more. Of course, the more resources are made available, the quicker the computation can be finished. Therefore, usage of a compute cluster should also be considered.

Obviously this section is dedicated to the cluster usage of OVarFlow. As OVarFlow is based upon Snakemake, all options provided by Snakemake for cloud computing and cluster usage are available as well. OVarFlow has successfully been used with the Sun Grid Engine (SGE) (Son of Grid Engine is its successor). The following examples have to be adapted for use with other cluster management software.

On a single large cluster node

One issue with GATK is that some of its applications cause rather short load peaks, while consuming much more moderate resources during the majority of their runtime. When using a compute cluster, a general rule of thumb is to request the maximum amount of resources that is to be expected during runtime. Requesting the maximum resource requirements will prevent a slowdown of the cluster node, but it also wastes the resources that sit idle most of the time.

One idea to handle this contradiction is to reserve all resources of one cluster node for exclusive use by OVarFlow. Peak loads are more or less ignored within this approach. OVarFlow is then configured to utilize all available resources most of the time. So if there is a slowdown within the used cluster node due to peak loads, only OVarFlow will be affected, and other users are not impeded.

This approach has successfully been used on an SGE based cluster. If you’re using a different cluster management software, you’ll have to adapt the steps accordingly. Complex cluster jobs are best submitted using a submit shell script. At the beginning of such a script the resource requests can be configured. If chosen correctly, this will result in the exclusive reservation of the cluster node for OVarFlow.

This procedure is only recommended if reasonably capable cluster nodes are available. In a test, a single cluster node provided:

  • 40 threads (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz)

  • 256 GB of memory

By requesting 40 cores (<parallel_environment_name> has to be replaced with your local configuration), the node was exclusively reserved. The hostname= directive additionally lets you choose a certain target host; the use of wildcards allows for the selection of a range of cluster nodes. An example of a complete submit script is listed below:

#!/bin/bash

#$ -q queue_name.q
#$ -cwd
#$ -V
#$ -o output.txt
#$ -e error.txt
#$ -pe <parallel_environment_name> 40
#$ -l virtual_free=6000M
#$ -l hostname=<node-prefix*>

# write the name of the execution host to a file
hostname > TIME_STAMP
# write down the starting time
echo "start at:" >> TIME_STAMP
date >> TIME_STAMP

# the Conda environment has to be activated;
# in older Conda versions this was done using the following command:
#. activate /path/to/project_dir/conda_env
# now you use:
conda activate /path/to/project_dir/conda_env

snakemake -p --cores 40

# write down the end time
echo "end time:" >> TIME_STAMP
date >> TIME_STAMP

With SGE the actual submission of the job to the cluster is then achieved through a single command:

qsub <submit_script_name>
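
The status of the submitted job can then be monitored with SGE’s qstat command; a minimal example, restricting the output to your own jobs via the $USER shell variable:

qstat -u "$USER"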

On an entire cluster

In case potential load peaks caused by GATK are of no concern, running OVarFlow on an entire cluster is possible as well, thereby unleashing all of the resources of the entire cluster to OVarFlow. This is probably the most efficient way of using OVarFlow.

The ability for cluster execution is built into Snakemake. However, precautions have to be taken so that the Conda environment is available to the cluster jobs. One option is to export the current environment with the -V option of the qsub command:

conda activate /path/to/project_dir/conda_env
cd /path/to/project_dir/variant_calling
snakemake -p --cores 20 --cluster 'qsub -V -cwd -b y -pe <parallel_environment> {threads}'

The above command will automatically use the number of {threads} for each command as defined in the workflow, while using a maximum of 20 threads in parallel across all currently submitted jobs. Certainly, the <parallel_environment> argument must be replaced with the respective name of your local cluster environment. One drawback is that the cluster management software will write four log files for each submitted job into the current directory. To tidy things up, it is advisable to create a log directory and include it in the cluster submission command:

conda activate /path/to/project_dir/conda_env
cd /path/to/project_dir/variant_calling
mkdir logs_cluster
snakemake -p --cores 20 --cluster 'qsub -V -cwd -o logs_cluster -e logs_cluster -b y -pe <parallel_environment> {threads}'

One thing to consider is that the above command will reveal all of the user’s environment variables publicly (qstat -j <job_number>, under the entry env_list).
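
If in doubt, this can be verified for a submitted job with standard SGE and shell tools, for example:

qstat -j <job_number> | grep env_list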

Cluster usage - Slurm

Slurm is an alternative cluster management and job scheduling system. In recent years it has gained popularity over SGE, which suffers from maintenance issues. Fortunately, Snakemake works with Slurm just as well as with SGE.

On a single large cluster node

Submitting the entire workflow to a single cluster node is quite simple with Slurm. The srun command accomplishes this purpose. Slurm’s job submission system automatically exports the user’s current environment (default option --export=ALL). To benefit from this feature, the corresponding Conda environment should be activated first. Afterwards, the workflow can be tested through a dry run:

conda activate /path/to/project_dir/conda_env
srun -c 28 --mem 50g snakemake -np --cores 26 --resources mem_gb=48

The above command also illustrates how resource requests can be made. Many cluster systems are very strict about resource requests. In this case, a job won’t be allowed to use more CPU resources than requested, and if a job consumes more memory than requested, it might even be terminated. The above command performs resource requests in two ways. First, 28 cores and 50 GB of memory are requested from the cluster management system (the values used are just an example and need to be adjusted to the size of the given dataset). Second, to ensure that the requested resources are not exceeded, the Snakemake scheduler is configured to use slightly fewer resources. It should be noted that resource requests are neither automatically transferred nor matched between Slurm and Snakemake. Therefore, both must be set manually. Once the dry run has been successfully executed, the actual workflow can be started:

srun -c 28 --mem 50g snakemake -p --cores 26 --resources mem_gb=48

On an entire cluster

Of course, when dealing with very large datasets, it is not reasonable to restrict oneself to a single cluster node. Snakemake itself is capable of handling such large scale submissions, but on the Slurm side, the sbatch command has to be used this time:

conda activate /path/to/project_dir/conda_env
snakemake --default-resources mem_gb=16 -p --cores 200 --cluster 'sbatch -c {threads} --mem {resources.mem_gb}G'

Again, the Conda environment is activated first. This time, no dry run is performed, but the workflow is executed directly. Depending on the size of the given dataset and cluster, a reasonably large number of cores should be used. Snakemake automatically schedules resource requests ({threads} and {resources.mem_gb}) for each individual job. These resource requests are based upon the settings made in the config.yaml file. For instance, memory requests are inferred from the configured heap size and Java memory overhead (see configuration section). By this mechanism, reasonable settings are selected for most jobs. Not every rule of the workflow has a predefined memory requirement; for these rules, a default of 16 GB is requested in the above command.
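
To illustrate the placeholder mechanism: for a hypothetical job whose rule defines 4 threads and a memory requirement of 8 GB, Snakemake would expand the --cluster template into a submission along the following lines (the jobscript name and values are purely illustrative):

sbatch -c 4 --mem 8G /path/to/project_dir/variant_calling/.snakemake/tmp.<random>/snakejob.<rule_name>.<jobid>.sh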

Troubleshooting

OVarFlow has been designed for easy and automatic execution of the variant calling workflow. As the underlying processes are quite complex and involve a variety of software tools, runtime errors cannot be excluded within the tools that OVarFlow is ultimately relying on. Furthermore, the respective computational environment can be a source of failure as well. Variant calling will involve high performance computing most of the time. This involves a variety of hardware resources, with servers dedicated to computation and others dedicated to data storage. Failure in this environment might result in unavailability of data, causing a running calculation of OVarFlow to fail and ultimately leading to termination of the workflow.

The good news is that, thanks to its Snakemake basis, OVarFlow can often recover from such situations, although manual intervention might be needed under some circumstances.

Often Snakemake itself can pick up an interrupted workflow. Executing a dry run might give first insights:

snakemake -np --rerun-incomplete

If this command succeeds, the real execution can be performed, of course while specifying a reasonable number of threads (--cores) to accelerate calculations. It might also be interesting to see the reason why a specific command is executed (--reason):

snakemake -p --cores <number_of_threads> --rerun-incomplete --reason

If Snakemake was interrupted previously, it might block re-execution (a so-called lock):

Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying
to create the same files in the following directory:
/path/to/project_dir/variant_calling
If you are sure that no other instances of snakemake are running on this directory, the
remaining lock was likely caused by a kill signal or a power loss. It can be removed with the
--unlock argument.

In such a situation the lock can be removed and the workflow can be rerun:

snakemake --unlock
snakemake -p --cores <number_of_threads> --rerun-incomplete --reason

Sometimes more manual interaction is necessary. First of all, it’s good to know in which step the error occurred. Therefore, the following points should be checked:

  • The console log of the commands that were executed by Snakemake.

  • The log messages of every single command, written within the log directory.

  • The files and directories that were already created by the workflow.

  • Is every created file complete? Compared with other files of the same type, are the file sizes reasonable, and are the expected index files (e.g. .bam and .bam.bai) present? (A small check is sketched below.)
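
Such a check might, for example, look as follows. This is a minimal sketch, assuming samtools is available in the Conda environment; the directory name is taken from the error identification example further below:

# compare file sizes across samples of the same type
ls -lh 03_mark_duplicates/*.bam
# verify that every BAM file is intact and carries a proper EOF marker
samtools quickcheck -v 03_mark_duplicates/*.bam && echo "all BAM files look ok"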

If certain files are corrupted, those can be removed manually via the rm command. In cases where such a disaster recovery is not possible, the whole workflow can be started anew by removing the directories that were created (including logs and .snakemake) and restarting the workflow:

rm -rf 00_FastQC [01_...] logs .snakemake
snakemake -p --cores <number_of_threads>

Error identification

The ability to restart a failed workflow can be very helpful. But this feature is only useful if the problem that caused the workflow to fail can be identified. Here, an example of a failed workflow is shown, together with how to identify the causative problem.

The workflow was terminated with the following final messages:

...
Finished job 3039.
1111 of 3097 steps (36%) done
[Sat Jul 10 03:56:53 2021]
Finished job 2816.
1112 of 3097 steps (36%) done
[Sat Jul 10 04:11:46 2021]
Finished job 2939.
1113 of 3097 steps (36%) done
[Sat Jul 10 04:20:47 2021]
Finished job 2814.
1114 of 3097 steps (36%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /path/to/project_dir/.snakemake/log/2021-07-08T142952.561603.snakemake.log
(conda: conda_env)user@host:/path/to/project_dir/variant_calling$

Snakemake returned to the shell prompt after an exhaustive series of jobs had finished, one of which had failed. This resulted in the above error message: only the already running jobs were completed, and no further jobs were started. The actual error message of the failed job is not shown here. It can be identified either by scrolling upwards until the message appears, or by viewing the stated log file (which is a copy of the messages shown on the shell).
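
Instead of scrolling manually, the stated log file can also be searched with standard tools; a minimal sketch, where the number of context lines is just an example:

grep -A 12 'Error in rule' /path/to/project_dir/.snakemake/log/2021-07-08T142952.561603.snakemake.log

In doing so, a more comprehensive error message is found: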

...
Error in rule mark_duplicates:
  jobid: 769
  output: 03_mark_duplicates/sample-82.bam, 03_mark_duplicates/sample-82.txt
  log: logs/03_mark_duplicates/sample-82.log (check log file(s) for error message)
  shell:
      export _JAVA_OPTIONS=-Xmx2G
      gatk --java-options -XX:ParallelGCThreads=2 MarkDuplicates -I 02_sort_gatk/sample-82.bam -O 03_mark_duplicates/sample-82.bam -M 03_mark_duplicates/sample-82.txt -MAX_FILE_HANDLES 300 --TMP_DIR ./GATK_tmp_dir/ 2> logs/03_mark_duplicates/sample-82.log
      (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
  cluster_jobid: Your job 963665 ("snakejob.mark_duplicates.769.sh") has been submitted

Error executing rule mark_duplicates on cluster (jobid: 769, external: Your job 963665 ("snakejob.mark_duplicates.769.sh") has been submitted, jobscript: /path/to/project_dir/variant_calling/.snakemake/tmp.jz4k0i0y/snakejob.mark_duplicates.769.sh). For error details see the cluster log and the log files of the involved rule(s).
...

This error message informs us about the exact job and the rule that failed. In this case, a job spawned from the rule mark_duplicates failed. That’s a step forward, but the actual error message is still hidden in the reported log file logs/03_mark_duplicates/sample-82.log.
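
The tail end of this log file can be displayed with a standard command; showing 20 lines is just an example:

tail -n 20 logs/03_mark_duplicates/sample-82.log

This reveals the following output: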

[Fri Jul 09 14:07:25 CEST 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 45.92 minutes.
Runtime.totalMemory()=1908932608
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.String.substring(String.java:1969)
        at picard.sam.util.ReadNameParser.getLastThreeFields(ReadNameParser.java:146)
        at picard.sam.util.ReadNameParser.addLocationInformation(ReadNameParser.java:83)
        at picard.sam.markduplicates.MarkDuplicates.buildReadEnds(MarkDuplicates.java:661)
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:552)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:301)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

The above error message is a Java stack trace caused by the GATK tool MarkDuplicates. It informs us about the root cause of the problem: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded. The OutOfMemoryError tells us that insufficient Java heap space was provided for this specific job. This problem can easily be fixed within the workflow by providing a larger heap space to MarkDuplicates in the configuration file config.yaml (also see the section “Configuration & adaptation”). The required entry could look like this:

heapSize:
    MarkDuplicates : 4

Thereby, the standard heap size would be doubled. After this simple fix, the workflow can be restarted and finished. Generally, the workflow tries to strike a balance between resource efficiency and broad applicability. In some special cases, individual GATK applications need to be provided with more resources, which would be wasteful when evaluating smaller datasets.
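
The restart itself follows the pattern already shown in the troubleshooting section, for example:

snakemake -p --cores <number_of_threads> --rerun-incomplete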

Reproducibility

Reproducibility has been among the primary goals of OVarFlow. Three components of OVarFlow serve this purpose:

  • the workflow itself (aka the Snakefile),

  • the CSV file to document and configure data evaluation,

  • the Conda environment with its specific program versions.

The yml file (OVarFlow_dependencies.yml) only includes the major dependencies of OVarFlow. To automatically obtain the latest program versions for new variant callings, no specific program versions are denoted there. To achieve perfect reproducibility, all installed programs as well as their versions have to be obtained from a given Conda environment. This is part of the management of Conda environments. Basically, this boils down to the creation of a yml file of the given environment, which includes all programs and their version numbers:

conda activate /path/to/project_dir/conda_env
conda env export > conda_environment.yml

The Conda environment can easily be recreated from this yml file:

conda env create -f conda_environment.yml
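
Afterwards, the installed programs and their versions can be verified; a minimal check, where the environment name is taken from the name: entry of the yml file and will differ locally:

conda activate <environment_name>
conda list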