Benchmarking & Optimizations

Gallus gallus (chicken) was used as the test organism. Its reference genome, GRCg6a, is of reasonable quality and of moderate size (approx. 1.07 Gbp). The exact file versions used are:

  • Reference genome: GCF_000002315.6_GRCg6a_genomic.fna.gz

  • Reference annotation: GCF_000002315.6_GRCg6a_genomic.gff.gz

Whole-genome sequencing (WGS) data were obtained from the European Nucleotide Archive (ENA), which offers direct download of FASTQ files. The study PRJNA306389 provides WGS data sets of varying sequencing depth, two of which were chosen:

  • Base count: 38,136,658,250, average coverage after mapping: 34-fold
  • Base count: 18,799,906,500, average coverage after mapping: 16-fold
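
As a rough plausibility check, the raw base counts can be divided by the genome size to estimate the sequencing depth before mapping. The sketch below is not part of the original workflow: the helper name expected_coverage is hypothetical, and the genome size of 1.07 Gbp is the approximate value quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OVarFlow): estimate raw sequencing depth
# as total sequenced bases divided by genome size.
# Usage: expected_coverage <base count> <genome size in bp>
expected_coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}

# Genome size is the approximate 1.07 Gbp stated for GRCg6a (assumption).
expected_coverage 38136658250 1070000000   # prints 35.6
expected_coverage 18799906500 1070000000   # prints 17.6
```

Both estimates lie slightly above the coverage measured after mapping (34-fold and 16-fold), which is the expected direction, since not every sequenced base ends up aligned to the reference.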

All calculations were performed on a virtual machine provided by the German Network for Bioinformatics Infrastructure (de.NBI), at the de.NBI cloud location of the Justus-Liebig-University Gießen. The virtual machine offered 28 cores and 64 GB of main memory.

Exact CPU specification as provided by lscpu:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              28
On-line CPU(s) list: 0-27
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           28
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               61
Model name:          Intel Core Processor (Broadwell)
Stepping:            2
CPU MHz:             2593.906
BogoMIPS:            5187.81
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-27
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat

The GATK version, as reported by gatk --version, was:

The Genome Analysis Toolkit (GATK) v4.1.7.0
HTSJDK Version: 2.21.2
Picard Version: 2.21.9

The Java version, as reported by java -version, was:

openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

The bwa version, as reported by bwa, was:

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188

The samtools version, as reported by samtools --version, was:

samtools 1.10
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.

Resource usage of a specific process was monitored every 3 seconds via the command:

ps -p <pid of process> -o rss,%mem,%cpu | tail -1
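
The 3-second polling around this command could be implemented along the following lines. This is a minimal sketch, not the code from the OVarFlow repository: the function name monitor_pid, its optional interval argument, and the use of kill -0 to detect that the process has ended are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not taken from OVarFlow):
# sample rss, %mem and %cpu of one process until it terminates.
# Usage: monitor_pid <pid> [interval in seconds, default 3]
monitor_pid() {
    pid=$1
    interval=${2:-3}
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        # Same command as above; tail -1 drops the header line of ps.
        ps -p "$pid" -o rss,%mem,%cpu | tail -1
        sleep "$interval"
    done
}

# Example: record the usage of a background job into a log file.
# some_long_running_command &
# monitor_pid "$!" 3 > usage.log
```

Each output line holds three columns: resident set size (reported in kilobytes by ps on Linux), share of physical memory, and CPU usage. Note that with procps on Linux, %cpu is the CPU time used divided by the time the process has been running, so it reflects an average over the process lifetime rather than the load within a single 3-second interval.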

Further code details can be found in the OVarFlow repository. No other demanding computations were running while resource usage was being recorded.