Getting Started¶
Important: Be sure to update any values in angle brackets (<...>
). File paths can either be absolute or relative to the working directory (usually where you run your snakemake
command).
Install
conda
(Python version 3.6 or later) with Miniconda or Anaconda. Ideally, create a project-specific conda environment. You can install Git using conda if need be.# Optional: Create a project-specific conda environment # conda create -n <project_name> "python>=3.6" # conda activate <project_name> conda install git
Clone the lcr-modules repository and the lcr-scripts repository.
git clone https://github.com/LCR-BCCRC/lcr-modules.git git clone https://github.com/LCR-BCCRC/lcr-scripts.git
Install the custom
oncopipe
python package included in the lcr-modules repository, which will also install dependencies such assnakemake
andpandas
.cd lcr-modules/ pip install -e oncopipe/
Test your environment with the Demo Snakefile using the following
snakemake
command. If you want to actually run the test, check out Running the Demo Project.cd demo nice snakemake --dry-run --use-conda all
Create a Sample Table as a tab-delimited file with these Required Columns. The following sample table is an example taken from the Demo Project (
demo/data/samples.tsv
).sample_id seq_type patient_id tissue_status genome_build TCRBOA7-N-WEX capture TCRBOA7 normal grch37 TCRBOA7-T-WEX capture TCRBOA7 tumour grch37 TCRBOA7-T-RNA mrna TCRBOA7 tumour grch37 Add the lcr-modules Configuration to your project configuration YAML file (e.g.
config.yaml
). If you have unpaired tumour samples, you will need to add aunmatched_normal_ids
section. Check out Handling Unpaired Tumours section for more information and Within a Configuration File for an example of how this is done.# Update `scratch_directory` if you have a space to store large intermediate files lcr-modules: _shared: lcr-modules: "<path/to/lcr-modules>" lcr-scripts: "<path/to/lcr-scripts>" root_output_dir: "results/" scratch_directory: null
Load the Sample Table, Specify the Input Samples, and add the Reference Files Workflow by including the following lines in your project snakefile anywhere before the module snakefiles (see next step).
import oncopipe as op SAMPLES = op.load_samples("<path/to/samples.tsv>") subworkflow reference_files: workdir: "</path/to/reference_directory/>" snakefile: "<path/to/lcr-modules/workflows/reference_files/1.0/reference_files.smk>" configfile: "<path/to/lcr-modules/workflows/reference_files/1.0/config/default.yaml>"
Note
BCGSC Users
You can set the
reference_files
working directory (workdir
) to the following file path in order to benefit from pregenerated reference files:/projects/bgrande/reference_files
Include and configure the modules you want to run by adding the following lines to your project snakefile.
Important
- Any fields that need to be updated by the user will contain
__UPDATE__
in the default Module Configuration. You will run into errors if you fail to update these fields. - We recommend following the order shown below: (1) load the default module configuration files; (2) load the project-specific configuration file; and (3) include the module snakefiles.
- The following example assumes that these values are updated in the project-specific configuration file. For more information, check out Updating Configuration Values.
# Load the default configuration file for each module configfile: "<path/to/lcr-modules/modules/manta/2.0/config/default.yaml>" configfile: "<path/to/lcr-modules/modules/star/1.0/config/default.yaml>" # ... # Load your project-specific configuration and set the default sample list configfile: "<config.yaml>" config["lcr-modules"]["_shared"]["samples"] = SAMPLES # Include the snakefile for each module include: "<path/to/lcr-modules/modules/manta/2.0/manta.smk>" include: "<path/to/lcr-modules/modules/star/1.0/star.smk>" # ...
- Any fields that need to be updated by the user will contain
Launch snakemake by specifying the module target rule(s). See Snakemake Commands for suggestions on how to run snakemake.
nice snakemake --dry-run --use-conda --cores <cores> _<star>_all _<manta>_all
If you feel comfortable with the above steps, consider reading through the Advanced Usage. For example, you can use Conditional Module Behaviour to set different input file paths for different sequencing data types (e.g.
genome
andmrna
).
Running the Demo Project¶
Before running the Demo Project, you will need to download the Test Data, which is composed of exome and RNA sequencing data for a follicular lymphoma case. Specifically, the tumour sample has both exome and RNA data, whereas the normal sample only has exome data. We acknowledge the Texas Cancer Research Biobank, Baylor College of Medicine Human Genome Sequencing Center, and the study participants for graciously providing these open-access data. These data are described in more detail in the following paper. Important: Before downloading the data, you must first read and agree to the terms laid out in the Test Data README, which is a requirement for redistributing the data.
Becnel, L. B. et al. An open access pilot freely sharing cancer genomic data from participants in Texas. Sci. Data 3:160010 doi: 10.1038/sdata.2016.10 (2016).
You can download the test data and verify the checksums using the following commands. Note that this will overwrite the empty placeholder files. If you don’t have enough space where you cloned the repository, you can download the test data elsewhere and create symbolic links.
cd demo/data/
wget --recursive --no-parent --no-host-directories --cut-dirs=3 --execute robots=off --reject="index.html*" https://www.bcgsc.ca/downloads/lcr-modules/test_data/
md5sum --check checksums.txt
Note
BCGSC Users
You can replace the empty placeholder files with symbolic links to the local copies of the test data using the following command:
ln -sf /projects/bgrande/lcr-modules/test_data/*.{bam,bai,fastq.gz} ./demo/data/
At that point, you can technically run the Demo Snakefile by omitting the --dry-run
option from the command in the Getting Started instructions, but you might want to update the value set in demo/config.yaml
under scratch_directory
to an space where you can readily store large intermediate files (e.g. a directory without snapshots or backups). For more information check out the Common Shared Configuration Fields section.
If you are interested in learning how you can conditionally use the STAR BAM files for RNA-seq samples while using the BAM files in data/
for other samples, check out the Conditional Module Behaviour section.
Reference Files Workflow¶
The reference_files
workflow serves many purposes. In general, it simplifies the deployment of lcr-modules for any reference genome on any computer system. This approach ensures that the steps required to generate any reference file are tracked in a snakefile, ensuring their reproducibility. This is achieved by:
- Downloading the genome FASTA files and any additional reference files (e.g. Gencode transcript annotations).
- Converting the additional files to match the same chromosome system as the genome builds (e.g. UCSC vs NCBI vs Ensembl)
- Generate the required reference files from what was downloaded using snakemake rules.
The snippet below needs to be added to your project snakefile before you include any of the individual module snakefiles. It adds the reference_files
workflow as a sub-workflow in your project snakefile. Essentially, you gain access to the snakemake rules in that workflow, which you can trigger by passing files to the reference_files()
function. In this case, the module will ask for specific reference files (e.g. genome FASTA file, STAR index), and if the file doesn’t exist, the reference_files
sub-workflow will create them. For more details, check out the Snakemake Sub-Workflows documentation.
subworkflow reference_files:
snakefile:
"<path/to/lcr-modules/workflows/reference_files/1.0/reference_files.smk>"
configfile:
"<path/to/lcr-modules/workflows/reference_files/1.0/config/default.yaml>"
workdir:
"</path/to/reference_directory/>"
The configfile
field specifies the path to the default reference_files
configuration YAML file, which contains the information required for the workflow to run. The genome_builds
section is the most important part of this configuration file. It defines the details for each available genome build, including the download URL. This enables the portability and thus reproducibility of lcr-modules. Each genome build also has a version (i.e. grch37
or grch38
) and a provider (e.g. ensembl
, ucsc
). This metadata allows the reference_files
workflow to automatically convert between the chromosome names of different providers (e.g. with and without the chr
prefix).
The workdir
field specifies where the reference files will be created. Optionally, you can set this to a location shared between multiple lcr-modules users to avoid duplicating reference data. If you are considering building a shared reference directory, you might want to consider pre-populating it using the prepare_reference_files.smk
snakefile. This will generate all of the output files that the reference_files
workflow can produce for every genome build listed in the configuration file mentioned above. If you only need one genome build, you can remove the unnecessary genome builds from the configuration file. A bash script is included in the repository to perform this task:
workflows/reference_files/1.0/prepare_reference_files.sh </path/to/reference_directory> <num_cores>
One caveat with the reference_files
workflow is that the rules therein don’t have any names. This is due to a Snakemake limitation. Rule names had to be omitted because this workflow could be included in more than one module loaded by the user and Snakemake doesn’t allow duplicate rule names. As a result, you will be numbered rules (e.g. 1
, 2
, etc.) in your snakemake logs, such as the example shown below:
Job counts:
count jobs
1 1
1 2
1 3
7 _manta_augment_vcf
2 _manta_configure
2 _manta_dispatch
1 _manta_index_bed
3 _manta_input_bam
1 _manta_input_bam_none
3 _manta_output_bedpe
7 _manta_output_vcf
2 _manta_run
3 _manta_vcf_to_bedpe
1 _star_input_fastq
1 _star_output_bam
1 _star_run
1 _star_symlink_sorted_bam
1 _star_symlink_star_bam
1 all
40
lcr-modules Configuration¶
# lcr-modules configuration
config["lcr-modules"]
One of snakemake’s most useful features is the ability to separate the workflow logic in the snakefiles from the tuneable parameters in the configuration files. lcr-modules is configured using Module Configuration and Shared Configuration. All configuration relating to lcr-modules is stored under the "lcr-modules"
key in the snakemake config
variable. This way, there is no risk of messing up any existing configuration created by the user.
Important: For brevity, the configuration under the "lcr-modules"
key will be referred to as “lcr-modules configuration”.
Module Configuration¶
# Module configuration
config["lcr-modules"]["<module_name>"]
Each module in the lcr-modules repository comes bundled with a default configuration to get users started. This module configuration is stored in the config/default.yaml
YAML file in module subdirectory. These YAML files load the module configuration under module name in the lcr-modules configuration. You can see this in the excerpt below, which is taken from the manta
module:
lcr-modules:
manta:
inputs:
# Available wildcards: {seq_type} {genome_build} {sample_id}
sample_bam: "__UPDATE__"
sample_bai: "__UPDATE__"
augment_manta_vcf: "{SCRIPTSDIR}/augment_manta_vcf/1.0/augment_manta_vcf.py"
# ...
The intent behind these module configuration files is that any field can be (and often should be) updated by the user. In fact, some fields must be updated before the module can be run. These are indicated by the __UPDATE__
placeholder in the default configuration file. In the above excerpt, the two input files sample_bam
and sample_bai
are set to __UPDATE__
, indicating that these must be updated by the user.
Important: Before running any module, you must search for any instances of __UPDATE__
in the default configuration file. See Updating Configuration Values for different approaches on how to override the default configuration for each module.
Updating Configuration Values¶
Within a Configuration File¶
If you followed the Getting Started instructions, you should have a section in your project configuration file for lcr-modules
(with at least the _shared
sub-section). One approach to updating configuration values is to add to this section. Important: One requirement for this to work is that you need to load your project configuration file after the default module configuration files. Again, if you followed the Getting Started instructions, this should already be the case.
By the way, there is nothing forcing you to store your project-specific configuration in the same file as the lcr-modules configuration. You can easily have a project.yaml
file loaded near the beginning of your snakefile and a lcr-modules.yaml
file loaded as described in the Getting Started instructions.
One of the main limitations of this approach is that you are restricted to value types that can be encoded in YAML format. For the most part, this means numbers, strings and booleans organized into lists or dictionaries. In other words, this precludes the use of functions as values, such as Input File Functions. If you need to specify functions, you will have to update configuration values Within the Snakefile, or use a hybrid approach.
Note
The value null
is parsed as None
in Python. If you provide the value None
in the YAML file, it will be parsed as the string "None"
.
The example YAML file below is taken from the Demo Configuration. You can see that it includes unmatched_normal_ids
under _shared
to indicate which normal samples to use for unpaired tumours for paired analyses (see Handling Unpaired Tumours). It also updates a number of configuration values for the star
and manta
modules. All of these fields contain __UPDATE__
in the modules’ respective default configuration file. The only exception is the scratch_subdirectories
field for the star
module, which was updated here to include the "mark_dups"
subdirectory such that the final BAM files from the module are also stored in the scratch directory.
# Taken from lcr-modules/demo/config.yaml
lcr-modules:
_shared:
lcr-modules: "../"
lcr-scripts: "../../lcr-scripts"
root_output_dir: "results/"
scratch_directory: "scratch/"
unmatched_normal_ids:
capture--grch37: "TCRBOA7-N-WEX"
star:
inputs:
sample_fastq_1: "data/{sample_id}.read1.fastq.gz"
sample_fastq_2: "data/{sample_id}.read2.fastq.gz"
reference_params:
star_overhang: "99"
scratch_subdirectories: ["star", "sort_bam", "mark_dups"]
manta:
inputs:
sample_bam: "data/{sample_id}.bam"
sample_bai: "data/{sample_id}.bam.bai"
Within the Snakefile¶
You can always update configurations values within the snakefile after the default configuration files have been loaded. The advantage of this approach is that you can update the value to anything that Python allows, including functions. This is incredibly powerful in snakemake thanks to Input File Functions and Parameter Functions. Also, oncopipe
includes some useful functions that make use of these snakemake features (e.g. Conditional Module Behaviour).
The main drawback of this approach is that it can be rather verbose, not very elegant, and as a result, not as readable. For instance, the equivalent of the above YAML file using this approach would look like this:
config["lcr-modules"]["_shared"]["unmatched_normal_ids"] = {
"capture--grch37": "TCRBOA7-N-WEX"
}
config["lcr-modules"]["star"]["inputs"]["sample_fastq_1"] = "data/{sample_id}.read1.fastq.gz"
config["lcr-modules"]["star"]["inputs"]["sample_fastq_2"] = "data/{sample_id}.read2.fastq.gz"
config["lcr-modules"]["star"]["reference_params"]["star_overhang"] = "99"
config["lcr-modules"]["star"]["scratch_subdirectories"] = ["star", "sort_bam", "mark_dups"]
config["lcr-modules"]["manta"]["inputs"]["sample_bam"] = "data/{sample_id}.bam"
config["lcr-modules"]["manta"]["inputs"]["sample_bai"] = "data/{sample_id}.bam.bai"
Alternatively, some of the redundancy can be avoided by using the Snakemake update_config() Function, as follows. However, this alternative approach isn’t much better. It takes up as many (if not more) lines, especially if you format the code to be readable.
import snakemake as smk
smk.utils.update_config(config["lcr-modules"]["_shared"], {
"unmatched_normal_ids": {
"capture--grch37": "TCRBOA7-N-WEX"
}
})
smk.utils.update_config(config["lcr-modules"]["star"], {
"inputs": {
"sample_fastq_1": "data/{sample_id}.read1.fastq.gz",
"sample_fastq_2": "data/{sample_id}.read2.fastq.gz"
},
"reference_params": {"star_overhang": "99"},
"scratch_subdirectories": ["star", "sort_bam", "mark_dups"]
})
smk.utils.update_config(config["lcr-modules"]["star"], {
"inputs": {
"sample_bam": "data/{sample_id}.bam",
"sample_bai": "data/{sample_id}.bam.bai"
}
})
If you want a simpler syntax, you can consider using the Convenience Set Functions. That said, a good compromise might be to store as much of these configuration updates Within a Configuration File (i.e. anything that isn’t a function), and you can update values with functions Within the Snakefile.
Sample Table¶
sample_id | seq_type | patient_id | tissue_status | genome_build |
---|---|---|---|---|
TCRBOA7-N-WEX | capture | TCRBOA7 | normal | grch37 |
TCRBOA7-T-WEX | capture | TCRBOA7 | tumour | grch37 |
TCRBOA7-T-RNA | mrna | TCRBOA7 | tumour | grch37 |
One of the requirements for using lcr-modules is a sample table. This format was selected for its flexibility. Each sample can be annotated with any amount of metadata, but for the purposes of lcr-modules, there are only a few Required Columns. These columns allow the modules to understand the relationship between the samples, especially for tumour-normal pairing.
These requirements are encoded in schemas, which are stored and versioned in schemas/
. These schemas are used in conjunction with Snakemake Validation. If the sample table doesn’t confirm to a schemas that is required by a module, the user will given an informative error message. For example, the list of Required Columns below is encoded in the base-1.0.yaml
schema (located in schemas/base/
). The list of schemas will grow as modules are added with specific metadata requirements (e.g. strandedness of an RNA-seq library for expression quantification).
The only format requirement for the sample table is that it is a pandas DataFrame (i.e. pandas.DataFrame
). Hence, the format of the file on disk doesn’t matter. If you wish to use the oncopipe.load_samples()
convenience function, note that it defaults to parsing tab-delimited files, but this can be overriden using the sep
argument. The advantage of using oncopipe.load_samples()
is that it offers a straightforward method for renaming your columns to comply with the schema(s). See Loading and Renaming Columns for examples.
Entity–Relationship Model¶
Before describing the required columns, it is useful to consider the entities related to each sample, namely patient
, biopsy
, sample
, library
, dataset
, and alignment
. These entities relate to one another in the following ways:
Relationships between entities: Each patient has one or more biopsies (e.g. a tumour biopsy and a blood draw; tumour FF and FFPE biopsies). Each biopsy has one or more nucleic acid samples (e.g. DNA and RNA). Each sample has one or more sequencing libraries constructed from its nucleic acid samples (e.g. whole genome and RNA sequencing libraries for a tumour FF sample). Each sequenced library produces a a set of sequencing reads (i.e. a dataset) with one or more alignments (e.g. an hg19 and hg38 alignments), although there is generally a “canonical” alignment if more than one exists and thus a one-to-one relationship between datasets and alignments.
While the term “sample” generally refers to nucleic acid samples, lcr-modules uses the term to refer to the units of data that serve as input for the module, i.e. usually sequencing data in the form of FASTQ or BAM files. In most projects, there is a simple one-to-one relationship between these files and nucleic acid samples. In more complex projects where nucleic acid samples have more than one data file, the sample IDs will need to incorporate information to prevent duplicates.
Required Columns¶
Check out the Loading and Renaming Columns section if your sample table has some of the required columns under different names. It also features a demonstration of the oncopipe.load_samples()
convenience function you can use to load your TSV/CSV sample table. The Adding and Transforming Columns section is useful if you lack some of the required columns or can derive them from existing columns.
seq_type
– Sequencing data type¶
The most common values for this column are genome
(whole genome sequencing), mrna
(RNA sequencing), capture
(hybridization-capture or exome sequencing), and mirna
(miRNA sequencing). While lcr-modules
can handle any value for seq_type
, the modules are pre-configured for these common values. If you have new seq_type
values, you can configure them for modules of interest; this is explained in the Configuring New Sequencing Data Types section under Advanced Usage.
sample_id
– Sample identifiers¶
Every seq_type
and sample_id
pair must be unique. In other words, if a tumour sample was sequenced using different technologies (e.g. whole genome and RNA sequencing), you can use the same sample ID since eachdata file will have a different seq_type
(e.g. genome
and mrna
, respectively). On the other hand, if you have been naming your samples based on patient ID and you have tumour-normal pairs, you will need to differentiate their sample IDs (e.g. with “T” and “N” suffixes). Similarly, if the same tumour sample has both FF and FFPE data files, you will also need to differentiate their sample IDs (e.g. with “FF” and “FFPE” suffixes).
tissue_status
– Tumour or normal¶
This column classifies the samples as either tumour
(or tumor
) and normal
. This information is required for tumour-normal paired analyses such as somatic variant calling. If you lack a matched normal samples, most modules support being run with an unmatched normal sample with the obvious caveats that the results will not be as clean. Check out Handling Unpaired Tumours for more information on how to achieve this.
patient_id
– Patient identifiers¶
This column groups samples that originate from the same patient, i.e. that share the same underlying germline sequence. This information is primarily used in conjunction with the tissue_status
column to generate all possible tumour-normal pairs from the list of samples.
genome_build
– Reference genome build¶
This column is only required if you have alignment (i.e. samples) using different genome builds. Otherwise, lcr-modules
will assume that the single set of reference data (e.g. lcr-modules/references/hg38.yaml
) that you load is the one to use.
Loading and Renaming Columns¶
For your convenience, the oncopipe.load_samples()
function is provided to easily load your samples as a pandas DataFrame. By default, the function assumes tab-delimited files, but you can change this using the sep
argument. The function can also convert some of your columns to lowercase using the to_lowercase
argument, which is useful to comply with some of the schemas. By default, it converts the tissue_status
column to lowercase. It thus becomes trivial to load a sample table.
import oncopipe as op
SAMPLES = op.load_samples("samples.tsv")
If your sample table uses different column names than those listed in Required Columns, you can use the oncopipe.load_samples()
function to rename your columns. For example, let’s say you already have a sample table, but the sample ID and patient ID columns are named sample
and patient
rather than sample_id
and patient_id
. You can easily achieve this as follows:
import oncopipe as op
SAMPLES = op.load_samples("samples.tsv", sample_id = "sample", patient_id = "patient")
Alternatively, if the column names in your sample table differ systematically from the expected column names, you can rename them by passing a function to the renamer
argument. You can also pass an anonymous lambda
function. For instance, if you use two-letter prefixes with a period delimiter to indicate which entity a column describes (e.g. pt.
for patient-related columns, lb.
for library-related columns, etc.), you can remove the prefix from all columns using a regular expression with the following code:
import re
import oncopipe as op
remove_prefix = lambda x: re.sub(r"[a-z]{2}\.", "", x)
SAMPLES = load_samples("samples.tsv", renamer=remove_prefix)
Adding and Transforming Columns¶
If your sample table is missing a required column that has the same value for every sample (e.g. genome_build
), you can easily add the missing column in your snakefile using standard pandas syntax as follows:
import oncopipe as op
SAMPLES = op.load_samples("samples.tsv")
SAMPLES["genome_build"] = "hg38"
On the other hand, if your sample table is missing a required column that has different values for different samples, you can handle this one of two ways. If you can derive the missing column from existing columns, you can use standard pandas syntax to fill in the missing column. Otherwise, you can always resort to manually adding the missing column in the sample table on disk. The example below shows how the pandas syntax can be used to derive a tissue_status
column by checking whether the sample_id
column ends with the letter “T”.
import oncopipe as op
SAMPLES = op.load_samples("samples.tsv")
SAMPLES["tissue_status"] = SAMPLES["sample_id"].str.endswith("T").map({True: "Tumour", False: "Normal"})
A similar approach can be taken if you have the columns, but they are formatted differently. For instance, if you encoded your sequencing data types as WGS
and Exome
instead of genome
and capture
, respectively, you can use the map()
method to switch to the expected values, as follows:
import oncopipe as op
SAMPLES = op.load_samples("samples.tsv")
SAMPLES["seq_type"] = SAMPLES["seq_type"].map({"WGS": "genome", "Exome": "capture"})
Specify the Input Samples¶
Once you have a sample table, you need to inform the modules which samples they need to run on. Normally, this is accomplished by storing the customized sample table under the "samples"
key in the Module Configuration. Because each module automatically filters for samples whose seq_type
appear in their respective Pairing Configuration, the user doesn’t need to worry about pre-filtering the samples. For example, the user doesn’t need to filter for RNA-seq samples for the star
RNA-seq alignment module. That said, if the user had RNA-seq samples they didn’t want processed by the star
module, they can remove the samples in question and set this pre-filtered sample table as the value for the "samples"
key.
However, since most users will probably want to run all samples through all applicable modules, it is possible to avoid the step of setting the sample table for each module. To skip this step, you can simply set the full sample table under the "samples"
key in the Shared Configuration ("_shared"
). This is the method used in the Getting Started instructions. The Shared Configuration section explains why this works.
Snakemake Commands¶
Note: Don’t forget to update any values in angle brackets (<...>
).
Snakemake Profiles¶
The most convenient way of running snakemake is using snakemake profiles. Each profile contains a YAML file that dictates the default command-line options to use. This way, you don’t have to remember all those snakemake options.
GSC Snakemake Profiles¶
Make sure you first install the custom GSC snakemake profiles using these instructions. Then, you can use each profile using these commands.
Explicit Commands¶
If you prefer to spell out all of the command-line options in your snakemake commands, example commands are included below. These may eventually become out of sync with the above snakemake profiles. Feel free to compare with the list of arguments for local usage or cluster usage.
Local Usage¶
# See below for determining <cores>
nice snakemake --printshellcmds --use-conda --cores <cores> <targets>
Cluster Usage¶
nice snakemake --cluster-sync "srun --partition=all --ntasks=1 --nodes=1 --output=none --error=none --job-name={rule} --cpus-per-task={threads} --mem={resources.mem_mb}" --max-jobs-per-second=5 --max-status-checks-per-second=10 --local-cores=1 --latency-wait=120 --jobs=1000 --default-resources="mem_mb=2000" --printshellcmds --use-conda <targets>
Extra information¶
Determining Value for --cores
¶
To determine the number of cores to grant to snakemake, compare the number of installed cores and the current load on the server. These values can either be obtained precisely using the commands below, or they can be estimated by looking at the output of the htop
command. I generally select a value for --cores
equal to the number of installed cores minus the server load minus 10-20 to leave some buffer.
# Get the number of installed logical cores
nproc
# Get the average server load over the past 5 minutes
cut -d " " -f 2 /proc/loadavg
Increasing ulimit
¶
snakemake tends to spawn A LOT of processes and open A LOT of files depending on the number of running and pending jobs. You may eventually start running into cryptic errors about processors not being able to start or files not being able to be opened. This happens when you run into user limits. You can get around this issue by increasing the user limits with the ulimit
command. However, there are hard limits set by administrators that determine the maximum permitted for non-admin users. You can always ask your administrators to increase these hard limits for certain machines to run snakemake.
GSC ulimit
Setup¶
GSC users can include the following code in their .bashrc
file to increase their ulimits based on the server. Notice how the n104
numbers head node has a much higher hard limit than the other head nodes. This is because it was manually increased when n104
was the only head node. For this reason, it is recommended that GSC users specically log into n104
instead of numbers
, which will assign you to a random head node.
# Only change these values for interactive shells
if [[ $- == *i* ]]; then
if [[ "$HOSTNAME" == "n104" ]]; then
# Change the max number of processes
ulimit -u 32768
# Change the max number of file descriptors
ulimit -n 100000
fi
fi
Creating nice
Processes¶
You will notice that the snakemake
commands below are all prepended with nice
. Briefly, this has the effect of lowering the priority of your snakemake process. Now, you’re probably wondering why would you ever want to do that. Granted, compute resources should be utilized on a first come, first served basis, but in practice, not every user will pay close attention to who is already running jobs on a server.
Ultimately, it doesn’t matter whether this act is intentional, an accident, or due to insufficient knowledge of how to manage shared compute resources. If someone launches a job that uses more cores than are available, your snakemake process will be competing for CPU time, and this will make both processes take longer to complete.
In this situation, we should fall back on the motto from the wise Michelle Obama: “When they go low, we go high.” In this case, we follow this rule quite literally, because the nice
command will increase the “niceness” value of your snakemake process, which will cede CPU time to competing processes with lower (usually default) “niceness” values until they’re done.
Submitting Cluster Jobs Remotely¶
It is possible to submit jobs to a cluster remotely via SSH. This could be useful in situations where you have quick jobs that you don’t want to submit to the cluster, but you also don’t want to run locally on the cluster head node. Important: This section assumes that you have SSH keys set up, allowing SSH login to the head node without entering a password.
The command below differs from the explicit command above simply by prepending the srun
command in --cluster-sync
with ssh <head_node>
, where <head_node>
is the cluster head node where you run srun
normally. You can now increase the value for --local-cores
(see above for how to determine this value).
nice snakemake --local-cores=<cores> --cluster-sync "ssh <head_node> srun --partition=all --ntasks=1 --nodes=1 --output=none --error=none --job-name={rule} --cpus-per-task={threads} --mem={resources.mem_mb}" --max-jobs-per-second=5 --max-status-checks-per-second=10 --latency-wait=120 --jobs=1000 --default-resources="mem_mb=2000" --printshellcmds --use-conda <targets>
Advanced Usage¶
Directory Placeholders¶
When specifying any value in the module configuration, you can use the following shorthands as placeholders in the string. They will be replaced with the actual values dynamically. See the Conditional Module Behaviour. section below for example usage.
{REPODIR}
: Thelcr-modules
repository directory. This corresponds to therepository
value under_shared
in thelcr-modules
configuration.{MODSDIR}
: The current module subdirectory. This corresponds to{REPODIR}/modules/<name>/<version>
.{SCRIPTSDIR}
: Thelcr-scripts
repository directory. This corresponds to thelcr-scripts
value under_shared
in thelcr-modules
configuration.
Convenience Set Functions¶
The Getting Started instructions demonstrate that everything is configured using the same snakemake config
nested dictionary object, generally under the "lcr-modules"
key. While transparent, it results in verbose code, such as:
config["lcr-modules"]["manta"]["inputs"]["sample_bam"] = "data/{sample_id}.bam"
Alternatively, you can use the so-called convenience “set functions” to simplify the code somewhat. In order to use them, you must first enable them. Behind the scenes, the snakemake config
object is stored internally for easy access. Note that you only need to use the oncopipe.enable_set_functions()
function once.
import oncopipe as op
op.enable_set_functions(config)
The first set function you can use is oncopipe.set_samples()
, which sets the samples you want to use at the shared or module level. The first argument corresponds to the module name (or "_shared"
), and all subsequent arguments should be sample tables each formatted as a pandas DataFrame. This function automatically concatenates the data frames that are provided. Here, SAMPLES
is the complete sample table, whereas GENOMES
and CAPTURES
are sample subsets generated from SAMPLES
using oncopipe.filter_samples()
.
import oncopipe as op
op.set_samples("_shared", SAMPLES)
op.set_samples("_shared", GENOMES, CAPTURES)
The second function you can use is oncopipe.set_input()
, which sets the given input for a module. Just like oncopipe.set_samples()
, the first argument is the module name, but this function should not be used for "shared"
. The second argument is the name of the input file as listed in the module’s configuration file. Lastly, the third argument is the value you wish to provide for that input file, which generally is a string value containing the available wildcards (once again, as listed in the module’s configuration file). That said, you could provide a conditional value as described below in Conditional Module Behaviour.
import oncopipe as op
op.set_input("manta", "sample_bam", "data/{sample_id}.bam")
Currently, only samples
and inputs
can be updated using these convenience set functions, but a more general set function will be implemented soon.
Conditional Module Behaviour¶
Sometimes, a parameter or input file depends on some sample attribute. In snakemake, this kind of conditional behaviour is usually achieved with Input File Functions and Parameter Functions. oncopipe
offers two convenience functions that utilize these features to conditionally provide values depending on the value of a wildcard or in a column of the sample table, namely oncopipe.switch_on_wildcard()
and oncopipe.switch_on_column()
. These functions are useful for both module users and module developers. Read their documentation for more details, which you can access by clicking the function names above.
Essentially, both functions create switches, which return a value based on a sample attribute, which can be stored in a wildcard or a column in the sample table, respectively. For example, oncopipe.switch_on_wildcard()
takes in two arguments. The first argument is the name of the wildcard that contains the sample attribute on which the behaviour should be based. The second argument is a dictionary mapping possible values for that wildcard to the values that should be returned in the event of each wildcard. This is best illustrated by the example below. For more information, you can also check out the Conditional Module Behaviour section in the developer documentation.
Here, I am updating the Demo Snakefile such that the manta
module uses BAM files from different sources depending on the sequencing data type (seq_type
). Specifically, I want to configure the module to use the BAM files in the data/
directory unless the seq_type
is mrna
(i.e. RNA-seq data), in which case the module should use the BAM file produced by the star
module. Since seq_type
is a wildcard in our file names, we can use the oncopipe.switch_on_wildcard()
function.
For simplicity, I am only updating the sample_bam
input file, but the same would have to be done for the sample_bai
input file. Also, this code assumes you have import oncopipe as op
somewhere beforehand in your snakefile.
MANTA_BAM_SWITCH = op.switch_on_wildcard("seq_type", {
"genome": "data/{sample_id}.bam",
"capture": "data/{sample_id}.bam",
"mrna": "results/star-1.0/99-outputs/bam/mrna--{genome_build}/{sample_id}.bam",
})
config["lcr-modules"]["manta"]["inputs"]["sample_bam"] = MANTA_BAM_SWITCH
While this is a good start, the redundancy between genome
and capture
is less than ideal. To address this, we can use the _default
key to, well, set a default value. In other words, we can achieve the same thing using the code below.
MANTA_BAM_SWITCH = op.switch_on_wildcard("seq_type", {
"_default": "data/{sample_id}.bam",
"mrna": "results/star-1.0/99-outputs/bam/mrna--{genome_build}/{sample_id}.bam"
})
config["lcr-modules"]["manta"]["inputs"]["sample_bam"] = MANTA_BAM_SWITCH
In the code above, I knew where the star
module was going to output the BAM file. Alternatively, you can use the useful rules
variable in snakemake. This highlights another reason why it’s sometimes useful to update configuration values Within the Snakefile. Important: This will only work if you include the star
module before the code below. In other words, the code snippet will only work if you add it between the star
include and the manta
include.
MANTA_BAM_SWITCH = op.switch_on_wildcard("seq_type", {
"_default": "data/{sample_id}.bam",
"mrna": rules._star_output_bam.output.bam
})
config["lcr-modules"]["manta"]["inputs"]["sample_bam"] = MANTA_BAM_SWITCH
Lastly, we can make use of the Convenience Set Functions and simplify the code just a bit more. If we also update sample_bai
, the final code would look like the following, added after the star
include and before the manta
include.
op.enable_set_functions(config)
MANTA_BAM_SWITCH = op.switch_on_wildcard("seq_type", {
"_default": "data/{sample_id}.bam",
"mrna": rules._star_output_bam.output.bam
})
MANTA_BAI_SWITCH = op.switch_on_wildcard("seq_type", {
"_default": "data/{sample_id}.bam.bai",
"mrna": rules._star_output_bam.output.bam + ".bai
})
op.set_input("manta", "sample_bam", MANTA_BAM_SWITCH)
op.set_input("manta", "sample_bai", MANTA_BAI_SWITCH)
For more information, check out Conditional Module Behaviour.
Configuring New Sequencing Data Types¶
This section is a work in progress, but you should be able to get started by reading the Pairing Configuration section in the developer documentation. It’s not recommended to add new sequencing data types (seq_type
) under the _shared
key because that will trigger all modules to try to run on the new seq_type
. It’s best to configure the seq_type
on a module-by-module basis.