oncopipe package

Module contents

Creates an absolute symlink from any working directory.

Parameters:
  • src (str) – The source file or directory path.
  • dest (str) – The destination file path. This can also be a destination directory, and the destination symlink name will be identical to the source file name (unless directory).
  • overwrite (boolean) – Whether to overwrite the destination file if it exists.
oncopipe.as_one_line(text)

Collapses a triple-quoted string to one line.

Line endings do not need to be escaped like in a shell script. Spaces and tabs are stripped from each side of each line to remove the indentation included in triple-quoted strings.

This function is useful for long shell commands in a Snakefile, especially if they contain quotes that would otherwise need to be escaped (e.g., in an awk command).

Returns: A single line (i.e., without line endings) of text.
Return type: str
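
As a rough illustration of the behaviour described above, the following sketch collapses a triple-quoted shell command. The name as_one_line_sketch is hypothetical and this is not the oncopipe implementation; in particular, dropping blank lines and joining with single spaces are assumptions.

```python
def as_one_line_sketch(text):
    """Collapse a multi-line string into one line (illustrative only)."""
    # Strip spaces and tabs from both sides of each line.
    stripped = [line.strip(" \t") for line in text.splitlines()]
    # Assumption: blank lines are dropped and lines are joined with one space.
    return " ".join(line for line in stripped if line)


# The awk quotes need no escaping inside the triple-quoted string.
command = as_one_line_sketch("""
    awk 'BEGIN {FS="\\t"} $3 == "tumour" {print $1}'
        samples.tsv
    > tumour_ids.txt
""")
print(command)
```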
oncopipe.check_for_none_strings(config, name)

Warn the user if ‘None’/‘null’ strings are found in config.

oncopipe.check_for_update_strings(config, name)

Warn the user if ‘__UPDATE__’ strings are found in config.

oncopipe.check_reference(module_config, reference_key=None)

Ensure that a required reference config (and file) is available.

If there is no ‘genome_build’ column in module_samples and there is only one loaded reference, this function will assume that the loaded reference is the reference to be used.

Parameters:
  • module_config (dict) – The module-specific configuration, corresponding to config['lcr-modules']['<module-name>'].
  • reference_key (str, optional) – The key for a required reference file.
Returns:

Return type:

None

oncopipe.cleanup_module(module_config)

Save module-specific configuration, samples, and runs to disk.

oncopipe.combine_lists(dictionary, as_dataframe=False)

Merges lists for matching keys in nested dictionary.

Parameters:
  • dictionary (dict) –

    Nested dictionaries where the key names match up.

    {'genome': {'field1': [1, 2, 3],
                'field2': [4, 5, 6]},
    'mrna': {'field1': [11, 12, 13],
            'field2': [14, 15, 16]}}
    
  • as_dataframe (boolean, optional) – Whether the return value is coerced to pandas.DataFrame.
Returns:

The type of the return value depends on as_dataframe. If as_dataframe is False, the output will look like:

{'field1': [1, 2, 3, 11, 12, 13],
'field2': [4, 5, 6, 14, 15, 16]}

If as_dataframe is True, the output will look like:

    field1  field2
0       1       4
1       2       5
2       3       6
3      11      14
4      12      15
5      13      16

Return type:

dict or pandas.DataFrame
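
The dictionary case (as_dataframe=False) can be sketched in plain Python as follows. The name combine_lists_sketch is hypothetical, and the sketch makes no claim about how oncopipe implements the merge.

```python
def combine_lists_sketch(dictionary):
    """Merge the lists stored under matching inner keys (illustrative only)."""
    combined = {}
    for inner in dictionary.values():
        for key, values in inner.items():
            # Extend the list for this key, creating it on first sight.
            combined.setdefault(key, []).extend(values)
    return combined


nested = {
    "genome": {"field1": [1, 2, 3], "field2": [4, 5, 6]},
    "mrna": {"field1": [11, 12, 13], "field2": [14, 15, 16]},
}
combined = combine_lists_sketch(nested)
print(combined)
```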

oncopipe.create_formatter(wildcards, input, output, threads, resources, strict)

Create formatter function based on rule variables.

oncopipe.discard_samples(samples, **filters)

Convenience wrapper around filter_samples.

oncopipe.enable_set_functions(config)

Enable the set_* oncopipe convenience functions.

Parameters:config (dict) – The Snakemake configuration nested dictionary.
oncopipe.filter_samples(samples, invert=False, **filters)

Subsets the rows to those with certain values in the given columns.

Parameters:
  • samples (pandas.DataFrame) – The samples.
  • invert (boolean) – Whether to keep or discard samples that match the filters.
  • **filters (key-value pairs) – Columns (keys) and the values they need to contain (values). Values can be any value or a list of values.
Returns:

A subset of rows from the input data frame.

Return type:

pandas.DataFrame
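
To make the filter semantics concrete, here is a plain-Python analogue that works on a list of row dictionaries rather than a pandas.DataFrame. The name filter_rows_sketch and the sample IDs are hypothetical. It assumes that a row must satisfy every filter to match, and that invert=True discards matching rows; the real function may combine multiple filters differently.

```python
def filter_rows_sketch(rows, invert=False, **filters):
    """Keep (or, with invert=True, discard) rows matching all filters."""

    def matches(row):
        for column, allowed in filters.items():
            # A filter value can be a single value or a list of values.
            if not isinstance(allowed, (list, tuple, set)):
                allowed = [allowed]
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row) != invert]


rows = [
    {"sample_id": "A-T", "seq_type": "genome", "tissue_status": "tumour"},
    {"sample_id": "A-N", "seq_type": "genome", "tissue_status": "normal"},
    {"sample_id": "B-T", "seq_type": "mrna", "tissue_status": "tumour"},
]
kept = filter_rows_sketch(rows, seq_type=["genome", "capture"], tissue_status="tumour")
discarded = filter_rows_sketch(rows, invert=True, seq_type="mrna")
```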

oncopipe.generate_pairs(samples, unmatched_normal_ids=None, **seq_types)

Generate tumour-normal pairs using sensible defaults.

Each sequencing data type (seq_type) is provided as separate arguments with a specified “pairing mode”. This mode determines how the samples for that seq_type are paired. Only the listed seq_type values will be included in the output. The available pairing modes are:

  1. matched_only: Only tumour samples with matched normal samples will be returned. In other words, unpaired tumour or normal samples will be omitted.

    generate_pairs(SAMPLES, genome='matched_only')
    
  2. allow_unmatched: All tumour samples will be returned whether they are paired with a matched normal sample or not. If they are not paired, they will be returned with an unmatched normal sample specified by the user. This mode must be specified alongside the ID for the sample to be paired with unpaired tumours as a tuple. This sample must be present in the samples table.

    generate_pairs(SAMPLES, genome=('allow_unmatched', 'PT003-N'))
    
  3. no_normal: All tumour samples will be returned without a paired normal sample. This is simply a shortcut for filtering for tumour samples, but this ensures that the column names will be consistent with other calls to generate_pairs().

    generate_pairs(SAMPLES, mrna='no_normal')
    
Parameters:
  • samples (pandas.DataFrame) – The sample table. This data frame must include the following columns: sample_id, patient_id, seq_type, and tissue_status (‘normal’ or ‘tumour’/’tumor’). If genome_build is included, no tumour-normal pairs will be made between different genome builds.
  • unmatched_normal_ids (dict, optional) – The mapping from seq_type and genome_build to the unmatched normal sample IDs that should be used for unmatched analyses. The keys must take the form of ‘{seq_type}--{genome_build}’.
  • **seq_types ({'matched_only', 'allow_unmatched', 'no_normal'}) – A mapping between values of seq_type and pairing modes. See above for description of each pairing mode.
Returns:

The tumour-normal pairs (one pair per row), but the normal sample is omitted if the no_normal pairing mode is used. Every column in the input samples data frame will appear twice in the output, once for the tumour sample and once for the normal sample, prefixed by tumour_ and normal_, respectively. An additional column called pair_status will indicate whether the tumour-normal samples in the row are matched or unmatched. If the normal sample is omitted due to the no_normal mode, this column will be set to no_normal.

Return type:

pandas.DataFrame

Examples

Among the samples in the SAMPLES data frame, the genome tumour samples will be paired with a matched normal sample if one exists, or with the given unmatched normal sample (PT003-N) if no matched normal sample is present; the capture tumour samples will only be paired with matched normal samples; and the mrna tumour samples will be returned without matched or unmatched normal samples.

>>> PAIRS = generate_pairs(SAMPLES, genome=('allow_unmatched', 'PT003-N'),
...                        capture='matched_only', mrna='no_normal')
oncopipe.generate_runs(samples, pairing_config=None, unmatched_normal_ids=None, subgroups=('seq_type', 'genome_build', 'patient_id', 'tissue_status'))

Produces a data frame of tumour runs from a data frame of samples.

Here, a ‘tumour run’ can consist of a tumour-only run or a paired run. In the case of a paired run, it can either be with a matched or unmatched normal sample.

Parameters:
  • samples (pandas.DataFrame) – The samples.
  • pairing_config (dict, optional) – Same as generate_runs_for_patient_wrapper(). If left unset (or None is provided), this function will fall back on a default value (see oncopipe.DEFAULT_PAIRING_CONFIG).
  • unmatched_normal_ids (dict, optional) – The mapping from seq_type and genome_build to the unmatched normal sample IDs that should be used for unmatched analyses. The keys must take the form of ‘{seq_type}--{genome_build}’.
  • subgroups (list of str, optional) – Same as group_samples().
Returns:

The generated runs with columns matching the keys of the return value for generate_runs_for_patient().

Return type:

pandas.DataFrame

oncopipe.generate_runs_for_patient(patient_samples, run_paired_tumours, run_unpaired_tumours_with, unmatched_normal=None, unmatched_normals=None, run_paired_tumours_as_unpaired=False, **kwargs)

Generates a run for every tumour with and/or without a paired normal.

Note that ‘unpaired tumours’ in the argument names and documentation refers to tumours without a matched normal sample.

Parameters:
  • patient_samples (dict) – Lists of sample IDs (str) organized by tissue_status (tumour vs normal) for a given patient. The order of the samples in each list is irrelevant.
  • run_paired_tumours (boolean) – Whether to run paired tumours. Setting this to False is useful for naturally unpaired analyses (e.g., for RNA-seq).
  • run_unpaired_tumours_with ({ None, 'no_normal', 'unmatched_normal' }) – What to pair with unpaired tumours. This cannot be set to None if run_paired_tumours_as_unpaired is True. Provide a value for the unmatched_normal argument if this is set to ‘unmatched_normal’.
  • unmatched_normal (namedtuple, optional) – The normal sample to be used with unpaired tumours when run_unpaired_tumours_with is set to ‘unmatched_normal’.
  • unmatched_normals (dict, optional) – The normal samples to be used with unpaired tumours when run_unpaired_tumours_with is set to ‘unmatched_normal’. Unlike unmatched_normal, this parameter expects a mapping from “{seq_type}--{genome_build}” to Sample namedtuples. If this option is provided, it will take precedence over unmatched_normal.
  • run_paired_tumours_as_unpaired (boolean, optional) – Whether paired tumours should also be run as unpaired (i.e., separate from their matched normal sample). This is useful for benchmarking purposes or preventing unwanted paired analyses (e.g., in RNA-seq analyses intended to be tumour-only).
  • **kwargs (key-value pairs) – Any additional unused arguments (e.g., unmatched_normal_id).
Returns:

Lists of sample features prefixed with tumour_ and normal_ for all tumours for the given patient. Depending on the argument values, tumour-normal pairs may not be matching, and normal samples may not be included. The ‘pair_status’ column specifies whether a tumour is paired with a matched normal sample.

Return type:

dict

oncopipe.generate_runs_for_patient_wrapper(patient_samples, pairing_config)

Runs generate_runs_for_patient for the current seq_type/genome_build.

This function is meant as a wrapper for generate_runs_for_patient(), whose parameters depend on the sequencing data type (seq_type) and genome_build of the samples at hand. It assumes that all samples for the given patient share the same seq_type and genome_build.

Parameters:
  • patient_samples (dict) – Same as generate_runs_for_patient().
  • pairing_config (nested dict) –

    The top level is sequencing data types (seq_type; keys) mapped to dictionaries (values) specifying argument values meant for generate_runs_for_patient(). For example:

    {'genome': {'run_unpaired_tumours_with': 'unmatched_normal',
                'unmatched_normal': Sample(…)},
     'mrna': {'run_paired_tumours': False,
              'run_unpaired_tumours_with': 'no_normal'}}
Returns:

Same as generate_runs_for_patient().

Return type:

dict

oncopipe.get_from_dict(dictionary, list_of_keys)

Access nested index/key in dictionary.
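
One common way to express this kind of nested lookup is a reduction over the list of keys. The sketch below uses the hypothetical name get_from_dict_sketch and illustrates the idea only; it is not necessarily how oncopipe implements it.

```python
from functools import reduce
from operator import getitem


def get_from_dict_sketch(dictionary, list_of_keys):
    """Follow the keys (or list indices) one level at a time."""
    return reduce(getitem, list_of_keys, dictionary)


config = {"lcr-modules": {"_shared": {"scratch_directory": "/scratch"}}}
value = get_from_dict_sketch(config, ["lcr-modules", "_shared", "scratch_directory"])
# Integer indices work as well when a level is a list.
second = get_from_dict_sketch({"a": [10, 20]}, ["a", 1])
```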

oncopipe.get_reference(module_config, reference_key)
oncopipe.group_samples(samples, subgroups)

Organizes samples into nested dictionary.

Parameters:
  • samples (pandas.DataFrame) – The samples.
  • subgroups (list of str) – Columns of samples by which to organize the samples. The order determines the nesting order.
Returns:

The number of levels is determined by the list of subgroups. The number of ‘splits’ at each level is based on the number of different values in the samples data frame for that column. The ‘terminal’ values are lists of samples, which are stored as named tuples containing all metadata for that row.

Return type:

nested dict
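
The nesting described above can be sketched with named tuples and plain dictionaries. Here the input is a list of named tuples instead of a pandas.DataFrame, and the name group_samples_sketch, the Sample fields, and the sample IDs are all hypothetical.

```python
from collections import namedtuple

Sample = namedtuple("Sample", ["sample_id", "seq_type", "patient_id", "tissue_status"])


def group_samples_sketch(samples, subgroups):
    """Nest samples by the given columns, in order (illustrative only)."""
    grouped = {}
    for sample in samples:
        level = grouped
        # Descend through every subgroup but the last, creating levels as needed.
        for column in subgroups[:-1]:
            level = level.setdefault(getattr(sample, column), {})
        # The terminal values are lists of samples.
        level.setdefault(getattr(sample, subgroups[-1]), []).append(sample)
    return grouped


samples = [
    Sample("PT001-T", "genome", "PT001", "tumour"),
    Sample("PT001-N", "genome", "PT001", "normal"),
    Sample("PT002-T", "mrna", "PT002", "tumour"),
]
grouped = group_samples_sketch(samples, ["seq_type", "patient_id", "tissue_status"])
```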

oncopipe.keep_samples(samples, **filters)

Convenience wrapper around filter_samples.

oncopipe.list_files(directory, file_ext)

Searches directory for all files with given extension.

The search is performed recursively. The function first tries to use the faster find UNIX tool before falling back on a slower Python implementation.

Parameters:
  • directory (str) – The directory to search in.
  • file_ext (str) – The file extension (excluding the period).
Returns:

The list of matching files.

Return type:

list of str
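
The slower pure-Python fallback mentioned above might look roughly like the sketch below, which walks the directory tree with os.walk. The name list_files_sketch is hypothetical, the find-based fast path is not modelled, and sorting the result is an assumption made for reproducibility.

```python
import os
import tempfile


def list_files_sketch(directory, file_ext):
    """Recursively collect files ending in '.<file_ext>' (illustrative only)."""
    suffix = "." + file_ext
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(suffix):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ("a.bam", "a.bam.bai", os.path.join("nested", "b.bam")):
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(path, tmp) for path in list_files_sketch(tmp, "bam")]
```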

oncopipe.load_samples(file_path, sep='\t', to_lowercase=('tissue_status', ), renamer=None, **maps)

Loads samples metadata with some light processing.

The advantage of using this function over pandas.read_table() directly is that this function processes the data frame as follows:

  1. Can convert columns to lowercase.
  2. Can rename columns using either a renamer function or a set of key-value pairs where the values are the original names and the keys are the desired names.

If a renamer function is provided in addition to a set of key-value pairs, the renamer function will be used first.

Parameters:
  • file_path (str) – The path to the tabular file containing the sample metadata (including any required columns).
  • sep (str, optional) – The column separator.
  • to_lowercase (list of str, optional) – The columns to be converted to lowercase.
  • renamer (function or dict-like, optional) – A function that transforms each column name or a dict-like object that maps the original names (keys) to the desired names (values).
  • **maps (key-value pairs, optional) –

    Pairs that specify the actual names (values) of the expected columns (keys). For example, if you had a ‘sample’ column while lcr-modules expects ‘sample_id’, you can use:

    load_samples(…, sample_id="sample")

Returns:

Return type:

pandas.DataFrame

oncopipe.locate_bam(bam_directory=None, sample_keys=('sample_id', 'tumour_id', 'normal_id'), sample_bams=('sample_bam', 'tumour_bam', 'normal_bam'))

Locates BAM file for a given sample ID in a directory.

This function actually configures another function, which is returned to be used by Snakemake.

Parameters:
  • bam_directory (str, optional) – The directory containing all BAM files. If None is provided, then the default value of ‘data/’ will be used.
  • sample_keys (list of str, optional) – The possible wildcards that contain identifiers for samples with BAM files.
  • sample_bams (list of str, optional) – The respective names for the BAM file located for each sample in sample_keys in the dictionary returned by the input file function. For example, the BAM file for the sample specified in ‘sample_id’ wildcard will be stored under the key ‘sample_bam’ in the returned dictionary.
Returns:

A Snakemake-compatible input file function taking wildcards as its only argument. This function will return a dictionary of BAM files for any wildcards appearing in sample_keys under the corresponding keys specified in sample_bams.

Return type:

function

Creates a relative symlink from any working directory.

Parameters:
  • src (str) – The source file or directory path.
  • dest (str) – The destination file path. This can also be a destination directory, and the destination symlink name will be identical to the source file name (unless directory).
  • in_module (boolean) – If both the src and dest files are within a module results directory, setting this option to True will keep symlinks contained to the module directory. Always set to False for links that point outside the module results directory. Example:

    dest = results/module/99-outputs/sample.vcf
    src = results/module/03-somestep/sample.vcf
    results/module/99-outputs/sample.vcf -> ../03-somestep/sample.vcf
  • overwrite (boolean) – Whether to overwrite the destination file if it exists.
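
The core idea of a relative symlink can be sketched with os.path.relpath, computing the link target relative to the directory that will contain the link. The name relative_symlink_sketch is hypothetical, the in_module behaviour is not modelled, and the error raised when overwrite is False is an assumption.

```python
import os
import tempfile


def relative_symlink_sketch(src, dest, overwrite=True):
    """Create a symlink at dest pointing to src via a relative path."""
    # If dest is a directory, reuse the source file name inside it.
    if os.path.isdir(dest):
        dest = os.path.join(dest, os.path.basename(src))
    if os.path.lexists(dest):
        if not overwrite:
            raise FileExistsError(dest)
        os.remove(dest)
    # The target is expressed relative to the link's parent directory.
    target = os.path.relpath(src, start=os.path.dirname(dest) or ".")
    os.symlink(target, dest)
    return target


with tempfile.TemporaryDirectory() as tmp:
    step_dir = os.path.join(tmp, "results", "module", "03-somestep")
    out_dir = os.path.join(tmp, "results", "module", "99-outputs")
    os.makedirs(step_dir)
    os.makedirs(out_dir)
    src = os.path.join(step_dir, "sample.vcf")
    open(src, "w").close()
    link_target = relative_symlink_sketch(src, out_dir)
    link_path = os.path.join(out_dir, "sample.vcf")
    stored_target = os.readlink(link_path)
    resolves = os.path.samefile(link_path, src)
```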
oncopipe.retry(value, multiplier=1.5, max_value=100000)

Creates callable that increases resource value on retries.

This function is intended for use with resources, especially memory (mem_mb).

Parameters:
  • value (int) – The value that will be multiplied on retries. This value will be used as is in the first try.
  • multiplier (float) – The factor that the value will be multiplied by on retries. This should usually be a number between 1 and 3.
  • max_value (int) – The maximum value that should be returned by this function, even on retries. This is meant to prevent excessively high requests that will never be accommodated by the cluster.
Returns:

The function that can be provided to the resource directive in a snakemake rule.

Return type:

function, which returns integer values
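
A minimal sketch of such a callable is shown below, assuming geometric scaling by attempt number (Snakemake passes attempt=1 on the first try to resource callables that accept it). The name retry_sketch and the exact formula are assumptions, not the confirmed oncopipe implementation.

```python
def retry_sketch(value, multiplier=1.5, max_value=100000):
    """Return a resource callable that grows with each attempt."""

    def resource_fn(wildcards, attempt):
        # On the first try (attempt == 1) the value is used as is.
        scaled = value * multiplier ** (attempt - 1)
        # Never request more than max_value.
        return int(min(scaled, max_value))

    return resource_fn


mem_mb = retry_sketch(4000, multiplier=2, max_value=10000)
attempts = [mem_mb(None, attempt) for attempt in (1, 2, 3)]
print(attempts)
```

In a Snakemake rule, a callable like this would be given under the resources directive (for example, as the value of mem_mb).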

oncopipe.set_input(module, name, value)

Use given value for an input file in a module.

Parameters:
  • module (str) – The module name.
  • name (str) – The name of input file field. This is usually taken from the module’s configuration YAML file.
  • value (str or function) – The value to provide for the named input file. In most cases, this value will be a plain string, but you can also provide an input file function as per the Snakemake documentation, where the function would return strings. In all cases, the strings can make use of the wildcards that are usually listed in the configuration file.
oncopipe.set_samples(module, *samples)

Use given samples for a module.

Parameters:
  • module (str) – The module name. This can also be "_shared" for a value that should be inherited by all modules.
  • *samples (list of pandas.DataFrame) – One or more pandas data frames that will be concatenated before being used by the module. These data frames should contain sample tables as described in the documentation.
oncopipe.set_value(value, *keys)

Update lcr-modules configuration using simpler syntax.

This function will automatically create dictionaries if accessing a key that doesn’t exist and notify the user.

Parameters:
  • value (anything) – The value to be set at the location specified by *keys.
  • *keys (list of str) – All subsequent arguments will be collected into a list of strings, which specify the location where to set value. You do not need to include the "lcr-modules" key; it is assumed that you are accessing keys therein.
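
The key-creation behaviour described above can be sketched as follows. Unlike set_value, this hypothetical set_value_sketch takes the configuration dictionary as an explicit first argument (the real function relies on enable_set_functions), and the module and field names in the usage are made up for illustration.

```python
def set_value_sketch(config, value, *keys):
    """Set value under config['lcr-modules'] at the nested location keys."""
    level = config.setdefault("lcr-modules", {})
    for key in keys[:-1]:
        if key not in level:
            # Notify the user that a missing dictionary is being created.
            print(f"Creating missing key: {key}")
            level[key] = {}
        level = level[key]
    level[keys[-1]] = value
    return config


config = {"lcr-modules": {"_shared": {}}}
set_value_sketch(config, "data/{sample_id}.bam", "example_module", "inputs", "sample_bam")
```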
oncopipe.setup_module(name, version, subdirectories)

Prepares and validates configuration for the given module.

This function performs a number of convenient tasks:

  1. It ensures that the CFG variable doesn’t exist. This is intended as a safeguard since the modules use CFG as a convenient shorthand.
  2. It ensures that Snakemake meets the required version.
  3. It ensures that the required configuration is loaded.
  4. It initializes the module configuration with the _shared configuration, but recursively overwrites values from the module-specific configuration. In other words, the specific overrides the general.
  5. It ensures that the module configuration has the expected fields to avoid errors downstream.
  6. It updates any strings containing placeholders such as {REPODIR}, {MODSDIR}, and {SCRIPTSDIR} with the actual values.
  7. It validates the samples table using all of the schema YAML files in the module’s schemas/ folder.
  8. It configures, numbers, and creates the output and log subdirectories.
  9. It generates a table of runs consisting of tumour-normal pairs in case that’s useful.
  10. It will automatically filter the samples for those whose seq_type appear in pairing_config.
Parameters:
  • name (str) – The name of the module.
  • version (str) – The semantic version of the module.
  • subdirectories (list of str) – The subdirectories of the module output directory where the results will be produced. They will be numbered incrementally and created on disk. This should include ‘inputs’ and ‘outputs’.
Returns:

The module-specific configuration, including any shared configuration from config['lcr-modules']['_shared'].

Return type:

dict
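
Step 4 in the list above (the specific configuration overriding the general one) amounts to a recursive dictionary merge. The sketch below illustrates that idea with the hypothetical name merge_configs_sketch and made-up configuration keys; it is not the oncopipe code.

```python
import copy


def merge_configs_sketch(shared, specific):
    """Start from shared and recursively overwrite with specific."""
    merged = copy.deepcopy(shared)
    for key, value in specific.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge one level deeper.
            merged[key] = merge_configs_sketch(merged[key], value)
        else:
            # Otherwise the module-specific value wins.
            merged[key] = copy.deepcopy(value)
    return merged


shared = {"threads": {"default": 1}, "root_output_dir": "results/"}
specific = {"threads": {"align": 8}, "root_output_dir": "results/example/"}
module_config = merge_configs_sketch(shared, specific)
```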

oncopipe.setup_subdirs(module_config, subdirectories, scratch_subdirs=())

Numbers and creates module output subdirectories.

Parameters:
  • module_config (dict) – The module-specific configuration.
  • subdirectories (list of str) – The names (without numbering) of the output subdirectories.
  • scratch_subdirs (list of str, optional) –

    A subset of subdirectories that should be symlinked into the given scratch directory, specified under:

    config["lcr-modules"]["_shared"]["scratch_directory"]

    This should not include ‘inputs’ and ‘outputs’, which only contain symlinks.

Returns:

The updated module-specific configuration with the paths to the numbered output subdirectories.

Return type:

dict

oncopipe.switch_on_column(column, samples, options, match_on='tumour', format=True, strict=False)

Pick an option based on the value of a column for a sample.

The function finds the relevant row in samples for either the tumour (the default) or normal sample, which is determined by the match_on argument. To find the row, the seq_type and tumour_id (or normal_id) wildcards are required.

The following special keys are available:

_default
If you provide a value under the key ‘_default’ in options, this value will be used if the column value is not among the other keys in options (instead of defaulting to “”).
_prefix, _suffix
If you provide values for the ‘_prefix’ and/or ‘_suffix’ keys in options, these values will be prepended and/or appended, respectively, to the selected value (including ‘_default’) as long as the selected value is a string (not a dictionary).
Parameters:
  • column (str) – The column name whose value determines the option to pick.
  • samples (pandas.DataFrame) – The samples data frame for the current module.
  • options (dict) – The mapping between the possible values in column and the corresponding options to be returned. Special key-value pairs can also be included (see above).
  • match_on ({"tumour", "normal"}) – Whether to match on the sample_id column in samples using wildcard.tumour_id or wildcard.normal_id.
  • format (boolean) – Whether to format the option using the rule variables.
  • strict (boolean) – Whether to include the bare wildcards in formatting. For example, if you have a wildcard called ‘seq_type’, without strict mode, you can access it with {seq_type} or {wildcards.seq_type}, whereas in strict mode, only the latter option is possible. This mode is useful if a wildcard has the same name as a rule variable, namely wildcards, input, output, threads, resources.
Returns:

A Snakemake-compatible input file or parameter function.

Return type:

function

oncopipe.switch_on_wildcard(wildcard, options, format=True, strict=False)

Pick an option based on the value of a wildcard for a run.

The following special keys are available:

_default
If you provide a value under the key ‘_default’ in options, this value will be used if the wildcard value is not among the other keys in options (instead of defaulting to “”).
_prefix, _suffix
If you provide values for the ‘_prefix’ and/or ‘_suffix’ keys in options, these values will be prepended and/or appended, respectively, to the selected value (including ‘_default’) as long as the selected value is a string (not a dictionary).
Parameters:
  • wildcard (str) – The wildcard name whose value determines the option to pick.
  • options (dict) – The mapping between the possible values of wildcard and the corresponding options to be returned. Special key-value pairs can also be included (see above).
  • format (boolean) – Whether to format the option using the rule variables.
  • strict (boolean) – Whether to include the bare wildcards in formatting. For example, if you have a wildcard called ‘seq_type’, without strict mode, you can access it with {seq_type} or {wildcards.seq_type}, whereas in strict mode, only the latter option is possible. This mode is useful if a wildcard has the same name as a rule variable, namely wildcards, input, output, threads, resources.
Returns:

A Snakemake-compatible input file or parameter function.

Return type:

function
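
The special keys shared by switch_on_column and switch_on_wildcard can be illustrated with a small selection helper. The sketch below covers only the option lookup; it omits the formatting with rule variables and the wrapping into a Snakemake-compatible function. The name pick_option_sketch and the option values are hypothetical.

```python
def pick_option_sketch(value, options):
    """Select an option, honouring _default, _prefix and _suffix."""
    # Fall back on '_default', or on "" if no default is given.
    selected = options.get(value, options.get("_default", ""))
    # The prefix and suffix are only applied to string values.
    if isinstance(selected, str):
        selected = options.get("_prefix", "") + selected + options.get("_suffix", "")
    return selected


options = {
    "_prefix": "--regions ",
    "genome": "genome.bed",
    "_default": "capture.bed",
}
for_genome = pick_option_sketch("genome", options)
for_other = pick_option_sketch("capture", options)
no_default = pick_option_sketch("mrna", {"genome": "genome.bed"})
```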

oncopipe.walk_through_dict(dictionary, end_fn, max_depth=None, _trace=None, _result=None, **kwargs)

Runs a function at a given level in a nested dictionary.

If max_depth is unspecified, end_fn() will be run whenever the recursion encounters an object other than a dictionary.

Parameters:
  • dictionary (dict) – The dictionary to be recursively walked through.
  • end_fn (function) – The function to be run once recursion ends, either at max_depth or when a non-dictionary is encountered.
  • max_depth (int, optional) – How far deep the recursion is allowed to go. By default, the recursion is allowed to go as deep as possible (i.e., until it encounters something other than a dictionary).
  • _trace (tuple, optional) – List of dictionary keys used internally to track nested position.
  • _result (dict) – Used internally to pass new dictionaries and avoid changing the input dictionary.
  • **kwargs (key-value pairs) – Argument values that are passed to end_fn().
Returns:

A processed dictionary. The input dictionary remains unchanged.

Return type:

dict
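
The recursion described above can be sketched as follows. The name walk_sketch is hypothetical, and it assumes that end_fn receives only the value found at the end of the recursion (plus any keyword arguments); the real function tracks its position internally via _trace and may pass more to end_fn.

```python
def walk_sketch(dictionary, end_fn, max_depth=None, _depth=0, **kwargs):
    """Apply end_fn at max_depth or at the first non-dictionary value."""
    at_limit = max_depth is not None and _depth >= max_depth
    if not isinstance(dictionary, dict) or at_limit:
        return end_fn(dictionary, **kwargs)
    # Build a new dictionary so the input remains unchanged.
    return {
        key: walk_sketch(value, end_fn, max_depth, _depth + 1, **kwargs)
        for key, value in dictionary.items()
    }


nested = {"genome": {"PT001": [1, 2, 3]}, "mrna": {"PT002": [4]}}
counts = walk_sketch(nested, len)
shallow = walk_sketch(nested, len, max_depth=1)
```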