Utils

Setup

import sctoolbox.utils as utils

creators

Modules for creating files or directories.

sctoolbox.utils.creators.gitlab_download(internal_path: str, file_regex: str, host: str = 'https://gitlab.gwdg.de/', repo: str = 'sc_framework', branch: str = 'main', commit: str | None = None, out_path: str = './', private: bool = False, load_token: str = '/root/.gitlab_token', save_token: str = '/root/.gitlab_token', overwrite: bool = False, max_calls: int = 5, period: int = 60) → None[source]

Download file(s) from gitlab.

Parameters:

internal_path (str) – path to directory in repository
file_regex (str) – regex for target file(s)
host (str, default ‘https://gitlab.gwdg.de/’) – Link to host
repo (str, default 'sc_framework') – Name of the repository
branch (str, default 'main') – What branch to use
commit (Optional[str], default None) – What commit to use, overwrites branch
out_path (str, default './') – Where the file/dir should be downloaded to
private (bool, default False) – Set true if repo is private
load_token (str, default 'pathlib.Path.home() / ".gitlab_token"') – Load token from file. Set to None for new token
save_token (str, default 'pathlib.Path.home() / ".gitlab_token"') – Save token to file
overwrite (bool, default False) – Overwrite file if it exists in the directory
max_calls (int, default 5) – limit file download rate per period
period (int, default 60) – period length in seconds

Raises:

ValueError – If repository is inaccessible.

sctoolbox.utils.creators.setup_experiment(dest: str, dirs: list[str] = ['raw', 'preprocessing', 'Analysis']) → None[source]

Create initial folder structure.

Parameters:

dest (str) – Path to new experiment
dirs (list[str], default ['raw', 'preprocessing', 'Analysis']) – Internal folders to create

Raises:

Exception – If directory exists.
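The folder layout described above can be illustrated with a small standalone sketch. It does not call sctoolbox; `setup_experiment_sketch` and the experiment name are hypothetical, and the sketch returns the created folder names only so the result can be inspected (the documented function returns None).

```python
import tempfile
from pathlib import Path


def setup_experiment_sketch(dest, dirs=("raw", "preprocessing", "Analysis")):
    """Hypothetical stand-in: create `dest` with one subfolder per entry in `dirs`."""
    root = Path(dest)
    if root.exists():
        # Mirrors the documented behaviour of raising when the directory exists.
        raise Exception(f"Directory '{root}' already exists.")
    for name in dirs:
        (root / name).mkdir(parents=True)
    # Returned only for demonstration purposes.
    return sorted(p.name for p in root.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    created = setup_experiment_sketch(Path(tmp) / "my_experiment")
```

Calling the sketch a second time on the same destination would raise, which matches the "Raises" entry above.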

sctoolbox.utils.creators.add_analysis(dest: str, analysis_name: str, method: Literal['rna', 'atac'] = 'rna', dirs: list[str] = ['figures', 'data', 'logs'], starts_with: int = 1, **kwargs: Any) → None[source]

Create and add a new analysis/run.

Note: Only works for notebooks up to number 89. Needs to be adjusted if we exceed 89 notebooks.

Parameters:

dest (str) – Path to experiment.
analysis_name (str) – Name of the new analysis run.
method (Literal["rna", "atac"], default "rna") – Type of notebooks to download.
dirs (list[str], default ['figures', 'data', 'logs']) – Internal folders to create besides ‘notebooks’ directory.
starts_with (int, default 1) – Notebook the analysis will start with.
**kwargs (Any) – Forwarded to gitlab_download.

Raises:

FileNotFoundError – If path to experiment does not exist.
ValueError – If method is invalid.

sctoolbox.utils.creators.build_notebooks_regex(starts_with: int) → str[source]

Build regex for notebooks starting with given number.

Note: Only works up to 89. If we reach notebook 90 this function needs to be adjusted.

Parameters: starts_with (int) – Starting number
Returns: notebook regex
Return type: str
Raises: ValueError – If starts_with is < 1 or > 89.
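One way such a regex could be built is sketched below. This is an assumption, not the sctoolbox implementation: it supposes notebook files start with a zero-padded two-digit number, and `notebooks_regex_sketch` and the file names are hypothetical. It also shows why a digit-range approach like this stops working at 90, since there is no higher tens digit left to range over.

```python
import re


def notebooks_regex_sketch(starts_with: int) -> str:
    """Hypothetical regex for two-digit notebook prefixes >= starts_with."""
    if starts_with < 1 or starts_with > 89:
        raise ValueError("starts_with must be between 1 and 89.")
    tens, ones = divmod(starts_with, 10)
    # First alternative: same tens digit, ones digit at least `ones`.
    # Second alternative: any higher tens digit.
    return rf"{tens}[{ones}-9].*\.ipynb|[{tens + 1}-9][0-9].*\.ipynb"


pattern = notebooks_regex_sketch(5)
names = ["04_qc.ipynb", "05_norm.ipynb", "12_cluster.ipynb"]
selected = [n for n in names if re.fullmatch(pattern, n)]
```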

checker

Module for type checking functions.

sctoolbox.utils.checker.check_module(module: str) → None[source]

Check if <module> can be imported without error.

Parameters: module (str) – Name of the module to check.
Raises: ImportError – If the module is not available for import.

sctoolbox.utils.checker.gunzip_file(f_in: str, f_out: str) → None[source]

Decompress file.

Parameters:

f_in (str) – Path to compressed input file.
f_out (str) – Destination to decompressed output file.
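Decompression of this kind can be sketched with the standard library alone. The helper name `gunzip_sketch` and the file names are hypothetical; the sketch assumes a plain gzip file and is not taken from the sctoolbox source.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_sketch(f_in: str, f_out: str) -> None:
    """Stream the gzip-compressed f_in into the uncompressed f_out."""
    with gzip.open(f_in, "rb") as src, open(f_out, "wb") as dst:
        shutil.copyfileobj(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "example.txt.gz")
    unpacked = os.path.join(tmp, "example.txt")
    # Create a small compressed file to decompress.
    with gzip.open(packed, "wb") as handle:
        handle.write(b"chr1\t100\t200\n")
    gunzip_sketch(packed, unpacked)
    with open(unpacked, "rb") as handle:
        content = handle.read()
```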

sctoolbox.utils.checker.is_str_numeric(ans: str) → bool[source]

Check if string can be converted to number.

Parameters: ans (str) – String to check.
Returns: True if string can be converted to float.
Return type: bool
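Since the return value is described as "True if string can be converted to float", the check can be pictured as a guarded `float()` call. The following is a standalone sketch with a hypothetical name, not the library code.

```python
def is_str_numeric_sketch(ans: str) -> bool:
    """Return True if float(ans) succeeds."""
    try:
        float(ans)
    except ValueError:
        return False
    return True


checks = {s: is_str_numeric_sketch(s) for s in ["3", "-2.5", "1e3", "abc", ""]}
```

Note that under this reading strings such as "nan" or "inf" also count as numeric, because `float()` accepts them.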

sctoolbox.utils.checker.var_index_from(adata: AnnData, from_column: str | None = None) → None[source]

Format adata.var index from specified column or from the index available.

This formats the index of adata.var according to the pattern [“chr”, “start”, “stop”]. The adata is changed inplace.

Parameters:

adata (sc.AnnData) – The anndata object to reformat.
from_column (Optional[str], default None) – Column name in adata.var to be set as index.

sctoolbox.utils.checker.get_index_type(entry: str) → str | None[source]

Check the format of the index by regex.

Parameters: entry (str) – String to identify the format on.
Returns: The index format. Either ‘snapatac’, ‘start_name’ or None for unknown format.
Return type: Optional[str]

sctoolbox.utils.checker.validate_regions(adata: AnnData, coordinate_columns: Iterable[str]) → None[source]

Check if the regions in adata.var are valid.

Parameters:

adata (sc.AnnData) – AnnData object containing the regions to be checked.
coordinate_columns (Iterable[str]) – List of length 3 for column names in adata.var containing chr, start, end coordinates.

Raises:

ValueError – If invalid regions are detected.

sctoolbox.utils.checker.format_adata_var(adata: AnnData, coordinate_columns: Iterable[str] | None = None, columns_added: Iterable[str] = ['chr', 'start', 'end']) → None[source]

Format the index of adata.var and adds peak_chr, peak_start, peak_end columns to adata.var if needed.

If coordinate_columns are given, the function will check if these columns already contain the information needed. If the coordinate_columns are in the correct format, nothing will be done. If the coordinate_columns are invalid (or coordinate_columns is not given) the index is checked for the following format: “*[_:-]start[_:-]stop”

If the index can be formatted, the formatted columns (columns_added) will be added. If the index cannot be formatted, an error will be raised.

NOTE: adata object is changed inplace.

Parameters:

adata (sc.AnnData) – The anndata object containing features to annotate.
coordinate_columns (Optional[Iterable[str]], default None) – List of length 3 for column names in adata.var containing chr, start, end coordinates to check. If None, the index will be formatted.
columns_added (Iterable[str], default ['chr', 'start', 'end']) – List of length 3 for column names in adata.var containing chr, start, end coordinates to add.

Raises:

KeyError – If coordinate_columns are not available.
ValueError – If regions are of incorrect format.

sctoolbox.utils.checker.in_range(value: int | float, limits: tuple[int | float, int | float], include_limits: bool = True) → bool[source]

Check if a value is in a given range.

Parameters:

value (int | float) – Number to check if in range.
limits (Tuple[int | float, int | float]) – Lower and upper limits. E.g. (0, 10)
include_limits (bool, default True) – If True includes limits in accepted range.

Returns:

Returns whether the value is between the set limits.

Return type:

bool

Examples

limit = (0.5, 1)
value = 0.5
print(utils.in_range(value=value, limits=limit, include_limits=True))

True

This will return ‘True’; the value is in between the limits including the limits.

Check if all values of arr are integers.

Parameters: arr (npt.ArrayLike) – Array of values to be checked.
Returns: True if all values are integers, False otherwise.
Return type: bool

sctoolbox.utils.checker.check_columns(df: DataFrame, columns: Iterable[str], error: bool = True, name: str = 'dataframe') → bool | None[source]

Check whether columns are found within a pandas dataframe.

TODO do we need this?

Parameters:

df (pd.DataFrame) – A pandas dataframe to check.
columns (Iterable[str]) – A list of column names to check for within df.
error (bool, default True) – If True, raise an error if not all columns are found. If False, return True or False instead.
name (str, default dataframe) – Dataframe name displayed in the error message.

Returns:

True or False depending on whether all columns are in the dataframe. None if error is set to True.

Return type:

Optional[bool]

Raises:

KeyError – If any of the columns are not in ‘df’ and error is set to True.

sctoolbox.utils.checker.check_file_ending(file: str, pattern: str = 'gtf') → None[source]

Check if a file has a certain file ending.

TODO do we need this?

Parameters:

file (str) – Path to the file.
pattern (str, default 'gtf') – File ending to be checked for. If regex, the regex must match the entire string.

Raises:

ValueError – If file does not have the expected file ending.

sctoolbox.utils.checker.is_regex(regex: str) → bool[source]

Check if a string is a valid regex.

Parameters: regex (str) – String to be checked.
Returns: True if string is a valid regex, False otherwise.
Return type: bool
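A validity check like this is commonly done by trying to compile the pattern. The sketch below assumes that approach; `is_regex_sketch` is a hypothetical name and the library may implement the check differently.

```python
import re


def is_regex_sketch(regex: str) -> bool:
    """Return True if the string compiles as a regular expression."""
    try:
        re.compile(regex)
    except re.error:
        return False
    return True


valid = is_regex_sketch(r".*\.gtf")
invalid = is_regex_sketch("[unclosed")
```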

sctoolbox.utils.checker.check_marker_lists(adata: AnnData, marker_dict: dict[str, list[str]]) → dict[str, list[str]][source]

Remove genes in custom marker genes lists which are not present in dataset.

Parameters:

adata (sc.AnnData) – The anndata object containing features to annotate.
marker_dict (dict[str, list[str]]) – A dictionary containing a list of marker genes as values and corresponding cell types as keys. The marker genes given in the lists need to match the index of adata.var.

Returns:

A dictionary containing a list of marker genes as values and corresponding cell types as keys.

Return type:

dict[str, list[str]]
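The filtering step can be sketched without an AnnData object by using a plain list in place of the adata.var index. All names and gene symbols below are illustrative assumptions. Whether the library keeps cell types whose list becomes empty is not stated above, and this sketch simply keeps them.

```python
def filter_markers_sketch(var_names, marker_dict):
    """Keep only marker genes that occur in var_names."""
    available = set(var_names)
    return {
        cell_type: [gene for gene in genes if gene in available]
        for cell_type, genes in marker_dict.items()
    }


var_names = ["CD3E", "CD19", "MS4A1"]  # stands in for adata.var.index
markers = {"T cell": ["CD3E", "CD8A"], "B cell": ["CD19", "MS4A1"]}
filtered = filter_markers_sketch(var_names, markers)
```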

sctoolbox.utils.checker.check_type(obj: Any, obj_name: str, test_type: Any)[source]

Check type of given object.

Parameters:

obj (Any) – Object for which the type should be checked
obj_name (str) – Object name that would be shown in the error message.
test_type (Any) – Type that obj is tested for.

Raises:

TypeError – If object type does not match test type.

Notes

This function is mostly replaced by beartype. Only used for types not supported by beartype.

bioutils

Bio related utility functions.

sctoolbox.utils.bioutils.pseudobulk_table(adata: AnnData, groupby: str, how: Literal['mean', 'sum'] = 'mean', layer: str | None = None, percentile_range: tuple[int, int] = (0, 100), chunk_size: int = 1000) → DataFrame[source]

Get a pseudobulk table of values per cluster.

Parameters:

adata (sc.AnnData) – Anndata object with counts in .X.
groupby (str) – Column name in adata.obs from which the pseudobulks are created.
how (Literal['mean', 'sum'], default 'mean') – How to calculate the value per group (pseudobulk).
layer (Optional[str], default None) – Name of an anndata layer to use instead of adata.X.
percentile_range (Tuple[int, int], default (0, 100)) – The percentile of cells used to calculate the mean/sum for each feature. Is used to limit the effect of individual cell outliers, e.g. by setting (0, 95) to exclude high values in the calculation.
chunk_size (int, default 1000) – If percentile_range is not default, chunk_size controls the number of features to process at once. This is used to avoid memory issues.

Returns:

DataFrame with aggregated counts (adata.X). With groups as columns and genes as rows.

Return type:

pd.DataFrame

sctoolbox.utils.bioutils.barcode_index(adata: AnnData) → None[source]

Check if the barcode is the index.

Will replace the index with adata.obs[“barcode”] if index does not contain barcodes.

TODO refactor - name could be more descriptive - return adata - inplace parameter - use logger …

Parameters: adata (sc.AnnData) – Anndata to perform check on.

sctoolbox.utils.bioutils.get_organism(ensembl_id: str, host: str = 'http://www.ensembl.org/id/') → str[source]

Get the organism name for the given Ensembl ID.

Parameters:

ensembl_id (str) – Any Ensembl ID. E.g. ENSG00000164690
host (str) – Ensembl server address.

Returns:

Organism assigned to the Ensembl ID

Return type:

str

Raises:

ConnectionError – If there is an unexpected (or no) response from the server.
ValueError – If the returned organism is ambiguous.

sctoolbox.utils.bioutils.gene_id_to_name(ids: list[str], species: str) → DataFrame[source]

Get Ensembl gene names for the given Ensembl gene ids.

Parameters:

ids (list[str]) – List of gene ids. Set to None to return all ids.
species (str) – Species matching the gene ids. Set to None for list of available species.

Returns:

DataFrame with gene ids and matching gene names.

Return type:

pd.DataFrame

Raises:

ValueError – If provided Ensembl IDs or organism is invalid.

sctoolbox.utils.bioutils.convert_id(adata: AnnData, id_col_name: str | None = None, index: bool = False, name_col: str = 'Gene name', species: str = 'auto', inplace: bool = True) → AnnData | None[source]

Add gene names to adata.var.

Parameters:

adata (sc.AnnData) – AnnData with gene ids.
id_col_name (Optional[str], default None) – Name of the column in adata.var that stores the gene ids.
index (boolean, default False) – Use index of adata.var instead of column name specified in id_col_name.
name_col (str, default "Gene name") – Name of the column added to adata.var.
species (str, default "auto") – Species of the dataset. By default, species is inferred based on gene ids.
inplace (bool, default True) – Whether to modify adata inplace.

Returns:

AnnData object with gene names.

Return type:

Optional[sc.AnnData]

Raises:

ValueError – If invalid parameter choice or column name not found in adata.var.

sctoolbox.utils.bioutils.unify_genes_column(adata: AnnData, column: str, unified_column: str = 'unified_names', species: str = 'auto', inplace: bool = True) → AnnData | None[source]

Given an adata.var column with mixed Ensembl IDs and Ensembl names, this function creates a new column where Ensembl IDs are replaced with their respective Ensembl names.

Parameters:

adata (sc.AnnData) – AnnData object
column (str) – Column name in adata.var
unified_column (str, default "unified_names") – Defines the column in which unified gene names are saved. Set same as parameter ‘column’ to overwrite original column.
species (str, default "auto") – Species of the dataset. By default, species is inferred based on gene ids.
inplace (bool, default True) – Whether to modify adata or return a copy.

Returns:

AnnData object with modified gene column.

Return type:

Optional[sc.AnnData]

Raises:

ValueError – If column name is not found in adata.var or no Ensembl IDs in selected column.

decorator

Decorators and related functions.

sctoolbox.utils.decorator.log_anndata(func: Callable) → Callable[source]

Decorate function to log adata inside function call.

Parameters: func (Callable) – Function to decorate.
Returns: Decorated function
Return type: Callable

sctoolbox.utils.decorator.get_parameter_table(adata: AnnData) → DataFrame[source]

Get a table of all function calls with their parameters from the adata.uns[“sctoolbox”] dictionary.

Parameters: adata (sc.AnnData) – Annotated data matrix with logged function calls.
Returns: Table with all function calls and their parameters.
Return type: pd.DataFrame
Raises: ValueError – If no logs are found.

sctoolbox.utils.decorator.debug_func_log(func: Callable) → None[source]

Decorate function to print function call with arguments and keyword arguments.

In progress.

Parameters: func (Callable) – Function to decorate.

multiprocessing

Functions related to multiprocessing.

sctoolbox.utils.multiprocessing.get_pbar(total: int, description: str, **kwargs: Any) → tqdm.tqdm[source]

Get a progress bar depending on whether the user is using a notebook or not.

Parameters:

total (int) – Total number elements to be shown in the progress bar.
description (str) – Description to be shown in the progress bar.
**kwargs (Any) – Keyword arguments to be passed to tqdm.

Returns:

A progress bar object.

Return type:

tqdm.tqdm

sctoolbox.utils.multiprocessing.monitor_jobs(jobs: dict[tuple[int, int | str], Any] | list[Any], description: str = 'Progress') → None[source]

Monitor the status of jobs submitted to a pool.

Parameters:

jobs (dict[Tuple[int, int | str], Any] | list[Any]) – List or dict of job objects, e.g. as returned by pool.map_async().
description (str, default "Progress") – Description to be shown in the progress bar.

jupyter

Jupyter notebook related functions.

sctoolbox.utils.jupyter.clear() → None[source]

Clear stdout of console or jupyter notebook.

https://stackoverflow.com/questions/37071230/clear-overwrite-standard-output-in-python

tables

Table related functions.

sctoolbox.utils.tables.rename_categories(series: Series) → Series[source]

Rename categories in a pandas series to numbers between 1-(number of categories).

Parameters: series (pd.Series) – Pandas Series to rename categories in.
Returns: Series with renamed categories.
Return type: pd.Series

sctoolbox.utils.tables.fill_na(df: DataFrame, inplace: bool = True, replace: dict[str, Any] = {'bool': False, 'category': '', 'float': 0, 'int': 0, 'str': '-'}) → DataFrame | None[source]

Fill all NA values in a pandas DataFrame depending on the column data type.

Parameters:

df (pd.DataFrame) – DataFrame object with NA values over multiple columns
inplace (boolean, default True) – Whether the DataFrame object is modified inplace.
replace (dict[str, Any], default {"bool": False, "str": "-", "float": 0, "int": 0, "category": ""}) – Dict that contains default values to replace NA values depending on data type

Returns:

DataFrame with replaced NA values.

Return type:

Optional[pd.DataFrame]

sctoolbox.utils.tables.write_excel(table_dict: dict[str, Any], filename: str, index: bool = False, **kwargs: Any) → None[source]

Write a dictionary of tables to a single excel file with one table per sheet.

Parameters:

table_dict (dict[str, Any]) – Dictionary of tables in the format {<sheet_name1>: table, <sheet_name2>: table, (…)}.
filename (str) – Path to output file.
index (bool, default False) – Whether to include the index of the tables in file.
**kwargs (Any) – Keyword arguments passed to pandas.DataFrame.to_excel.

Raises:

Exception – If table_dict contains items not of type DataFrame.

sctoolbox.utils.tables.table_zscore(table: DataFrame, how: Literal['row', 'col'] = 'row') → DataFrame[source]

Z-score a table.

Parameters:

table (pd.DataFrame) – Table to z-score.
how ({'row', 'col'}) – Whether to z-score rows or columns.

Returns:

Z-scored table.

Return type:

pd.DataFrame

Raises:

Exception – If how has invalid selection.

adata

anndata.AnnData related functions.

sctoolbox.utils.adata.get_adata_subsets(adata: AnnData, groupby: str) → dict[str, AnnData][source]

Split an anndata object into a dict of sub-anndata objects based on a grouping column.

Parameters:

adata (sc.AnnData) – Anndata object to split.
groupby (str) – Column name in adata.obs to split by.

Returns:

Dictionary of anndata objects in the format {<group1>: anndata, <group2>: anndata, (…)}.

Return type:

dict[str, sc.AnnData]

Raises:

ValueError – If groupby is not found in adata.obs.columns.

sctoolbox.utils.adata.add_expr_to_obs(adata: AnnData, gene: str) → None[source]

Add expression of a gene from adata.X to adata.obs as a new column.

Parameters:

adata (sc.AnnData) – Anndata object to add expression to.
gene (str) – Gene name to add expression of.

Raises:

Exception – If the gene is not found in the adata object.

sctoolbox.utils.adata.shuffle_cells(adata: AnnData, seed: int = 42) → AnnData[source]

Shuffle cells in an adata object to improve plotting.

Otherwise, cells might be hidden due to plotting samples in order, e.g. sample1, sample2, etc.

Parameters:

adata (sc.AnnData) – Anndata object to shuffle cells in.
seed (int, default 42) – Seed for random number generator.

Returns:

Anndata object with shuffled cells.

Return type:

sc.AnnData

sctoolbox.utils.adata.get_minimal_adata(adata: AnnData) → AnnData[source]

Return a minimal copy of an anndata object e.g. for estimating UMAP in parallel.

Parameters: adata (sc.AnnData) – Annotated data matrix.
Returns: Minimal copy of anndata object.
Return type: sc.AnnData

sctoolbox.utils.adata.load_h5ad(path: str) → AnnData[source]

Load an anndata object from .h5ad file.

Parameters: path (str) – Name of the file to load the anndata object. NOTE: Uses the internal ‘sctoolbox.settings.adata_input_dir’ + ‘sctoolbox.settings.adata_input_prefix’ as prefix.
Returns: Loaded anndata object.
Return type: sc.AnnData

sctoolbox.utils.adata.save_h5ad(adata: AnnData, path: str) → None[source]

Save an anndata object to an .h5ad file.

Parameters:

adata (sc.AnnData) – Anndata object to save.
path (str) – Name of the file to save the anndata object. NOTE: Uses the internal ‘sctoolbox.settings.adata_output_dir’ + ‘sctoolbox.settings.adata_output_prefix’ as prefix.

sctoolbox.utils.adata.add_uns_info(adata: AnnData, key: str | list[str], value: Any, how: str = 'overwrite') → None[source]

Add information to adata.uns[‘sctoolbox’].

This is used for logging the parameters and options of different steps in the analysis.

Parameters:

adata (sc.AnnData) – An AnnData object.
key (str | list[str]) – The key to add to adata.uns[‘sctoolbox’]. If the key is a list, it represents a path within a nested dictionary.
value (Any) – The value to add to adata.uns[‘sctoolbox’].
how (str, default "overwrite") – When set to “overwrite”, the provided key will be overwritten. If “append”, the value is added to the existing list or dict.

Raises:

ValueError – If value can not be appended.
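The key-path and overwrite/append semantics described above can be sketched on a plain dictionary standing in for adata.uns. This is an illustration under assumptions: the helper name, the example keys, and the behaviour when appending to a missing key are hypothetical, and the real function may differ in detail.

```python
def add_info_sketch(store: dict, key, value, how: str = "overwrite") -> None:
    """Write `value` under store['sctoolbox'], following `key` as a nested path."""
    if how not in ("overwrite", "append"):
        raise ValueError("how must be 'overwrite' or 'append'.")
    path = [key] if isinstance(key, str) else list(key)
    node = store.setdefault("sctoolbox", {})
    for part in path[:-1]:
        # Walk (and create) the intermediate dictionaries of the path.
        node = node.setdefault(part, {})
    last = path[-1]
    if how == "overwrite" or last not in node:
        node[last] = value
        return
    current = node[last]
    if isinstance(current, list):
        current.append(value)
    elif isinstance(current, dict) and isinstance(value, dict):
        current.update(value)
    else:
        raise ValueError(f"Cannot append to existing value under '{last}'.")


uns = {}  # stands in for adata.uns
add_info_sketch(uns, ["qc", "thresholds"], [200])
add_info_sketch(uns, ["qc", "thresholds"], 5000, how="append")
add_info_sketch(uns, "note", "first")
add_info_sketch(uns, "note", "second")  # default "overwrite" replaces "first"
```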

sctoolbox.utils.adata.get_cell_values(adata: AnnData, element: str) → ndarray[source]

Get the values of a given element in adata.obs or adata.var per cell in adata. Can for example be used to extract gene expression values.

Parameters:

adata (anndata.AnnData) – Anndata object.
element (str) – The element to extract from adata.obs or adata.var, e.g. a column in adata.obs or an index in adata.var.

Returns:

Array of values per cell in adata.

Return type:

np.ndarray

Raises:

ValueError – If element is not found in adata.obs or adata.var.

Prepare the given adata for cellxgene deployment.

Parameters:

adata (sc.AnnData) – Anndata object.
keep_obs (Optional[list[str]], default None) – adata.obs columns that should be kept. None to keep all.
keep_var (Optional[list[str]], default None) – adata.var columns that should be kept. None to keep all.
rename_obs (Optional[dict[str, str]], default None) – Dictionary of .obs columns to rename. Key is the old name, value the new one.
rename_var (Optional[dict[str, str]], default None) – Dictionary of .var columns to rename. Key is the old name, value the new one.
embedding_names (Optional[list[str]], default ["pca", "umap", "tsne"]) – List of embeddings to check for. Will raise an error if none of the embeddings are found. Set None to disable check. Embeddings are stored in adata.obsm.
cmap (Optional[str], default None) – Color map to use for continuous variables. Use this replacement color map for broken color maps. If None will use scanpy default, which uses mpl.rcParams[“image.cmap”]. See sc.pl.embedding.
palette (Optional[str | Sequence[str]], default None) – Color map to use for categorical annotation groups. Use this replacement color map for broken color maps. If None will use scanpy default, which uses mpl.rcParams[“axes.prop_cycle”]. See sc.pl.embedding.
inplace (bool, default False)

Raises:

ValueError – If none of the named embeddings is found in the adata.

Returns:

Returns the deployment ready Anndata object.

Return type:

Optional[sc.AnnData]

assemblers

Module to assemble anndata objects.

sctoolbox.utils.assemblers.prepare_atac_anndata(adata: AnnData, set_index: bool = True, index_from: str | None = None, coordinate_cols: list[str] | None = None, h5ad_path: str | None = None) → AnnData[source]

Prepare AnnData object of ATAC-seq data to be in the correct format for the subsequent pipeline.

This includes formatting the index, formatting the coordinate columns, and setting the barcode as the index.

Parameters:

adata (sc.AnnData) – The AnnData object to be prepared.
set_index (bool, default True) – If True, index will be formatted and can be set by a given column.
index_from (Optional[str], default None) – Column to build the index from.
coordinate_cols (Optional[list[str]], default None) – Location information of the peaks.
h5ad_path (Optional[str], default None) – Path to the h5ad file.

Returns:

The prepared AnnData object.

Return type:

sc.AnnData

sctoolbox.utils.assemblers.from_single_starsolo(path: str, dtype: Literal['filtered', 'raw'] = 'filtered', header: int | list[int] | Literal['infer'] | None = 'infer') → AnnData[source]

Assembles an anndata object from the starsolo folder.

Parameters:

path (str) – Path to the “solo” folder from starsolo.
dtype (Literal['filtered', 'raw'], default "filtered") – The type of solo data to choose.
header (Union[int, list[int], Literal['infer'], None], default "infer") – Set header parameter for reading metadata tables using pandas.read_csv.

Returns:

An anndata object based on the provided starsolo folder.

Return type:

sc.AnnData

Raises:

FileNotFoundError – If path does not exist or files are missing.

sctoolbox.utils.assemblers.from_quant(path: str, configuration: list = [], use_samples: list | None = None, dtype: Literal['raw', 'filtered'] = 'filtered') → AnnData[source]

Assemble an adata object from data in the ‘quant’ folder of the snakemake pipeline.

Parameters:

path (str) – The directory where the quant folder from snakemake preprocessing is located.
configuration (list) – Configurations to set up the samples for anndata assembling. It must contain the sample, the word used in snakemake to assign the condition, and the condition, e.g., sample1:condition:room_air
use_samples (Optional[list], default None) – List of samples to use. If None, all samples will be used.
dtype (Literal["raw", "filtered"], default 'filtered') – The type of Solo data to choose.

Returns:

The assembled anndata object.

Return type:

sc.AnnData

Raises:

ValueError – If use_samples contains names that do not exist.

sctoolbox.utils.assemblers.from_single_mtx(mtx: str, barcodes: str, genes: str, transpose: bool = True, header: int | list[int] | Literal['infer'] | None = 'infer', barcode_index: int = 0, genes_index: int = 0, delimiter: str = '\t', **kwargs: Any) → AnnData[source]

Build an adata object from single mtx and two tsv/csv files.

Parameters:

mtx (str) – Path to the mtx file (.mtx)
barcodes (str) – Path to cell label file (.obs)
genes (str) – Path to gene label file (.var)
transpose (bool, default True) – Set True to transpose mtx matrix.
header (Union[int, list[int], Literal['infer'], None], default 'infer') – Set header parameter for reading metadata tables using pandas.read_csv.
barcode_index (int, default 0) – Column which contains the cell barcodes.
genes_index (int, default 0) – Column which contains the gene IDs.
delimiter (str, default '\t') – Delimiter of genes and barcodes table.
**kwargs (Any) – Contains additional arguments for scanpy.read_mtx method

Returns:

Anndata object containing the mtx matrix, gene and cell labels

Return type:

sc.AnnData

Raises:

ValueError – If barcode or gene files contain duplicates.

sctoolbox.utils.assemblers.from_mtx(path: str, mtx: str = '*_matrix.mtx*', barcodes: str = '*_barcodes.tsv*', genes: str = '*_genes.tsv*', **kwargs: Any) → AnnData[source]

Build an adata object from list of mtx, barcodes and genes files.

Parameters:

path (str) – Path to data files
mtx (str, default '*_matrix.mtx*') – String for glob to find matrix files.
barcodes (str, default '*_barcodes.tsv*') – String for glob to find barcode files.
genes (str, default '*_genes.tsv*') – String for glob to find gene label files.
**kwargs (Any) – Contains additional arguments for scanpy.read_mtx method

Returns:

Merged anndata object containing the mtx matrix, gene and cell labels

Return type:

sc.AnnData

Raises:

ValueError – If files are not found.

sctoolbox.utils.assemblers.convertToAdata(file: str, output: str | None = None, r_home: str | None = None, layer: str | None = None) → AnnData | None[source]

Convert .rds files containing Seurat or SingleCellExperiment to scanpy anndata.

Requires an R installation with Seurat and SingleCellExperiment.

Parameters:

file (str) – Path to the .rds or .robj file.
output (Optional[str], default None) – Path to output .h5ad file. Won’t save if None.
r_home (Optional[str], default None) – Path to the R home directory. If None, will construct path based on location of python executable. E.g. for “.conda/scanpy/bin/python” will look at “.conda/scanpy/lib/R”
layer (Optional[str], default None) – Provide name of layer to be stored in anndata. By default the main layer is stored. In case of multiome data multiple layers are present e.g. RNA and ATAC. But anndata can only store a single layer.

Returns:

Returns converted anndata object if output is None.

Return type:

Optional[sc.AnnData]

general

General utility functions.

sctoolbox.utils.general.get_user() → str[source]

Get the name of the current user.

Returns: The name of the current user.
Return type: str

sctoolbox.utils.general.get_datetime() → str[source]

Get a string with the current date and time for logging.

Returns: A string with the current date and time in the format dd/mm/YY H:M:S
Return type: str

sctoolbox.utils.general.get_package_versions() → dict[str, str][source]

Receive a dictionary of currently installed python packages and versions.

Returns: A dict in the form: {“package1”: “1.2.1”, “package2”: “4.0.1”, (…)}
Return type: dict[str, str]

sctoolbox.utils.general.get_binary_path(tool: str) → str[source]

Get path to a binary commandline tool.

Looks either in the local dir, on path or in the dir of the executing python binary.

Parameters: tool (str) – Name of the commandline tool to be found.
Returns: Full path to the tool.
Return type: str
Raises: ValueError – If executable is not found.

sctoolbox.utils.general.run_cmd(cmd: str) → None[source]

Run a commandline command.

Parameters: cmd (str) – Command to be run.
Raises: subprocess.CalledProcessError – If command has an error.

sctoolbox.utils.general.setup_R(r_home: str | None = None) → None[source]

Add R installation for rpy2 use.

Parameters: r_home (Optional[str], default None) – Path to the R home directory. If None, will construct path based on location of python executable. E.g. for “.conda/scanpy/bin/python” will look at “.conda/scanpy/lib/R”
Raises: Exception – If path to R is invalid.

sctoolbox.utils.general.split_list(lst: Sequence[Any], n: int) → list[Sequence[Any]][source]

Split list into n chunks.

Parameters:

lst (Sequence[Any]) – Sequence to be chunked
n (int) – Number of chunks.

Returns:

List of Sequences (chunks).

Return type:

list[Sequence[Any]]
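The description does not say how elements are distributed across the n chunks. The sketch below assumes contiguous chunks of near-equal size, which is one common choice; the library may distribute elements differently (for example by striding), and `split_list_sketch` is a hypothetical name.

```python
def split_list_sketch(lst, n):
    """Split lst into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(lst), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lst[start:end])
        start = end
    return chunks


n_chunks = split_list_sketch(list(range(7)), 3)
```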

sctoolbox.utils.general.split_list_size(lst: list[Any], max_size: int) → list[list[Any]][source]

Split list into chunks of max_size.

Parameters:

lst (list[Any]) – List to be chunked
max_size (int) – Max size of chunks.

Returns:

List of lists (chunks).

Return type:

list[list[Any]]
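Chunking by maximum size is typically a slicing loop. A minimal standalone sketch follows, with a hypothetical helper name; it is not the library code.

```python
def split_list_size_sketch(lst, max_size):
    """Split lst into consecutive chunks holding at most max_size elements."""
    return [lst[i:i + max_size] for i in range(0, len(lst), max_size)]


sized_chunks = split_list_size_sketch(["a", "b", "c", "d", "e"], 2)
```

Only the last chunk can be shorter than max_size.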

sctoolbox.utils.general.write_list_file(lst: list[Any], path: str) → None[source]

Write a list to a file with one element per line.

Parameters:

lst (list[Any]) – A list of values/strings to write to file
path (str) – Path to output file.

sctoolbox.utils.general.read_list_file(path: str) → list[str][source]

Read a list from a file with one element per line.

Parameters: path (str) – Path to read file from.
Returns: List of strings read from file.
Return type: list[str]
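The one-element-per-line format used by write_list_file and read_list_file can be sketched as a round trip. The helper names and file name are hypothetical. The sketch also illustrates a consequence of the list[str] return type: values that were not strings when written come back as strings.

```python
import os
import tempfile


def write_list_sketch(lst, path):
    """Write each element of lst on its own line."""
    with open(path, "w") as handle:
        for item in lst:
            handle.write(f"{item}\n")


def read_list_sketch(path):
    """Read the file back, one list element per line."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]


with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "genes.txt")
    write_list_sketch(["CD3E", 42, "MS4A1"], list_path)
    restored = read_list_sketch(list_path)
```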

sctoolbox.utils.general.clean_flanking_strings(list_of_strings: list[str]) → list[str][source]

Remove common suffix and prefix from a list of strings.

E.g. running the function on [‘path/a.txt’, ‘path/b.txt’, ‘path/c.txt’] would yield [‘a’, ‘b’, ‘c’].

Parameters: list_of_strings (list[str]) – List of strings.
Returns: List of strings without common suffix and prefix
Return type: list[str]

sctoolbox.utils.general.longest_common_suffix(list_of_strings: list[str]) → str[source]

Find the longest common suffix of a list of strings.

Parameters: list_of_strings (list[str]) – List of strings.
Returns: Longest common suffix of the list of strings.
Return type: str
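The two functions above fit together naturally: a common suffix is a common prefix of the reversed strings. The sketch below reproduces the 'path/a.txt' example from clean_flanking_strings using `os.path.commonprefix`, which compares character by character. The helper names are hypothetical and the library may be implemented differently.

```python
import os


def longest_common_suffix_sketch(strings):
    """Common suffix = common prefix of the reversed strings, reversed back."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def clean_flanking_sketch(strings):
    """Strip the shared prefix and the shared suffix from every string."""
    prefix = os.path.commonprefix(strings)
    suffix = longest_common_suffix_sketch(strings)
    return [s[len(prefix):len(s) - len(suffix)] for s in strings]


paths = ["path/a.txt", "path/b.txt", "path/c.txt"]
common_suffix = longest_common_suffix_sketch(paths)
cleaned = clean_flanking_sketch(paths)
```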

sctoolbox.utils.general.remove_prefix(s: str, prefix: str) → str[source]

Remove prefix from a string.

Parameters:

s (str) – String to be processed.
prefix (str) – Prefix to be removed.

Returns:

String without prefix.

Return type:

str

sctoolbox.utils.general.remove_suffix(s: str, suffix: str) → str[source]

Remove suffix from a string.

Parameters:

s (str) – String to be processed.
suffix (str) – Suffix to be removed.

Returns:

String without suffix.

Return type:

str

sctoolbox.utils.general.sanitize_string(s: str, char_list: list[str], replace: str = '_') → str[source]

Replace every occurrence of given substrings.

Parameters:

s (str) – String to sanitize
char_list (list[str]) – Strings that should be replaced.
replace (str, default "_") – Replacement of substrings.

Returns:

Sanitized string.

Return type:

str
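Replacing several substrings with one replacement can be sketched as repeated `str.replace` calls. The helper name and the example string are hypothetical.

```python
def sanitize_sketch(s, char_list, replace="_"):
    """Replace every occurrence of each substring in char_list."""
    for chars in char_list:
        s = s.replace(chars, replace)
    return s


clean_name = sanitize_sketch("sample 1/run-2", [" ", "/", "-"])
```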

sctoolbox.utils.general.identify_columns(df: DataFrame, regex: list[str] | str) → list[str][source]

Get columns from pd.DataFrame that match the given regex.

Parameters:

df (pd.DataFrame) – Pandas dataframe to be checked.
regex (Union[list[str], str]) – List of multiple regex or one regex as string.

Returns:

List of column names that match one of the provided regex.

Return type:

list[str]
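Column selection by regex can be sketched on a plain list of column names, without pandas. The sketch assumes `re.search` semantics (a match anywhere in the name unless the pattern is anchored); the text above does not say which matching mode the library uses, and the helper and column names are hypothetical.

```python
import re


def identify_columns_sketch(columns, regex):
    """Return the column names that match at least one of the patterns."""
    patterns = [regex] if isinstance(regex, str) else list(regex)
    return [c for c in columns if any(re.search(p, c) for p in patterns)]


columns = ["n_genes", "pct_counts_mt", "total_counts", "sample"]
matched = identify_columns_sketch(columns, ["^pct_", "counts$"])
```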

Small utility to scale values in array to a given range.

Parameters:

array (npt.ArrayLike) – Array to scale.
mini (int | float) – Minimum value of the scale.
maxi (int | float) – Maximum value of the scale.

Returns:

Scaled array values.

Return type:

np.ndarray

io

File input/output utilities.

sctoolbox.utils.io.create_dir(path: str) → None[source]

Create a directory if it does not exist yet.

‘path’ can be either a direct path of the directory, or a path to a file for which the upper directory should be created.

Parameters: path (str) – Path to the directory to be created.

sctoolbox.utils.io.get_temporary_filename(tempdir: str = '.') → str[source]

Get a writeable temporary filename by creating a temporary file and closing it again.

Parameters: tempdir (str, default ".") – The path where the temp file will be created.
Returns: Name of the temporary file.
Return type: str
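Creating a temporary file and closing it again, as described, can be sketched with the standard `tempfile` module. The helper name is hypothetical. In this sketch the empty file stays on disk after closing; whether the library leaves it in place is not stated above.

```python
import os
import tempfile


def temporary_filename_sketch(tempdir="."):
    """Create a temporary file in tempdir, close it, and return its name."""
    with tempfile.NamedTemporaryFile(dir=tempdir, delete=False) as handle:
        return handle.name


with tempfile.TemporaryDirectory() as tmp:
    temp_name = temporary_filename_sketch(tmp)
    existed = os.path.exists(temp_name)
```

Creating the file up front is what guarantees the location is writeable.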

sctoolbox.utils.io.remove_files(file_list: list[str]) → None[source]

Delete all files in a file list. Prints a warning if deletion was not possible.

Parameters: file_list (list[str]) – List of files to delete.

Deprecated since version 0.4b: This will be removed in 0.6. Use rm_tmp() with rm_dir=False.

sctoolbox.utils.io.rm_tmp(temp_dir: str | None = None, temp_files: list[str] | None = None, rm_dir: bool = False, all: bool = False) → None[source]

Delete given directory.

Removes all given temp_files from the directory. If temp_files is None and all is True, all files are removed.

Parameters:

temp_dir (Optional[str], default None) – Path to the temporary directory.
temp_files (Optional[list[str]], default None) – Paths to files to be deleted before removing the temp directory.
rm_dir (bool, default False) – If True, the temp directory is removed.
all (bool, default False) – If True, all files in the temp directory are removed.