Tools

Preprocessing

qc_filter

Tools for quality control.

sctoolbox.tools.qc_filter.calculate_qc_metrics(adata: AnnData, percent_top: list[int] | None = None, inplace: bool = False, **kwargs: Any) AnnData | None[source]

Calculate the qc metrics using scanpy.pp.calculate_qc_metrics.

Parameters:
  • adata (sc.AnnData) – Anndata object the quality metrics are added to.

  • percent_top (Optional[list[int]], default None) – Which proportions of top genes to cover.

  • inplace (bool, default False) – If the anndata object should be modified in place.

  • **kwargs (Any) – Additional parameters forwarded to scanpy.pp.calculate_qc_metrics.

Returns:

Returns anndata object with added quality metrics to .obs and .var. Returns None if inplace=True.

Return type:

Optional[sc.AnnData]

See also

scanpy.pp.calculate_qc_metrics

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html

Examples

import scanpy as sc
import sctoolbox as sct

adata = sc.datasets.pbmc3k()
print("Columns in .obs before 'calculate_qc_metrics':", adata.obs.columns.tolist())
sct.tools.calculate_qc_metrics(adata, inplace=True)
print("Columns in .obs after 'calculate_qc_metrics':", adata.obs.columns.tolist())
Columns in .obs before 'calculate_qc_metrics': []
Columns in .obs after 'calculate_qc_metrics': ['n_genes', 'log1p_n_genes', 'total_counts', 'log1p_total_counts']
sctoolbox.tools.qc_filter.predict_cell_cycle(adata: AnnData, species: str | None, s_genes: str | list[str] | None = None, g2m_genes: str | list[str] | None = None, inplace: bool = True) AnnData | None[source]

Assign a score and a phase to each cell depending on the expression of cell cycle genes.

Parameters:
  • adata (sc.AnnData) – Anndata object to predict cell cycle on.

  • species (Optional[str]) – The species of the data. Available species are: human, mouse, rat and zebrafish. If both s_genes and g2m_genes are given, species is ignored and can be set to None.

  • s_genes (Optional[str | list[str]], default None) – If no species is given or desired species is not supported, you can provide a list of genes for the S-phase or a txt file containing one gene in each row. If only s_genes is provided and species is a supported input, the default g2m_genes list will be used, otherwise the function will not run.

  • g2m_genes (Optional[str | list[str]], default None) – If no species is given or desired species is not supported, you can provide a list of genes for the G2M-phase or a txt file containing one gene per row. If only g2m_genes is provided and species is a supported input, the default s_genes list will be used, otherwise the function will not run.

  • inplace (bool, default True) – if True, add new columns to the original anndata object.

Returns:

If inplace is False, return a copy of anndata object with the new column in the obs table.

Return type:

Optional[sc.AnnData]

Raises:

ValueError: – 1. If s_genes or g2m_genes is not None and not of type list. 2. If no cell cycle genes are available for the given species. 3. If the given species is not supported and s_genes or g2m_genes are not given.
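
Example

A minimal usage sketch, assuming human data with gene symbols in adata.var.index; the log-normalization steps reflect a common workflow and are an assumption, not a requirement stated above.

import scanpy as sc
from sctoolbox.tools import qc_filter

adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# uses the built-in human S-phase and G2M-phase gene lists
qc_filter.predict_cell_cycle(adata, species="human", inplace=True)
print(adata.obs.columns.tolist())  # inspect the added score/phase columns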

sctoolbox.tools.qc_filter.estimate_doublets(adata: AnnData, threshold: float = 0.25, inplace: bool = True, plot: bool = True, groupby: str | None = None, threads: int = 4, fill_na: bool = True, **kwargs: Any) AnnData | None[source]

Estimate doublet cells using scrublet.

Adds additional columns “doublet_score” and “predicted_doublet” in adata.obs, as well as a “scrublet” key in adata.uns.

Parameters:
  • adata (sc.AnnData) – Anndata object to estimate doublets for.

  • threshold (float, default 0.25) – Threshold for doublet detection.

  • inplace (bool, default True) – Whether to estimate doublets inplace or not.

  • plot (bool, default True) – Whether to plot the doublet score distribution.

  • groupby (Optional[str], default None) – Key in adata.obs to use for batching during doublet estimation. If threads > 1, the batches are distributed across threads and run in parallel; otherwise each batch is run separately.

  • threads (int, default 4) – Number of threads to use.

  • fill_na (bool, default True) – If True, replaces NA values returned by scrublet with 0 and False. Scrublet returns NA if it cannot calculate a doublet score. Keep in mind that an NA value does not mean the cell is not a doublet; setting this parameter to True assumes that it is not a doublet.

  • **kwargs (Any) – Additional arguments are passed to scanpy.external.pp.scrublet.

Notes

Groupby should be set if the adata consists of multiple samples, as this improves the doublet estimation.

Returns:

If inplace is False, the function returns a copy of the adata object. If inplace is True, the function returns None.

Return type:

Optional[sc.AnnData]
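
Example

A minimal usage sketch on a single-sample dataset; scrublet needs to be installed, and the printed columns are the ones documented above.

import scanpy as sc
from sctoolbox.tools import qc_filter

adata = sc.datasets.pbmc3k()
# single sample, so groupby is omitted
qc_filter.estimate_doublets(adata, threshold=0.25, plot=False, inplace=True)
print(adata.obs[["doublet_score", "predicted_doublet"]].head())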

sctoolbox.tools.qc_filter.predict_sex(adata: AnnData, groupby: str, gene: str = 'Xist', gene_column: str | None = None, threshold: float = 0.3, plot: bool = True, save: str | None = None, **kwargs: Any) None[source]

Predict sex based on expression of Xist (or another gene).

Parameters:
  • adata (sc.AnnData) – An anndata object to predict sex for.

  • groupby (str) – Column in adata.obs to group by.

  • gene (str, default "Xist") – Name of a female-specific gene to use for estimating Male/Female split.

  • gene_column (Optional[str], default None) – Name of the column in adata.var that contains the gene names. If not provided, adata.var.index is used.

  • threshold (float, default 0.3) – Threshold for the minimum fraction of cells expressing the gene for the group to be considered “Female”.

  • plot (bool, default True) – Whether to plot the distribution of gene expression per group.

  • save (Optional[str], default None) – If provided, the plot will be saved to this path.

  • **kwargs (Any) – Additional arguments are passed to scanpy.pl.violin.

Notes

adata.X will be converted to numpy.ndarray if it is of type numpy.matrix.

Return type:

None
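
Example

A minimal usage sketch; the grouping column and the upper-case human gene symbol "XIST" are assumptions made for illustration.

import scanpy as sc
from sctoolbox.tools import qc_filter

adata = sc.datasets.pbmc3k()
adata.obs["sample"] = "sample1"  # hypothetical grouping column
# human gene symbols are upper case, so "XIST" is used instead of the default "Xist"
qc_filter.predict_sex(adata, groupby="sample", gene="XIST", plot=False)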

sctoolbox.tools.qc_filter.automatic_thresholds(adata: AnnData, which: Literal['obs', 'var'] = 'obs', groupby: str | None = None, columns: list[str] | None = None) dict[str, dict[str, float | dict[str, float]]][source]

Get automatic thresholds for multiple data columns in adata.obs or adata.var.

Parameters:
  • adata (sc.AnnData) – Anndata object to find thresholds for.

  • which (Literal["obs", "var"], default "obs") – Which data to find thresholds for. Either “obs” or “var”.

  • groupby (Optional[str], default None) – Group rows by the column given in ‘groupby’ to find thresholds independently per group

  • columns (Optional[list[str]], default None) – Columns to calculate automatic thresholds for. If None, will take all numeric columns.

Returns:

A dict containing thresholds for each data column, either grouped by groupby or directly containing “min” and “max” per column.

Return type:

dict[str, dict[str, Union[float, dict[str, float]]]]

Raises:

ValueError: – If which is not set to ‘obs’ or ‘var’

sctoolbox.tools.qc_filter.thresholds_as_table(threshold_dict: dict[str, dict[str, float | int | dict[str, int | float]]]) DataFrame[source]

Show the threshold dictionary as a table.

Parameters:

threshold_dict (dict[str, dict[str, float | int | dict[str, int | float]]]) – Dictionary with thresholds.

Return type:

pd.DataFrame
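
Example

A minimal usage sketch combining automatic_thresholds with thresholds_as_table on the QC columns added by calculate_qc_metrics.

import scanpy as sc
from sctoolbox.tools import qc_filter

adata = sc.datasets.pbmc3k()
qc_filter.calculate_qc_metrics(adata, inplace=True)

# estimate "min"/"max" thresholds for all numeric .obs columns
thresholds = qc_filter.automatic_thresholds(adata, which="obs")
print(qc_filter.thresholds_as_table(thresholds))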

sctoolbox.tools.qc_filter.validate_threshold_dict(table: DataFrame, thresholds: dict[str, dict[str, int | float] | dict[str, dict[str, int | float]]], groupby: str | None = None) None[source]

Validate threshold dictionary.

Thresholds can be in the format:

thresholds = {"chrM_percent": {"min": 0, "max": 10},
              "total_reads": {"min": 1000}}

Or per group in ‘groupby’:

thresholds = {"chrM_percent": {
                        "Sample1": {"min": 0, "max": 10},
                        "Sample2": {"max": 5}
                        },
              "total_reads": {"min": 1000}}
Parameters:
  • table (pd.DataFrame) – Table to validate thresholds for.

  • thresholds (dict[str, dict[str, int | float] | dict[str, dict[str, int | float]]]) – Dictionary of thresholds to validate.

  • groupby (Optional[str], default None) – Column for grouping thresholds.

Raises:

ValueError – If the threshold dict is not valid.

sctoolbox.tools.qc_filter.get_thresholds_wrapper(adata: AnnData, manual_thresholds: dict, only_automatic_thresholds: bool = True, groupby: str | None = None) dict[str, dict[str, float | int | dict[str, float | int]]][source]

Get the thresholds for the filtering.

Parameters:
  • adata (sc.AnnData) – Anndata object to find QC thresholds for.

  • manual_thresholds (dict[str, dict[str, Union[float, dict[str, float]]]]) – Dictionary containing manually set thresholds

  • only_automatic_thresholds (bool, default True) – If True, only set automatic thresholds.

  • groupby (Optional[str], default None) – Group cells by column in adata.obs.

Returns:

Dictionary containing the thresholds

Return type:

dict[str, dict[str, Union[float | int, dict[str, float | int]]]]

sctoolbox.tools.qc_filter.get_keys(adata: AnnData, manual_thresholds: dict[str, Any]) dict[str, dict[str, float | int | dict[str, float | int]]][source]

Get threshold dictionary with keys that overlap with adata.obs.columns.

Parameters:
  • adata (sc.AnnData) – Anndata object.

  • manual_thresholds (dict[str, Any]) – Dictionary with adata.obs columns as keys.

Returns:

Dictionary restricted to the keys that overlap with adata.obs.columns.

Return type:

dict[str, dict[str, Union[float | int, dict[str, float | int]]]]

sctoolbox.tools.qc_filter.get_mean_thresholds(thresholds: dict[str, Any]) dict[str, Any][source]

Convert grouped thresholds to global thresholds by taking the mean across groups.

sctoolbox.tools.qc_filter.apply_qc_thresholds(adata: AnnData, thresholds: dict[str, Any], which: Literal['obs', 'var'] = 'obs', groupby: str | None = None, inplace: bool = True) AnnData | None[source]

Apply QC thresholds to anndata object.

Parameters:
  • adata (sc.AnnData) – Anndata object to filter.

  • thresholds (dict[str, Any]) – Dictionary of thresholds to apply.

  • which (Literal["obs", "var"], default 'obs') – Which table to filter on. Must be one of “obs” / “var”.

  • groupby (Optional[str], default None) – Column in table to group by.

  • inplace (bool, default True) – Change adata inplace or return a changed copy.

Returns:

Anndata object with QC thresholds applied.

Return type:

Optional[sc.AnnData]

Raises:

ValueError: – 1. If the keys in thresholds do not match the columns in adata.[which]. 2. If grouped thresholds are not found, e.g. do not contain min and max values. 3. If thresholds do not contain min and max values.
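
Example

A minimal usage sketch; "total_counts" is one of the columns added by calculate_qc_metrics (see above).

import scanpy as sc
from sctoolbox.tools import qc_filter

adata = sc.datasets.pbmc3k()
qc_filter.calculate_qc_metrics(adata, inplace=True)

thresholds = qc_filter.automatic_thresholds(adata, columns=["total_counts"])
print("Cells before filtering:", adata.shape[0])
qc_filter.apply_qc_thresholds(adata, thresholds, which="obs", inplace=True)
print("Cells after filtering:", adata.shape[0])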

sctoolbox.tools.qc_filter.filter_cells(adata: AnnData, cells: str | list[str], remove_bool: bool = True, inplace: bool = True) AnnData | None[source]

Remove cells from anndata object.

Parameters:
  • adata (sc.AnnData) – Anndata object to filter.

  • cells (str | list[str]) – A column in .obs containing boolean indicators or a list of cells to remove from object .obs table.

  • remove_bool (bool, default True) – Is used if cells is a column in the .obs table. If True, remove cells that are True. If False, remove cells that are False.

  • inplace (bool, default True) – If True, filter inplace. If False, return filtered adata object.

Returns:

If inplace is False, returns the filtered Anndata object. If inplace is True, returns None.

Return type:

Optional[sc.AnnData]
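
Example

A minimal usage sketch that removes the cells flagged by estimate_doublets via the boolean "predicted_doublet" column.

import scanpy as sc
from sctoolbox.tools import qc_filter

adata = sc.datasets.pbmc3k()
qc_filter.estimate_doublets(adata, plot=False, inplace=True)

# remove all cells where "predicted_doublet" is True
qc_filter.filter_cells(adata, cells="predicted_doublet", remove_bool=True, inplace=True)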

sctoolbox.tools.qc_filter.filter_genes(adata: AnnData, genes: str | list[str], remove_bool: bool = True, inplace: bool = True) AnnData | None[source]

Remove genes from adata object.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix object to filter

  • genes (str | list[str]) – A column containing boolean indicators or a list of genes to remove from object .var table.

  • remove_bool (bool, default True) – Is used if genes is a column in .var table. If True, remove genes that are True. If False, remove genes that are False.

  • inplace (bool, default True) – If True, filter inplace. If False, return filtered adata object.

Returns:

If inplace is False, returns the filtered Anndata object. If inplace is True, returns None.

Return type:

Optional[sc.AnnData]


highly_variable

Tools to calculate and annotate highly variable genes.

sctoolbox.tools.highly_variable.annot_HVG(anndata: AnnData, min_mean: float = 0.0125, max_iterations: int = 10, hvg_range: tuple[int, int] = (1000, 5000), step: int | float = 10, inplace: bool = True, save: str | None = None, **kwargs: Any) AnnData | None[source]

Annotate highly variable genes (HVG). Tries to annotate within the given range of HVGs by gradually increasing or decreasing min_mean of scanpy.pp.highly_variable_genes.

Parameters:
  • anndata (sc.AnnData) – Anndata object to annotate.

  • min_mean (float, default 0.0125) – Starting min_mean parameter for finding HVGs.

  • max_iterations (int, default 10) – Maximum number of min_mean adjustments.

  • hvg_range (Tuple[int, int], default (1000, 5000)) – Number of HVGs should be in the given range. Will issue a warning if result is not in range. Default limits are chosen as proposed by https://doi.org/10.15252/msb.20188746.

  • step (int | float, default 10) – Value min_mean is adjusted by in each iteration. Will divide min_mean (below range) or multiply it (above range) by this value.

  • inplace (bool, default True) – Whether the anndata object is modified inplace.

  • save (Optional[str], default None) – Path to save the plot to. If None, the plot is not saved.

  • **kwargs (Any) – Additional arguments forwarded to scanpy.pp.highly_variable_genes().

Notes

Logarithmized data is expected.

Returns:

Adds annotation of HVG to the anndata object. Information is added to adata.var[“highly_variable”]. If inplace is True, the function returns None, else it returns a changed copy of the input anndata object.

Return type:

Optional[sc.AnnData]
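
Example

A minimal usage sketch; the normalization steps are an assumption, added because logarithmized data is expected.

import scanpy as sc
from sctoolbox.tools import highly_variable

adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

highly_variable.annot_HVG(adata, hvg_range=(1000, 5000), inplace=True)
print("Number of HVGs:", adata.var["highly_variable"].sum())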

sctoolbox.tools.highly_variable.get_variable_features(adata: AnnData, max_cells: int | float | None = None, min_cells: int | float | None = 0, show: bool = True, inplace: bool = True) AnnData | None[source]

Get the highly variable features of anndata object. Adds the column “highly_variable” to adata.var. If show is True, the plot is shown.

Parameters:
  • adata (sc.AnnData) – The anndata object containing counts for variables.

  • max_cells (Optional[float | int], default None) – The maximum variability score to set as threshold. Defaults to knee estimated threshold.

  • min_cells (Optional[float | int], default 0) – The minimum variability score to set as threshold.

  • show (bool, default True) – Show plot of variability scores and thresholds.

  • inplace (bool, default True) – If True, the anndata object is modified. Otherwise, a new anndata object is returned.

Notes

Designed for scATAC-seq data

Raises:

KeyError – If adata.var[‘n_cells_by_counts’] is not available.

Returns:

If inplace is True, the function returns None. If inplace is False, the function returns a modified copy of the anndata object.

Return type:

Optional[sc.AnnData]


dim_reduction

Tools for dimensionality reduction with PCA/SVD.

sctoolbox.tools.dim_reduction.compute_PCA(anndata: AnnData, use_highly_variable: bool = True, inplace: bool = False, **kwargs: Any) AnnData | None[source]

Compute a principal component analysis.

Parameters:
  • anndata (sc.AnnData) – Anndata object to add the PCA to.

  • use_highly_variable (bool, default True) – If true, use highly variable genes to compute PCA.

  • inplace (bool, default False) – Whether the anndata object is modified inplace.

  • **kwargs (Any) – Additional parameters forwarded to scanpy.pp.pca().

Returns:

Returns anndata object with PCA components. Or None if inplace = True.

Return type:

Optional[sc.AnnData]
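
Example

A minimal usage sketch, assuming an object that already contains a "highly_variable" column in .var (as the pbmc68k_reduced example dataset does).

import scanpy as sc
from sctoolbox.tools import dim_reduction

adata = sc.datasets.pbmc68k_reduced()  # preprocessed dataset with highly variable genes annotated
dim_reduction.compute_PCA(adata, use_highly_variable=True, inplace=True)
print(adata.obsm["X_pca"].shape)  # scanpy.pp.pca stores the coordinates in .obsm["X_pca"]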

sctoolbox.tools.dim_reduction.lsi(data: AnnData, scale_embeddings: bool = True, n_comps: int = 50, use_highly_variable: bool = False) None[source]

Run Latent Semantic Indexing for dimensionality reduction.

Values represent the similarity of cells in the original space. doi: 10.3389/fphys.2013.00008

Parameters:
  • data (sc.AnnData) – AnnData object with peak counts.

  • scale_embeddings (bool, default True) – Scale embeddings to zero mean and unit variance.

  • n_comps (int, default 50) – Number of components to calculate with SVD.

  • use_highly_variable (bool, default False) – If true, use highly variable genes to compute LSI.

Raises:

ValueError: – If highly variable genes are used and not found in adata.var[‘highly_variable’].

Notes

Function is taken from the muon package.

sctoolbox.tools.dim_reduction.apply_svd(adata: AnnData, layer: str | None = None) AnnData[source]

Singular value decomposition of anndata object.

Parameters:
  • adata (sc.AnnData) – The anndata object to be decomposed.

  • layer (Optional[str], default None) – The layer to be decomposed. If None, the layer is set to “X”.

Returns:

The decomposed anndata object containing .obsm, .varm and .uns information.

Return type:

sc.AnnData

sctoolbox.tools.dim_reduction.define_PC(anndata: AnnData) int[source]

Define threshold for most variable PCA components.

Note: Function expects PCA to be computed beforehand.

Parameters:

anndata (sc.AnnData) – Anndata object with PCA to get significant PCs threshold from.

Returns:

An int representing the number of PCs until elbow, defining PCs with significant variance.

Return type:

int

Raises:

ValueError: – If PCA is not found in anndata.

Deprecated since version 0.5: This will be removed in 0.7. Use the ‘sctoolbox.tools.propose_pcs’ function instead.

sctoolbox.tools.dim_reduction.propose_pcs(anndata: AnnData, how: list[Literal['variance', 'cumulative variance', 'correlation']] = ['variance', 'correlation'], var_method: Literal['knee', 'percent'] = 'percent', perc_thresh: int | float = 30, corr_thresh: float = 0.3, corr_kwargs: dict | None = {}) list[int][source]

Propose a selection of PCs that can be used for further analysis.

Note: Function expects PCA to be computed beforehand.

Parameters:
  • anndata (sc.AnnData) – Anndata object with PCA to get PCs from.

  • how (List[Literal["variance", "cumulative variance", "correlation"]], default ["variance", "correlation"]) – Values to use for PC proposal. Will independently apply filters to all selected methods and return the union of PCs.

  • var_method (Literal["knee", "percent"], default "percent") – Either define a threshold based on a knee algorithm or use the percentile.

  • perc_thresh (Union[int, float], default 30) – Percentile threshold of the PCs that should be included. Only for var_method=”percent” and expects a value from 0-100.

  • corr_thresh (float, default 0.3) – Filter PCs with a correlation greater than the given value.

  • corr_kwargs (Optional[dict], default {}) – Parameters forwarded to sctoolbox.tools.correlation_matrix.

Returns:

List of PCs proposed for further use.

Return type:

List[int]

Raises:

ValueError: – If PCA is not found in anndata.

sctoolbox.tools.dim_reduction.subset_PCA(adata: AnnData, n_pcs: int | None = None, start: int = 0, select: list[int] | None = None, inplace: bool = True) AnnData | None[source]

Subset the PCA coordinates in adata.obsm[“X_pca”] to the given number of pcs.

Additionally, subset the PCs in adata.varm[“PCs”] and the variance ratio in adata.uns[“pca”][“variance_ratio”].

Parameters:
  • adata (sc.AnnData) – Anndata object containing the PCA coordinates.

  • n_pcs (Optional[int], default None) – Number of PCs to keep.

  • start (int, default 0) – Index (0-based) of the first PC to keep. E.g. if start = 1 and n_pcs = 10, you will exclude the first PC to keep 9 PCs.

  • select (Optional[List[int]], default None) – Provide a list of PC numbers to keep. E.g. [2, 3, 5] will select the second, third and fifth PC. Will overwrite the n_pcs and start parameter.

  • inplace (bool, default True) – Whether to work inplace on the anndata object.

Returns:

Anndata object with the subsetted PCA coordinates. Or None if inplace = True.

Return type:

Optional[sc.AnnData]
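
Example

A minimal usage sketch on a dataset that ships with a precomputed PCA in .obsm["X_pca"].

import scanpy as sc
from sctoolbox.tools import dim_reduction

adata = sc.datasets.pbmc68k_reduced()
print(adata.obsm["X_pca"].shape)

# keep only the first 10 PCs (also subsets .varm["PCs"] and the variance ratio)
dim_reduction.subset_PCA(adata, n_pcs=10, inplace=True)
print(adata.obsm["X_pca"].shape)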


norm_correct

Normalization and correction tools.

sctoolbox.tools.norm_correct.atac_norm(*args: Any, **kwargs: Any)[source]

Normalize ATAC data - deprecated functionality. Use normalize_adata instead.

sctoolbox.tools.norm_correct.normalize_adata(adata: AnnData, method: str | list[str], exclude_highly_expressed: bool = True, use_highly_variable: bool = False, target_sum: int | None = None) dict[str, AnnData][source]

Normalize the count matrix and calculate dimension reduction using different methods.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix.

  • method (str | list[str]) –

    Normalization method. Either ‘total’ and/or ‘tfidf’.

    • ‘total’: Performs normalization for total counts, log1p and PCA.

    • ‘tfidf’: Performs TF-IDF normalization and LSI (corresponds to PCA). This method is often used for scATAC-seq data.

  • exclude_highly_expressed (bool, default True) – Parameter for sc.pp.normalize_total. Decision to exclude highly expressed genes (HEG) from total normalization.

  • use_highly_variable (bool, default False) – Parameter for sc.pp.pca and lsi. Decision to use highly variable genes for PCA/LSI.

  • target_sum (Optional[int], default None) – Parameter for sc.pp.normalize_total. Decide the target sum of each cell after normalization.

Returns:

Dictionary containing method name as key, and anndata as values. Each anndata is the annotated data matrix with normalized count matrix and PCA/LSI calculated.

Return type:

dict[str, sc.AnnData]

Raises:

ValueError – If method is not valid. Needs to be either ‘total’ or ‘tfidf’.
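
Example

A minimal usage sketch running total-count normalization and retrieving the result from the returned dictionary.

import scanpy as sc
from sctoolbox.tools import norm_correct

adata = sc.datasets.pbmc3k()
adatas = norm_correct.normalize_adata(adata, method="total")
adata_total = adatas["total"]  # normalized counts with PCA calculated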

sctoolbox.tools.norm_correct.tfidf(data: AnnData, log_tf: bool = True, log_idf: bool = True, log_tfidf: bool = False, scale_factor: int = 10000) None[source]

Transform peak counts with TF-IDF (Term Frequency - Inverse Document Frequency).

TF: peak counts are normalised by total number of counts per cell. DF: total number of counts for each peak. IDF: number of cells divided by DF. By default, log(TF) * log(IDF) is returned.

Parameters:
  • data (sc.AnnData) – AnnData object with peak counts.

  • log_tf (bool, default True) – Log-transform TF term if True.

  • log_idf (bool, default True) – Log-transform IDF term if True.

  • log_tfidf (bool, default False) – Log-transform the TF*IDF term if True. Can only be used when log_tf and log_idf are False.

  • scale_factor (int, default 1e4) – Scale factor to multiply the TF-IDF matrix by.

Notes

Function is from the muon package.

Raises:

AttributeError: – log(TF*IDF) requires log(TF) and log(IDF) to be False.

sctoolbox.tools.norm_correct.tfidf_normalization(matrix: spmatrix, tf_type: Literal['raw', 'term_frequency', 'log'] = 'term_frequency', idf_type: Literal['unary', 'inverse_freq', 'inverse_freq_smooth'] = 'inverse_freq') csr_matrix[source]

Perform TF-IDF normalization on a sparse matrix.

The different variants of the term frequency and inverse document frequency are obtained from https://en.wikipedia.org/wiki/Tf-idf.

Parameters:
  • matrix (scipy.sparse matrix) – The matrix to be normalized.

  • tf_type (Literal["term_frequency", "log", "raw"], default "term_frequency") – The type of term frequency to use. Can be either “raw”, “term_frequency” or “log”.

  • idf_type (Literal["inverse_freq", "unary", "inverse_freq_smooth"], default "inverse_freq") – The type of inverse document frequency to use. Can be either “unary”, “inverse_freq” or “inverse_freq_smooth”.

Returns:

tfidf normalized sparse matrix.

Return type:

sparse.csr_matrix

Notes

This function requires a lot of memory. Another option is to use the ac.pp.tfidf of the muon package.

sctoolbox.tools.norm_correct.wrap_corrections(adata: AnnData, batch_key: str, methods: Literal['bbknn', 'combat', 'mnn', 'harmony', 'scanorama'] | list[Literal['bbknn', 'combat', 'mnn', 'harmony', 'scanorama']] | Callable = ['bbknn', 'mnn'], method_kwargs: dict = {}) dict[str, AnnData][source]

Calculate multiple batch corrections for adata using the ‘batch_correction’ function.

Parameters:
  • adata (sc.AnnData) – An annotated data matrix object to apply corrections to.

  • batch_key (str) – The column in adata.obs containing batch information.

  • methods (list[batch_methods] | Callable | batch_methods, default ["bbknn", "mnn"]) –

    The method(s) to use for batch correction. Options are: bbknn, mnn, harmony, scanorama, combat. Alternatively, provide a custom batch correction function. See batch_correction(method) for more information.

  • method_kwargs (dict, default {}) – Dict with methods as keys. Values are dicts of additional parameters forwarded to method. See batch_correction(**kwargs).

Returns:

Dictionary of batch corrected anndata objects, where the key is the correction method and the value is the corrected anndata.

Return type:

dict[str, sc.AnnData]

Raises:

ValueError – If not all methods in methods are valid.
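
Example

A minimal usage sketch; the "batch" column is hypothetical, and the chosen methods require their respective packages (e.g. harmonypy for harmony) to be installed.

import scanpy as sc
from sctoolbox.tools import norm_correct

adata = sc.datasets.pbmc68k_reduced()
adata.obs["batch"] = (["a", "b"] * adata.n_obs)[:adata.n_obs]  # hypothetical batch column

corrected = norm_correct.wrap_corrections(adata, batch_key="batch", methods=["harmony", "combat"])
print(list(corrected.keys()))  # one corrected anndata per method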

sctoolbox.tools.norm_correct.batch_correction(adata: AnnData, batch_key: str, method: Literal['bbknn', 'combat', 'mnn', 'harmony', 'scanorama'] | list[Literal['bbknn', 'combat', 'mnn', 'harmony', 'scanorama']] | Callable = ['bbknn', 'mnn'], highly_variable: bool = True, **kwargs: Any) AnnData[source]

Perform batch correction on the adata object using the ‘method’ given.

Parameters:
  • adata (sc.AnnData) – An annotated data matrix object to apply corrections to.

  • batch_key (str) – The column in adata.obs containing batch information.

  • method (str or function) –

    Either one of the predefined methods or a custom function for batch correction. Note: The custom function is expected to accept an anndata object as the first parameter and return the batch corrected anndata.

    Available methods:
    • bbknn

    • mnn

    • harmony

    • scanorama

    • combat

  • highly_variable (bool, default True) – Only for method ‘mnn’. If True, only the highly variable genes (column ‘highly_variable’ in .var) will be used for batch correction.

  • **kwargs (Any) – Additional arguments will be forwarded to the method function.

Returns:

A copy of the anndata with applied batch correction.

Return type:

sc.AnnData

Raises:
  • ValueError:

    1. If batch_key column is not in adata.obs 2. If batch correction method is invalid.

  • KeyError: – If PCA has not been calculated before running bbknn.

sctoolbox.tools.norm_correct.evaluate_batch_effect(adata: AnnData, batch_key: str, obsm_key: str = 'X_umap', col_name: str = 'LISI_score', max_dims: int = 5, perplexity: int = 30, inplace: bool = False) AnnData | None[source]

Evaluate batch effect methods using LISI.

Parameters:
  • adata (sc.AnnData) – Anndata object with PCA and umap/tsne for batch evaluation.

  • batch_key (str) – The column in adata.obs containing batch information.

  • obsm_key (str, default 'X_umap') – The column in adata.obsm containing coordinates.

  • col_name (str, default 'LISI_score') – Column name for storing the LISI score in .obs.

  • max_dims (int, default 5) – Maximum number of dimensions of adata.obsm matrix to use for LISI (to speed up computation).

  • perplexity (int, default 30) – Perplexity for the LISI score calculation.

  • inplace (bool, default False) – Whether to work inplace on the anndata object.

Returns:

If inplace is True, LISI_score is added to adata.obs in place (returns None); otherwise a copy of the adata is returned.

Return type:

Optional[sc.AnnData]

Notes

  • The LISI score is calculated for each cell and lies between 1 and n for data with n categorical variables.

  • It indicates the effective number of different categories represented in the local neighborhood of each cell.

  • If the cells are well-mixed, the LISI score is expected to be close to n for data with n batches.

  • The higher the LISI score, the better the batch correction method worked to normalize the batch effect and mix cells from different batches.

  • For further information on LISI: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9

Raises:

KeyError:

  1. If obsm_key is not in adata.obsm. 2. If batch_key is not a column in adata.obs.
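
Example

A minimal usage sketch; the "batch" column is hypothetical, and the example dataset already provides UMAP coordinates in .obsm["X_umap"].

import scanpy as sc
from sctoolbox.tools import norm_correct

adata = sc.datasets.pbmc68k_reduced()
adata.obs["batch"] = (["a", "b"] * adata.n_obs)[:adata.n_obs]  # hypothetical batch column

# adds the "LISI_score" column to .obs based on the UMAP coordinates
norm_correct.evaluate_batch_effect(adata, batch_key="batch", obsm_key="X_umap", inplace=True)
print(adata.obs["LISI_score"].describe())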

sctoolbox.tools.norm_correct.wrap_batch_evaluation(adatas: dict[str, AnnData], batch_key: str, obsm_keys: str | list[str] = ['X_pca', 'X_umap'], threads: int = 1, max_dims: int = 5, inplace: bool = False) dict[str, AnnData] | None[source]

Evaluate batch correction methods for a dict of anndata objects (using LISI score calculation).

Parameters:
  • adatas (dict[str, sc.AnnData]) – Dict containing an anndata object for each batch correction method as values. Keys are the name of the respective method. E.g.: {“bbknn”: anndata}

  • batch_key (str) – The column in adata.obs containing batch information.

  • obsm_keys (str | list[str], default ['X_pca', 'X_umap']) – Key(s) to coordinates on which the score is calculated.

  • threads (int, default 1) – Number of threads to use for parallelization.

  • max_dims (int, default 5) – Maximum number of dimensions of adata.obsm matrix to use for LISI (to speed up computation).

  • inplace (bool, default False) – Whether to work inplace on the anndata dict.

Returns:

Dict containing an anndata object for each batch correction method as values, with LISI scores added to .obs.

Return type:

Optional[dict[str, sc.AnnData]]

Analysis

embedding

Embedding tools.

sctoolbox.tools.embedding.wrap_umap(adatas: Iterable[AnnData], threads: int = 4, **kwargs: Any) None[source]

Compute umap for a list of adatas in parallel.

Parameters:
  • adatas (Iterable[sc.AnnData]) – List of anndata objects to compute umap on.

  • threads (int, default 4) – Number of threads to use.

  • **kwargs (Any) – Additional arguments to be passed to sc.tl.umap.
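
Example

A minimal usage sketch; both objects already contain the neighbors graph required by sc.tl.umap.

import scanpy as sc
from sctoolbox.tools import embedding

adata1 = sc.datasets.pbmc68k_reduced()
adata2 = adata1.copy()

# compute UMAP for both objects in parallel
embedding.wrap_umap([adata1, adata2], threads=2)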

sctoolbox.tools.embedding.correlation_matrix(adata: AnnData, which: Literal['obs', 'var'] = 'obs', basis: str = 'pca', n_components: int | None = None, columns: list[str] | None = None, method: Literal['spearmanr', 'pearsonr'] = 'spearmanr') tuple[DataFrame, DataFrame][source]

Compute a matrix of correlation values between an embedding and given columns.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix object.

  • which (Literal["obs", "var"], default "obs") – Whether to use the observations (“obs”) or variables (“var”) for the correlation.

  • basis (str, default "pca") – Dimensionality reduction to calculate correlation with. Must be a key in adata.obsm, or a basis available as “X_<basis>” such as “umap”, “tsne” or “pca”.

  • n_components (int, default None) – Number of components to use for the correlation.

  • columns (Optional[list[str]], default None) – List of columns to use for the correlation. If None, all numeric columns are used.

  • method (Literal["spearmanr", "pearsonr"], default "spearmanr") – Method to use for correlation. Must be either “pearsonr” or “spearmanr”.

Returns:

correlation coefficient, p-values

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Raises:
  • ValueError – If “basis” is not found in data, if “which” is not “obs” or “var”, if “method” is not “pearsonr” or “spearmanr”, or if “which” is “var” and “basis” not “pca”.

  • KeyError – If any of the given columns is not found in the respective table.


clustering

Module for cell clustering.

sctoolbox.tools.clustering.recluster(adata: AnnData, column: str, clusters: str | list[str], task: Literal['join', 'split'] = 'join', method: Literal['leiden', 'louvain'] = 'leiden', resolution: float | int = 1, key_added: str | None = None, plot: bool = True, embedding: str = 'X_umap') None[source]

Recluster an anndata object based on an existing clustering column in .obs.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix.

  • column (str) – Column in adata.obs to use for re-clustering.

  • clusters (str | list[str]) – Clusters in column to re-cluster.

  • task (Literal["join", "split"], default "join") –

    Task to perform. Options are:

    • “join”: Join the clusters given in clusters into one cluster.

    • “split”: Merge the clusters given in clusters and re-cluster them using method and resolution.

  • method (Literal["leiden", "louvain"], default "leiden") – Clustering method to use. Must be one of “leiden” or “louvain”.

  • resolution (float, default 1) – Resolution parameter for clustering.

  • key_added (Optional[str], default None) – Name of the new column in adata.obs. If None, the column name is set to <column>_recluster.

  • plot (bool, default True) – If a plot should be generated of the re-clustering.

  • embedding (str, default 'X_umap') – Select which embedding should be used.

Raises:
  • KeyError: – 1. If the given embedding is not in the data. 2. If the given column is not found in adata.obs.

  • ValueError: – If a given cluster is not found in the adata.

Example

import scanpy as sc
import sctoolbox

adata = sc.datasets.pbmc68k_reduced()
sctoolbox.tools.recluster(adata, column="louvain", clusters=["1", "5"], task="join", method="leiden", resolution=1.5, plot=True)
../_images/tools-1.png
import scanpy as sc
import sctoolbox

adata = sc.datasets.pbmc68k_reduced()
sctoolbox.tools.recluster(adata, column="louvain", clusters=["2", "6"], task="split", method="leiden", resolution=1.5, plot=True)
../_images/tools-2.png
sctoolbox.tools.clustering.gini(x: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) float[source]

Calculate the Gini coefficient of a numpy array.

Parameters:

x (npt.ArrayLike) – Array to calculate Gini coefficient for.

Returns:

Gini coefficient.

Return type:

float

sctoolbox.tools.clustering.calc_ragi(adata: AnnData, condition_column: str = 'clustering', binary_layer: str | None = None) tuple[AnnData, float64][source]

Calculate the RAGI score over all clusters in the adata.

The RAGI score is a measure of how well a cluster is defined by a set of genes. The score is the mean of the Gini coefficients of the gene enrichments across the clusters. The function uses binary sparse matrices ONLY. If the data is not binary, use sctoolbox.utils.binarize. Binary layers can be selected using the binary_layer parameter. The adata.var table also needs the total counts for each gene.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix.

  • condition_column (str) – Column in adata.obs to use for clustering.

  • binary_layer (Optional[str], default None) – Layer in adata.layers to use for calculating gene enrichment. If None, the raw layer is used.

Returns:

Annotated data matrix with the Gini coefficients score in adata.var and RAGI score.

Return type:

Tuple[sc.AnnData, np.float64]


marker_genes

Tools for marker gene analysis.

sctoolbox.tools.marker_genes.get_chromosome_genes(gtf: str, chromosomes: str | list[str]) list[str][source]

Get a list of all genes in the gtf for certain chromosome(s).

Parameters:
  • gtf (str) – Path to the gtf file.

  • chromosomes (str | list[str]) – A chromosome or a list of chromosome names to search for genes in.

Returns:

A list of all genes in the gtf for the given chromosome(s).

Return type:

list[str]

Notes

This function is not directly used by the framework, but is used to create the marker gene lists for ‘label_genes’.

Raises:

ValueError: – If not all given chromosomes are found in the GTF-file.

sctoolbox.tools.marker_genes.label_genes(adata: AnnData, species: str, gene_column: str | None = None) list[str][source]

Label genes as ribosomal, mitochondrial and gender genes.

Gene labels are added inplace.

Parameters:
  • adata (sc.AnnData) – The anndata object.

  • species (str) – Name of the species.

  • gene_column (Optional[str], default None) – Name of the column in adata.var that contains the gene names. Uses adata.var.index as default.

Returns:

List containing the column names added to adata.var.

Return type:

list[str]

See also

sctoolbox.tools.qc_filter.predict_cell_cycle

for cell cycle prediction.

sctoolbox.tools.marker_genes.add_gene_expression(adata: AnnData, gene: str) None[source]

Add values of gene/feature per cell to the adata.obs dataframe.

Parameters:
  • adata (sc.AnnData) – Anndata object containing gene expression/counts.

  • gene (str) – Name of the gene/feature from the adata.var index to be added to adata.obs.

Raises:

ValueError: – If gene is not in adata.var.index.

sctoolbox.tools.marker_genes.run_rank_genes(adata: AnnData, groupby: str, min_in_group_fraction: float = 0.25, min_fold_change: float = 0.5, max_out_group_fraction: float = 0.8, **kwargs: Any) None[source]

Run scanpy rank_genes_groups and filter_rank_genes_groups.

Parameters:
  • adata (sc.AnnData) – Anndata object containing gene expression/counts.

  • groupby (str) – Column by which the cells in adata should be grouped.

  • min_in_group_fraction (float, default 0.25) – Minimum fraction of cells in a group that must express a gene to be considered as a marker gene. Parameter forwarded to scanpy.tl.filter_rank_genes_groups.

  • min_fold_change (float, default 0.5) – Minimum foldchange (+/-) to be considered as a marker gene. Parameter forwarded to scanpy.tl.filter_rank_genes_groups.

  • max_out_group_fraction (float, default 0.8) – Maximum fraction of cells in other groups that must express a gene to be considered as a marker gene. Parameter forwarded to scanpy.tl.filter_rank_genes_groups.

  • **kwargs (Any) – Additional arguments forwarded to scanpy.tl.rank_genes_groups.

Raises:

ValueError: – If number of groups defined by the groupby parameter is < 2.

sctoolbox.tools.marker_genes.pairwise_rank_genes(adata: AnnData, groupby: str, foldchange_threshold: int | float = 1, min_in_group_fraction: float = 0.25, max_out_group_fraction: float = 0.5, **kwargs: Any) DataFrame[source]

Rank genes pairwise between groups in ‘groupby’.

Parameters:
  • adata (sc.AnnData) – Anndata object containing expression data.

  • groupby (str) – Key in adata.obs containing groups to be compared.

  • foldchange_threshold (int | float, default 1) – Minimum foldchange (+/-) to be considered as a marker gene.

  • min_in_group_fraction (float, default 0.25) – Minimum fraction of cells in a group that must express a gene to be considered as a marker gene.

  • max_out_group_fraction (float, default 0.5) – Maximum fraction of cells in other groups that must express a gene to be considered as a marker gene.

  • **kwargs (Any) – Additional arguments to be passed to scanpy.tl.rank_genes_groups.

Returns:

Dataframe containing the pairwise ranked genes between the groups.

Return type:

pd.DataFrame

sctoolbox.tools.marker_genes.get_rank_genes_tables(adata: AnnData, key: str = 'rank_genes_groups', n_genes: int | None = 200, out_group_fractions: bool = False, var_columns: list[str] = [], save_excel: str | None = None) dict[str, DataFrame][source]

Get gene tables containing “rank_genes_groups” genes and information per group (from previously chosen groupby).

Parameters:
  • adata (sc.AnnData) – Anndata object containing ranked genes.

  • key (str, default "rank_genes_groups") – The key in adata.uns to be used for fetching ranked genes.

  • n_genes (Optional[int], default 200) – Number of genes to be included in the tables. If None, all genes are included.

  • out_group_fractions (bool, default False) – If True, the output tables will contain additional columns giving the fraction of genes per group.

  • var_columns (list[str], default []) – List of adata.var columns, which will be added to pandas.DataFrame.

  • save_excel (Optional[str], default None) – The path to a file for writing the marker gene tables as an excel file (with one sheet per group).

Returns:

A dictionary with group names as keys, and marker gene tables (pandas DataFrames) per group as values.

Return type:

dict[str, pd.DataFrame]

Raises:

ValueError:

  1. If not all columns given in var_columns are in adata.var. 2. If key cannot be found in adata.uns.
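
Example

A minimal usage sketch; scanpy.tl.rank_genes_groups is called directly here to create the "rank_genes_groups" entry in adata.uns, and "bulk_labels" is an .obs column of the example dataset.

import scanpy as sc
from sctoolbox.tools import marker_genes

adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, groupby="bulk_labels")

tables = marker_genes.get_rank_genes_tables(adata, key="rank_genes_groups", n_genes=50)
print(list(tables.keys()))  # one marker gene table per group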

sctoolbox.tools.marker_genes.mask_rank_genes(adata: AnnData, genes: list[str], key: str = 'rank_genes_groups', inplace: bool = True) AnnData | None[source]

Mask names with “nan” in .uns[key][“names”] if they are found in given ‘genes’.

Parameters:
  • adata (sc.AnnData) – Anndata object containing ranked genes.

  • genes (list[str]) – List of genes to be masked.

  • key (str, default "rank_genes_groups") – The key in adata.uns to be used for fetching ranked genes.

  • inplace (bool, default True) – If True, modifies adata.uns[key][“names”] in place. Otherwise, returns a copy of adata.

Returns:

If inplace = True, modifies adata.uns[key][“names”] in place and returns None. Otherwise, returns a copy of adata.

Return type:

Optional[sc.AnnData]

Raises:

ValueError: – If genes is not of type list.

sctoolbox.tools.marker_genes.run_deseq2(adata: AnnData, sample_col: str, condition_col: str, confounders: list[str] | None = None, layer: str | None = None, percentile_range: tuple[int, int] = (0, 100)) DataFrame[source]

Run DESeq2 on counts within adata. Must be run on the raw counts per sample. If the adata contains normalized counts in .X, ‘layer’ can be used to specify raw counts.

Parameters:
  • adata (sc.AnnData) – Anndata object containing raw counts.

  • sample_col (str) – Column name in adata.obs containing sample names.

  • condition_col (str) – Column name in adata.obs containing condition names to be compared.

  • confounders (Optional[list[str]], default None) – List of additional column names in adata.obs containing confounders to be included in the model.

  • layer (Optional[str], default None) – Name of layer containing raw counts to be used for DESeq2. Default is None (use .X for counts)

  • percentile_range (Tuple[int, int], default (0, 100)) – Percentile range of cells to be used for calculating pseudobulks. Setting (0,95) will restrict calculation to the cells in the 0-95% percentile ranges. Default is (0, 100), which means all cells are used.

Returns:

A dataframe containing the results of the DESeq2 analysis. Also adds the dataframe to adata.uns[“deseq_result”]

Return type:

pd.DataFrame

Raises:

ValueError:

  1. If any given column name is not found in adata.obs. 2. If any given column name contains characters not compatible with R.

Notes

Needs the package ‘diffexpr’ to be installed along with ‘bioconductor-deseq2’ and ‘rpy2’. These can be obtained by installing the sctoolbox [deseq2] extra with pip using: pip install .[deseq2].

See also

sctoolbox.utils.pseudobulk_table

sctoolbox.tools.marker_genes.score_genes(adata: AnnData, gene_set: str | list[str], score_name: str = 'score', inplace: bool = True, **kwargs: Any) AnnData | None[source]

Assign a score to each cell depending on the expression of a set of genes. This is a wrapper for scanpy.tl.score_genes.

Parameters:
  • adata (sc.AnnData) – Anndata object to score.

  • gene_set (str | list[str]) – A list of genes or path to a file containing a list of genes. The txt file should have one gene per row.

  • score_name (str, default "score") – Name of the column in obs table where the score will be added.

  • inplace (bool, default True) – Adds the new column to the original anndata object.

  • **kwargs (Any) – Additional arguments to be passed to scanpy.tl.score_genes.

Returns:

If inplace is False, return a copy of anndata object with the new column in the obs table.

Return type:

Optional[sc.AnnData]

Raises:

FileNotFoundError: – If path given in gene_set does not lead to a file.
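
Example

A minimal usage sketch; the gene set is purely illustrative (the first 20 var names) and would normally be replaced by a real marker gene list or a path to a txt file.

import scanpy as sc
from sctoolbox.tools import marker_genes

adata = sc.datasets.pbmc68k_reduced()
gene_set = adata.var_names[:20].tolist()  # illustrative gene set

marker_genes.score_genes(adata, gene_set=gene_set, score_name="example_score", inplace=True)
print(adata.obs["example_score"].head())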


celltype_annotation

Module for general celltype annotation.

sctoolbox.tools.celltype_annotation.add_cellxgene_annotation(adata: AnnData, csv: str) None[source]

Add columns from cellxgene annotation to the adata .obs table.

Parameters:
  • adata (sc.AnnData) – Adata object to add annotations to.

  • csv (str) – Path to the annotation file from cellxgene containing cell annotation.

sctoolbox.tools.celltype_annotation.get_celltype_assignment(adata: AnnData, clustering: str, marker_genes_dict: dict[str, list[str]], column_name: str = 'celltype') dict[str, str][source]

Get cell type assignment based on marker genes.

TODO make this more robust

Parameters:
  • adata (sc.AnnData) – Anndata object containing raw counts.

  • clustering (str) – Name of clustering column to use for cell type assignment.

  • marker_genes_dict (dict[str, list[str]]) – Dictionary containing cell type names as keys and lists of marker genes as values.

  • column_name (str, default: "celltype") – Name of column to add to adata.obs containing cell type assignment.

Returns:

Returns a dictionary with cluster-to-celltype mapping (key: cluster name, value: cell type) Also adds the cell type assignment to adata.obs[<column_name>] in place.

Return type:

dict[str, str]

sctoolbox.tools.celltype_annotation.run_scsa(adata: AnnData, gene_column: str | None = None, gene_symbol: Literal['auto', 'symbol', 'id'] = 'auto', key: str = 'rank_genes_groups', column_added: str = 'SCSA_pred_celltype', inplace: bool = True, python_path: str | None = None, species: Literal['Human', 'Mouse'] | None = 'Human', fc: float | int = 1.5, pvalue: float = 0.05, tissue: str = 'All', user_db: str | None = None, celltype_column: str = 'cell_name') AnnData | None[source]

Run SCSA cell type annotation and assign cell types to cluster in an adata object.

This is a wrapper function that extracts ranked genes generated by scanpy.tl.rank_genes_groups function and generates input matrix for SCSA, then runs SCSA and assigns cell types to clusters in adata.obs.

Also adds adata.uns[‘SCSA’] as a dictionary with the following keys:

  • ‘results’: SCSA result table

  • ‘stderr’: SCSA stderr

  • ‘stdout’: SCSA stdout

  • ‘cmd’: SCSA command

Notes

SCSA sometimes gives ValueError: MultiIndex (as covered in https://github.com/bioinfo-ibms-pumc/SCSA/issues/19). This can be solved by downgrading pandas to 1.2.4.

Parameters:
  • adata (sc.AnnData) – Adata object to be annotated, must contain ranked genes in adata.uns

  • gene_column (str, default None) – Name of the column in adata.var that contains the gene names.

  • gene_symbol (str, default 'auto') – TODO: Implement. The type of gene symbol. One of “auto”, “symbol” (gene name) or “id” (ensembl id).

  • key (str, default 'rank_genes_groups') – The key in adata.uns where ranked genes are stored.

  • column_added (str, default 'SCSA_pred_celltype') – The column name in adata.obs where the cell types will be added.

  • inplace (bool, default True) – If True, cell types will be added to adata.obs.

  • python_path (str, default None) – SCSA parameter: Path to python. If not given, will be inferred from sys.executable.

  • species (Optional[Literal['Human', 'Mouse']], default 'Human') – SCSA parameter: Supports only Human or Mouse. Set to None to use the user defined database given in user_db.

  • fc (float, default 1.5) – SCSA parameter: Fold change threshold to filter genes.

  • pvalue (float, default 0.05) – SCSA parameter: P-value threshold to filter genes.

  • tissue (str, default 'All') – TODO: Implement. SCSA parameter: A specific tissue can be defined.

  • user_db (str, default None) – SCSA parameter: Path to the user defined marker database. Must contain at least two columns, one named “cell_name” (or set via celltype_column) for the cell type annotation, and at least one more column with gene names or ids (selected automatically from best gene overlap).

  • celltype_column (str, default 'cell_name') – SCSA parameter: The column name in the user_db that contains the cell type annotation.

Returns:

If inplace is False, returns the adata with cell types in adata.obs; else returns None.

Return type:

Optional[sc.AnnData]

Raises:
  • KeyError:

    1. If key is not in adata.uns. 2. If ‘params’ is not in adata.uns[key] or if ‘groupby’ is not in adata.uns[key][‘params’]. 3. If gene column is not in adata.var

  • ValueError:

    1. If the species parameter is not Human, Mouse or None. 2. If no species and no user database is provided. 3. If the SCSA run fails.

Additional

calc_overlap_fc

Module to calculate fold change of reads or fragments from a BAM or fragments file that overlap specified regions.

sctoolbox.tools.calc_overlap_fc.pct_fragments_in_promoters(adata: AnnData, gtf_file: str | None = None, bam_file: str | None = None, fragments_file: str | None = None, cb_col: str | None = None, cb_tag: str = 'CB', species: Literal['bos_taurus', 'caenorhabditis_elegans', 'canis_lupus_familiaris', 'danio_rerio', 'drosophila_melanogaster', 'gallus_gallus', 'homo_sapiens', 'mus_musculus', 'oryzias_latipes', 'rattus_norvegicus', 'sus_scrofa', 'xenopus_tropicalis'] | None = None, threads: int = 1) None[source]

Calculate the fold change of fragments in promoters.

This function calculates for each cell, the fold change of fragments in a BAM alignment file that overlap with a promoter region specified in a GTF file. The results are added to the anndata object as new column. This is a wrapper function for fc_fragments_in_regions to use promoters.

Parameters:
  • adata (sc.AnnData) – The anndata object containing cell barcodes in adata.obs.

  • gtf_file (str, default None) – Path to GTF file with promoter regions. If None, the GTF file in the flatfiles directory will be used.

  • bam_file (str, default None) – Path to BAM file. If None, a fragments file must be provided in the parameter ‘fragments_file’.

  • fragments_file (str, default None) – Path to fragments file. If None, a BAM file must be provided in the parameter ‘bam_file’. The BAM file will be converted into fragments file.

  • cb_col (str, default None) – The column in adata.obs containing cell barcodes. If None, adata.obs.index will be used.

  • cb_tag (str, default 'CB') – The tag where cell barcodes are saved in the bam file. Set to None if the barcodes are in read names.

  • species (str, default None) – Name of the species, will only be used if gtf_file is None to use internal GTF files. Species are {bos_taurus, caenorhabditis_elegans, canis_lupus_familiaris, danio_rerio, drosophila_melanogaster, gallus_gallus, homo_sapiens, mus_musculus, oryzias_latipes, rattus_norvegicus, sus_scrofa, xenopus_tropicalis}

  • threads (int, default 1) – Number of threads for parallelization. Will be used to convert BAM to fragments file.

Raises:

ValueError – If no species and no gtf_file is given.

Deprecated since version 0.4b: This will be removed in 0.5. Use the overlap_fragments_in_regions function instead.

sctoolbox.tools.calc_overlap_fc.fc_fragments_in_regions(adata: AnnData, regions_file: str, bam_file: str | None = None, fragments_file: str | None = None, cb_col: str | None = None, cb_tag: str = 'CB', regions_name: str = 'list', threads: int = 4, temp_dir: str | None = None) None[source]

Calculate the fold change of fragments in a region against the background.

This function calculates the fold change of fragments overlapping a region against the background for each cell. The regions are specified in a BED or GTF file and the fragments should be provided by a fragments or BAM file. The results are added to the anndata object as new column.

Parameters:
  • adata (sc.AnnData) – The anndata object containing cell barcodes in adata.obs.

  • regions_file (str) – Path to BED or GTF file containing regions of interest.

  • bam_file (Optional[str], default None) – Path to BAM file. If None, a fragments file must be provided in the parameter ‘fragments_file’.

  • fragments_file (Optional[str], default None) – Path to fragments file. If None, a BAM file must be provided in the parameter ‘bam_file’. The BAM file will be converted into fragments file.

  • cb_col (Optional[str], default None) – The column in adata.obs containing cell barcodes. If None, adata.obs.index will be used.

  • cb_tag (str, default 'CB') – The tag where cell barcodes are saved in the bam file. Set to None if the barcodes are in read names.

  • regions_name (str, default 'list') – The name of the regions in the BED or GTF file (e.g. Exons). The name will be used in the column name added to the anndata object (e.g. pct_fragments_in_{regions_name}).

  • threads (int, default 4) – Number of threads for parallelization. Will be used to convert BAM to fragments file.

  • temp_dir (Optional[str], default None) – Path to temporary directory. Will use the current working directory by default.

Raises:

ValueError – If neither bam_file nor fragments_file is provided.

sctoolbox.tools.calc_overlap_fc.count_fragments_per_cell(df: DataFrame, barcode_col: str = 'barcode', frag_count: str = 'count') DataFrame[source]

Get counts per cell from a dataframe containing fragments.

The dataframe must have a column with cell barcodes and a column with fragment counts.

Parameters:
  • df (pd.DataFrame) – The dataframe of a bedfile containing the fragments.

  • barcode_col (str, default 'barcode') – The column name containing the cell barcodes.

  • frag_count (str, default 'count') – The column name containing the fragment counts.

Returns:

The dataframe containing the number of fragments per cell.

Return type:

pd.DataFrame


multiomics

Tools for multiomics analysis.

sctoolbox.tools.multiomics.merge_anndata(anndata_dict: dict[str, AnnData], join: Literal['inner', 'outer'] = 'inner') AnnData[source]

Merge two h5ad files for dual cellxgene deployment.

Parameters:
  • anndata_dict (dict[str, sc.AnnData]) – Dictionary with labels as keys and anndata objects as values.

  • join (Literal['inner', 'outer'], default 'inner') –

    Set how to join cells of the adata objects: [‘inner’, ‘outer’]. This only affects the cells since the var/gene section is simply added.

    • ’inner’: only keep overlapping cells.

    • ’outer’: keep all cells. This will add placeholder cells/dots to plots (currently disabled).

Returns:

Merged anndata object.

Return type:

sc.AnnData

Notes

Important: Depending on the size of the anndata objects, the function takes around 60 to 300 GB of RAM! To save RAM and runtime, the function generates a minimal anndata object. Only .X, .var, .obs and .obsm are kept; layers, .varm, etc. are removed.

Raises:

ValueError: – If no indices of both adata.obs tables are overlapping.


bam

Functionality to split bam files into smaller bam files based on clustering in adata.obs.

sctoolbox.tools.bam.bam_adata_ov(adata: AnnData, bamfile: str, cb_tag: str = 'CB') float[source]

Check whether the adata.obs barcodes exist in a column of the bamfile.

Parameters:
  • adata (sc.AnnData) – adata object where adata.obs is stored

  • bamfile (str) – path of the bamfile to investigate

  • cb_tag (str, default 'CB') – bamfile column to extract the barcodes from

Returns:

Hit rate of the barcodes in the bamfile.

Return type:

float

sctoolbox.tools.bam.check_barcode_tag(adata: AnnData, bamfile: str, cb_tag: str = 'CB') None[source]

Check for the possibility that the wrong barcode is used.

Parameters:
  • adata (sc.AnnData) – adata object where adata.obs is stored

  • bamfile (str) – path of the bamfile to investigate

  • cb_tag (str, default 'CB') – bamfile column to extract the barcodes from

Raises:

ValueError – If barcode hit rate could not be identified

sctoolbox.tools.bam.subset_bam(bam_in: str, bam_out: str, barcodes: Iterable[str], read_tag: str = 'CB', pysam_threads: int = 4, overwrite: bool = False) None[source]

Subset a bam file based on a list of barcodes.

Parameters:
  • bam_in (str) – Path to input bam file.

  • bam_out (str) – Path to output bam file.

  • barcodes (Iterable[str]) – List of barcodes to keep.

  • read_tag (str, default "CB") – Tag in bam file to use for barcode.

  • pysam_threads (int, default 4) – Number of threads to use for pysam.

  • overwrite (bool, default False) – Overwrite output file if it exists.
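
Example

A minimal usage sketch; the file paths and barcodes are hypothetical placeholders.

from sctoolbox.tools import bam

barcodes = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"]  # hypothetical barcodes
bam.subset_bam(
    bam_in="sample.bam",    # hypothetical input file
    bam_out="subset.bam",   # hypothetical output file
    barcodes=barcodes,
    read_tag="CB",
    overwrite=True,
)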

sctoolbox.tools.bam.split_bam_clusters(adata: AnnData, bams: str | Iterable[str], groupby: str, barcode_col: str | None = None, read_tag: str = 'CB', output_prefix: str = 'split_', reader_threads: int = 1, writer_threads: int = 1, parallel: bool = False, pysam_threads: int = 4, buffer_size: int = 10000, max_queue_size: int = 1000, individual_pbars: bool = False, sort_bams: bool = False, index_bams: bool = False) None[source]

Split BAM files into clusters based on ‘groupby’ from the anndata.obs table.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix containing clustered cells in .obs.

  • bams (str | Iterable[str]) – One or more BAM files to split into clusters

  • groupby (str) – Name of a column in adata.obs to cluster the pseudobulks by.

  • barcode_col (str, default None) – Name of a column in adata.obs to use as barcodes. If None, use the index of .obs.

  • read_tag (str, default "CB") – Tag to use to identify the reads to split. Must match the barcodes of the barcode_col.

  • output_prefix (str, default split_) – Prefix to use for the output files.

  • reader_threads (int, default 1) – Number of threads to use for reading.

  • writer_threads (int, default 1) – Number of threads to use for writing.

  • parallel (bool, default False) – Whether to enable parallel processing.

  • pysam_threads (int, default 4) – Number of threads for pysam.

  • buffer_size (int, default 10000) – The size of the buffer between readers and writers.

  • max_queue_size (int, default 1000) – The maximum size of the queue between readers and writers.

  • individual_pbars (bool, default False) – Whether to show a progress bar for each individual BAM file and output clusters. Default: False (overall progress bars).

  • sort_bams (bool, default False) – Sort reads in each output bam

  • index_bams (bool, default False) – Create an index file for each output bam. Will throw an error if sort_bams is False.

Raises:

ValueError:

  1. If the groupby column is not in adata.obs. 2. If the barcode column is not in adata.obs. 3. If index_bams is set while sort_bams is False.
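
A hedged sketch of splitting one bam file by a clustering column; the column name 'leiden' and the paths are assumptions:

import scanpy as sc
from sctoolbox.tools import bam

adata = sc.read_h5ad("adata_clustered.h5ad")  # placeholder with clusters in .obs

bam.split_bam_clusters(adata, bams="sample.bam", groupby="leiden",
                       read_tag="CB", output_prefix="split_",
                       parallel=True, writer_threads=2,
                       sort_bams=True, index_bams=True)  # index_bams requires sort_bams=True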

sctoolbox.tools.bam.open_bam(file: str, mode: str, verbosity: Literal[0, 1, 2, 3] = 3, **kwargs: Any) pysam.AlignmentFile[source]

Open a bam file with pysam.AlignmentFile at a specific verbosity level.

Parameters:
  • file (str) – Path to bam file.

  • mode (str) – Mode to open the file in. See pysam.AlignmentFile

  • verbosity (Literal[0, 1, 2, 3], default 3) – Set verbosity level. Verbosity level 0 for no messages.

  • **kwargs (Any) – Forwarded to pysam.AlignmentFile

Returns:

Object to work on SAM/BAM files.

Return type:

pysam.AlignmentFile

sctoolbox.tools.bam.get_bam_reads(bam_obj: pysam.AlignmentFile) int[source]

Get the number of reads from an open pysam.AlignmentFile.

Parameters:

bam_obj (pysam.AlignmentFile) – An open pysam.AlignmentFile object to get the number of reads from.

Returns:

number of reads in the bam file

Return type:

int
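
A short sketch combining open_bam and get_bam_reads; the path is a placeholder and 'rb' is the standard pysam read mode:

from sctoolbox.tools import bam

bam_obj = bam.open_bam("sample.bam", mode="rb", verbosity=0)
n_reads = bam.get_bam_reads(bam_obj)
bam_obj.close()
print("Reads in bam file:", n_reads)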

sctoolbox.tools.bam.bam_to_bigwig(bam: str, output: str | None = None, scale: bool = True, overwrite: bool = True, tempdir: str = '.', remove_temp: bool = True, bedtools_path: str | None = None, bgtobw_path: str | None = None) str[source]

Convert reads in a bam-file to bigwig format.

Parameters:
  • bam (str) – Path to bam file.

  • output (Optional[str], default None) – Path to output file. If None, output is written to same directory as bam file with same name and .bw extension.

  • scale (bool, default True) – Scale output depth to reads per million mapped reads.

  • overwrite (bool, default True) – Overwrite output file if it already exists.

  • tempdir (str, default ".") – Path to directory where temporary files are written.

  • remove_temp (bool, default True) – Remove temporary files after conversion.

  • bedtools_path (Optional[str], default None) – Path to bedtools binary. If None, the function will search for the binary in the path.

  • bgtobw_path (Optional[str], default None) – Path to bedGraphToBigWig binary. If None, the function will search for the binary in the path.

Returns:

Path to output file.

Return type:

str
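
A minimal sketch, assuming bedtools and bedGraphToBigWig are available on the PATH; the bam path is a placeholder:

from sctoolbox.tools import bam

# with output=None the bigwig is written next to the bam file (sample.bw)
bigwig = bam.bam_to_bigwig("sample.bam", scale=True, tempdir=".")
print("Bigwig written to:", bigwig)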

sctoolbox.tools.bam.create_fragment_file(bam: str, barcode_tag: str | None = 'CB', barcode_regex: str | None = None, outdir: str | None = None, nproc: int = 1, index: bool = False, min_dist: int = 10, max_dist: int = 5000, include_clipped: bool = True, shift_plus: int = 5, shift_minus: int = -4, keep_temp: bool = False) str[source]

Create fragments file out of a BAM file.

This is an alternative to using the sinto package, which is slow and has issues with large bam-files.

Parameters:
  • bam (str) – Path to .bam file.

  • barcode_tag (Optional[str], default 'CB') – The tag where cell barcodes are saved in the bam file. Set to None if the barcodes are in read names.

  • barcode_regex (Optional[str], default None) – Regex to extract barcodes from read names. Set to None if barcodes are stored in a tag.

  • outdir (Optional[str], default None) – Path to save fragments file. If None, the file will be saved in the same folder as the .bam file. Temporary intermediate files are also written to this directory.

  • nproc (int, default 1) – Number of threads for parallelization.

  • index (bool, default False) – If True, index fragments file. Requires bgzip and tabix.

  • min_dist (int, default 10) – Minimum fragment length to consider.

  • max_dist (int, default 5000) – Maximum fragment length to consider.

  • include_clipped (bool, default True) – Whether to include soft-clipped bases in the fragment length. If True, the full length between reads is used. If False, the fragment length will be calculated from the aligned parts of the reads.

  • shift_plus (int, default 5) – Shift the start position of the forward read by this value (standard for ATAC).

  • shift_minus (int, default -4) – Shift the end position of the reverse read by this value (standard for ATAC).

  • keep_temp (bool, default False) – If True, keep temporary files.

Returns:

Path to fragments file.

Return type:

str

Raises:
  • ValueError – If both barcode_tag and barcode_regex (or neither) are set.

  • FileNotFoundError – If the input bam file does not exist.
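
A usage sketch with placeholder paths; index=True additionally requires bgzip and tabix:

from sctoolbox.tools import bam

fragments = bam.create_fragment_file("sample.bam", barcode_tag="CB",
                                     nproc=4, index=True)
print("Fragments file:", fragments)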


frip

Module containing functions for calculating the FRiP score.

sctoolbox.tools.frip.calc_frip_scores(adata: AnnData, fragments: str, temp_dir: str = '') tuple[AnnData, float][source]

Calculate the FRiP score for each barcode and add it to adata.obs.

Parameters:
  • adata (sc.AnnData) – AnnData object containing the fragments

  • fragments (str) – path to fragments bedfile

  • temp_dir (str, default "") – path to temp directory

Returns:

AnnData object with the per-barcode FRiP scores added, and the total FRiP score

Return type:

Tuple[sc.AnnData, float]
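
A minimal sketch with placeholder paths; the AnnData object is assumed to contain the barcodes of the fragments file:

import scanpy as sc
from sctoolbox.tools import frip

adata = sc.read_h5ad("adata.h5ad")  # placeholder

adata, total_frip = frip.calc_frip_scores(adata, "fragments.bed")
print("Total FRiP score:", total_frip)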


tsse

Module to calculate TSS enrichment scores.

sctoolbox.tools.tsse.write_TSS_bed(gtf: str, custom_TSS: str, negativ_shift: int = 2000, positiv_shift: int = 2000, temp_dir: str | None = None) tuple[list, list][source]

Write a custom TSS file from a gene gtf file.

negativ_shift and positiv_shift are the number of bases of the upstream and downstream flanks. The output is a bed file with the columns 1: chr, 2: start, 3: stop, plus a list of lists with the same columns.

Parameters:
  • gtf (str) – path to gtf file

  • custom_TSS (str) – path to output file

  • negativ_shift (int, default 2000) – number of bases to shift upstream

  • positiv_shift (int, default 2000) – number of bases to shift downstream

  • temp_dir (Optional[str], default None) – path to temporary directory

Returns:

list of lists with the following columns: 1: chr, 2: start, 3:stop

Return type:

Tuple[list, list]

sctoolbox.tools.tsse.overlap_and_aggregate(fragments: str, custom_TSS: str, overlap: str, tss_list: list[list[str | int]], negativ_shift: int = 2000, positiv_shift: int = 2000) tuple[dict[str, list], list[str]][source]

Overlap the fragments with the custom TSS file and aggregate the fragments in a dictionary.

The dictionary has the structure {barcode: [tss_agg, n_fragments]}, where tss_agg is a numpy array with the aggregated fragments around the TSS (with flanks of negativ_shift and positiv_shift) and n_fragments is the number of fragments overlapping the TSS.

Parameters:
  • fragments (str) – path to fragments file

  • custom_TSS (str) – path to custom TSS file

  • overlap (str) – path to output file

  • tss_list (list[list[str | int]]) – list of lists with the following columns: 1: chr, 2: start, 3:stop

  • negativ_shift (int, default 2000) – number of bases to shift upstream

  • positiv_shift (int, default 2000) – number of bases to shift downstream

Returns:

Dictionary with the structure {barcode: [tss_agg, n_fragments]}, where tss_agg is a numpy array with the aggregated fragments around the TSS (with flanks of negativ_shift and positiv_shift) and n_fragments is the number of fragments overlapping the TSS. The list[str] contains the temporary files.

Return type:

Tuple[dict[str, list], list[str]]

sctoolbox.tools.tsse.calc_per_base_tsse(tSSe_df: DataFrame, min_bias: float = 0.01, edge_size: int = 100) _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes][source]

Calculate per base tSSe by dividing the tSSe by the bias. The bias is calculated by averaging the edges of the tSSe.

The edges are defined by the edge_size.

Parameters:
  • tSSe_df (pd.DataFrame) – dataframe with the following columns: 1: barcode, 2: tSSe, 3: n_fragments

  • min_bias (float, default 0.01) – minimum bias to avoid division by zero

  • edge_size (int, default 100) – number of bases to use for the edges

Returns:

numpy array with the per base tSSe

Return type:

npt.ArrayLike

sctoolbox.tools.tsse.global_tsse_score(per_base_tsse: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], negativ_shift: int, edge_size: int = 50) _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes][source]

Calculate the global tSSe score by averaging the per base tSSe around the TSS.

Parameters:
  • per_base_tsse (npt.ArrayLike) – numpy array with the per base tSSe

  • negativ_shift (int) – number of bases to shift upstream

  • edge_size (int, default 50) – number of bases to use for the edges

Returns:

numpy array with the global tSSe score

Return type:

npt.ArrayLike

sctoolbox.tools.tsse.tsse_scoring(fragments: str, gtf: str, negativ_shift: int = 2000, positiv_shift: int = 2000, edge_size_total: int = 100, edge_size_per_base: int = 50, min_bias: float = 0.01, keep_tmp: bool = False, temp_dir: str = '') DataFrame[source]

Calculate the tSSe score for each cell.

The TSSe score is calculated as described in “Chromatin accessibility profiling by ATAC-seq” (Fiorella et al. 2022).

Parameters:
  • fragments (str) – path to fragments file

  • gtf (str) – path to gtf file

  • negativ_shift (int, default 2000) – number of bases to shift upstream

  • positiv_shift (int, default 2000) – number of bases to shift downstream

  • edge_size_total (int, default 100) – number of bases to use for the edges for the global tSSe score

  • edge_size_per_base (int, default 50) – number of bases to use for the edges for the per base tSSe score

  • min_bias (float, default 0.01) – minimum bias to avoid division by zero

  • keep_tmp (bool, default False) – keep temporary files

  • temp_dir (str, default "") – path to temporary directory

Returns:

dataframe with the following columns: 1: barcode, 2: tSSe, 3: n_fragments

Return type:

pd.DataFrame

sctoolbox.tools.tsse.add_tsse_score(adata: AnnData, fragments: str, gtf: str, negativ_shift: int = 2000, positiv_shift: int = 2000, edge_size_total: int = 100, edge_size_per_base: int = 50, min_bias: float = 0.01, keep_tmp: bool = False, temp_dir: str = '') AnnData[source]

Add the tSSe score to the adata object.

The TSSe score is calculated as described in “Chromatin accessibility profiling by ATAC-seq” (Fiorella et al. 2022).

Parameters:
  • adata (sc.AnnData) – AnnData object

  • fragments (str) – path to fragments.bed

  • gtf (str) – path to gtf file

  • negativ_shift (int, default 2000) – number of bases to shift upstream for the flanking regions

  • positiv_shift (int, default 2000) – number of bases to shift downstream for the flanking regions

  • edge_size_total (int, default 100) – number of bases to use for the edges for the global tSSe score

  • edge_size_per_base (int, default 50) – number of bases to use for the edges for the per base tSSe score

  • min_bias (float, default 0.01) – minimum bias to avoid division by zero

  • keep_tmp (bool, default False) – keep temporary files

  • temp_dir (str, default "") – path to temporary directory

Returns:

AnnData object with added tSSe score

Return type:

sc.AnnData
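
A usage sketch with placeholder paths for the fragments and gtf files:

import scanpy as sc
from sctoolbox.tools import tsse

adata = sc.read_h5ad("adata.h5ad")  # placeholder

adata = tsse.add_tsse_score(adata, fragments="fragments.bed", gtf="genes.gtf",
                            negativ_shift=2000, positiv_shift=2000)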


insertsize

Tools to calculate fragment- and insert size for scATAC.

sctoolbox.tools.insertsize.add_insertsize(adata: AnnData, bam: str | None = None, fragments: str | None = None, barcode_col: str | None = None, barcode_tag: str = 'CB', regions: str | None = None) None[source]

Add information on insertsize to the adata object using either a .bam-file or a fragments file.

Adds columns “insertsize_count” and “mean_insertsize” to adata.obs and a key “insertsize_distribution” to adata.uns containing the insertsize distribution as a pandas dataframe.

Parameters:
  • adata (sc.AnnData) – AnnData object to add insertsize information to.

  • bam (Optional[str], default None) – Path to bam file containing paired-end reads. If None, the fragments file is used instead.

  • fragments (Optional[str], default None) – Path to fragments file containing fragments information. If None, the bam file is used instead.

  • barcode_col (Optional[str], default None) – Column in adata.obs containing the name of the cell barcode. If barcode_col is None, it is assumed that the index of adata.obs contains the barcode.

  • barcode_tag (str, default 'CB') – Only for bamfiles: Tag used for the cell barcode for each read.

  • regions (Optional[str], default None) – Only for bamfiles: A list of regions to obtain reads from, e.g. [‘chr1:1-2000000’]. If None, all reads in the .bam-file are used.

Raises:

ValueError:

  1. If both bam and fragments are given. 2. If neither bam nor fragments is given. 3. If no barcodes overlap between the bam/fragments file and adata.
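
A minimal sketch; exactly one of bam or fragments should be supplied (paths are placeholders):

import scanpy as sc
from sctoolbox.tools import insertsize

adata = sc.read_h5ad("adata.h5ad")  # placeholder

# from a fragments file (alternatively: bam="sample.bam", barcode_tag="CB")
insertsize.add_insertsize(adata, fragments="fragments.bed")
print(adata.obs[["insertsize_count", "mean_insertsize"]].head())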


peak_annotation

Tools for peak annotation required by scATAC-seq.

sctoolbox.tools.peak_annotation.annotate_adata(adata: AnnData, gtf: str, config: dict[str, Any] | None = None, best: bool = True, threads: int = 1, coordinate_cols: list[str] | None = None, temp_dir: str = '', remove_temp: bool = True, inplace: bool = True) AnnData | None[source]

Annotate adata .var features with genes from .gtf using UROPA [1].

The function assumes that the adata.var contains genomic coordinates in the first three columns, e.g. “chromosome”, “start”, “end”. If specific columns should be used, please adjust the names via ‘coordinate_cols’.

Parameters:
  • adata (sc.AnnData) – The anndata object containing features to annotate.

  • gtf (str) – Path to .gtf file containing genomic elements for annotation.

  • config (Optional[dict[str, Any]], default None) – A dictionary indicating how regions should be annotated. Default (None) is to annotate feature ‘gene’ within -10000;1000bp of the gene start. See ‘Examples’ of how to set up a custom configuration dictionary.

  • best (bool, default True) – Whether to return the best annotation or all valid annotations.

  • threads (int, default 1) – Number of threads to use for multiprocessing.

  • coordinate_cols (Optional[list[str]], default None) – A list of column names in the regions DataFrame that contain the chromosome, start and end coordinates. If None the first three columns are taken.

  • temp_dir (str, default '') – Path to a directory to store files. Is only used if input .gtf-file needs sorting.

  • remove_temp (boolean, default True) – If True remove temporary directory after execution.

  • inplace (boolean, default True) – Whether to add the annotations to the adata object in place.

Returns:

If inplace == True, the annotation is added to adata.var in place. Else, a copy of the adata object is returned with the annotations added.

Return type:

Optional[sc.AnnData]

References

Examples

>>> custom_config = {"queries": [{"distance": [10000, 1000],
                                  "feature_anchor": "start",
                                  "feature": "gene"}],
                     "priority": True,
                     "show_attributes": "all"}
>>> annotate_adata(adata, gtf="genes.gtf", config=custom_config)

sctoolbox.tools.peak_annotation.annotate_narrowPeak(filepath: str, gtf: str, config: dict[str, Any] | None = None, best: bool = True, threads: int = 1, temp_dir: str = '', remove_temp: bool = True) DataFrame[source]

Annotate narrowPeak files with genes from .gtf using UROPA.

Parameters:
  • filepath (str) – Path to the narrowPeak file to be annotated.

  • gtf (str) – Path to the .gtf file containing the genes to be annotated.

  • config (Optional[dict[str, Any]], default None) – A dictionary indicating how regions should be annotated. Default (None) is to annotate feature ‘gene’ within -10000;1000bp of the gene start. See ‘Examples’ of how to set up a custom configuration dictionary.

  • best (bool, default True) – Whether to return the best annotation or all valid annotations.

  • threads (int, default 1) – Number of threads to perform the annotation.

  • temp_dir (str, default '') – Path to the directory where the temporary files should be written.

  • remove_temp (bool, default True) – If True remove temporary directory after execution.

Returns:

Dataframe containing the annotations.

Return type:

pd.DataFrame
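
A hedged sketch with placeholder paths to a narrowPeak and a gtf file:

from sctoolbox.tools import peak_annotation

annotation = peak_annotation.annotate_narrowPeak("peaks.narrowPeak",
                                                 gtf="genes.gtf", threads=4)
print(annotation.head())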


receptor_ligand

Tools for a receptor-ligand analysis.

sctoolbox.tools.receptor_ligand.download_db(adata: AnnData, db_path: str, ligand_column: str, receptor_column: str, sep: str = '\t', inplace: bool = False, overwrite: bool = False) AnnData | None[source]

Download table of receptor-ligand interactions and store in adata.

Parameters:
  • adata (sc.AnnData) – Analysis object the database will be added to.

  • db_path (str) – Path to database table. A valid database needs a column with receptor gene ids/symbols and ligand gene ids/symbols. Human: http://tcm.zju.edu.cn/celltalkdb/download/processed_data/human_lr_pair.txt Mouse: http://tcm.zju.edu.cn/celltalkdb/download/processed_data/mouse_lr_pair.txt

  • ligand_column (str) – Name of the column with ligand gene names. Use ‘ligand_gene_symbol’ for the urls provided above.

  • receptor_column (str) – Name of column with receptor gene names. Use ‘receptor_gene_symbol’ for the urls provided above.

  • sep (str, default '\t') – Separator of database table.

  • inplace (bool, default False) – Whether to copy adata or modify it inplace.

  • overwrite (bool, default False) – If True will overwrite existing database.

Notes

This will remove all information stored in adata.uns[‘receptor-ligand’]

Returns:

If not inplace, return copy of adata with added database path and database table to adata.uns[‘receptor-ligand’]

Return type:

Optional[sc.AnnData]

Raises:

ValueError: – 1: If ligand_column is not in database. 2: If receptor_column is not in database.

sctoolbox.tools.receptor_ligand.calculate_interaction_table(adata: AnnData, cluster_column: str, gene_index: str | None = None, normalize: int = 1000, inplace: bool = False, overwrite: bool = False) AnnData | None[source]

Calculate an interaction table of the clusters defined in adata.

Parameters:
  • adata (sc.AnnData) – AnnData object that holds the expression values and clustering

  • cluster_column (str) – Name of the cluster column in adata.obs.

  • gene_index (Optional[str], default None) – Column in adata.var that holds gene symbols/ ids. Corresponds to download_db(ligand_column, receptor_column). Uses index when None.

  • normalize (int, default 1000) – Correct clusters to given size.

  • inplace (bool, default False) – Whether to copy adata or modify it inplace.

  • overwrite (bool, default False) – If True will overwrite existing interaction table.

Returns:

If not inplace, return a copy of adata with the interaction table added to adata.uns[‘receptor-ligand’][‘interactions’]

Return type:

Optional[sc.AnnData]

Raises:
  • ValueError: – 1: If receptor-ligand database cannot be found. 2: If database genes do not match adata genes.

  • Exception: – If no interactions were found.
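
A sketch of the typical receptor-ligand setup, using the human CellTalkDB url and column names mentioned above; the cluster column 'leiden' is an assumption:

import scanpy as sc
from sctoolbox.tools import receptor_ligand as rl

adata = sc.read_h5ad("adata.h5ad")  # placeholder with clusters in .obs

rl.download_db(adata,
               db_path="http://tcm.zju.edu.cn/celltalkdb/download/processed_data/human_lr_pair.txt",
               ligand_column="ligand_gene_symbol",
               receptor_column="receptor_gene_symbol",
               inplace=True)
rl.calculate_interaction_table(adata, cluster_column="leiden", inplace=True)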

sctoolbox.tools.receptor_ligand.interaction_violin_plot(adata: AnnData, min_perc: int | float, save: str | None = None, figsize: tuple[int, int] = (5, 20), dpi: int = 100) _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes][source]

Generate violin plot of pairwise cluster interactions.

Parameters:
  • adata (sc.AnnData) – AnnData object

  • min_perc (int | float) – Minimum percentage of cells in a cluster that express the respective gene. A value from 0-100.

  • save (str, default None) – Output filename. Uses the internal ‘sctoolbox.settings.figure_dir’.

  • figsize (int tuple, default (5, 20)) – Figure size

  • dpi (float, default 100) – The resolution of the figure in dots-per-inch.

Returns:

Object containing all plots. As returned by matplotlib.pyplot.subplots

Return type:

npt.ArrayLike

sctoolbox.tools.receptor_ligand.hairball(adata: AnnData, min_perc: int | float, interaction_score: float | int = 0, interaction_perc: int | float | None = None, save: str | None = None, title: str | None = 'Network', color_min: float | int = 0, color_max: int | float | None = None, cbar_label: str = 'Interaction count', show_count: bool = False, restrict_to: list[str] | None = None, additional_nodes: list[str] | None = None, hide_edges: list[tuple[str, str]] | None = None) _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes][source]

Generate network graph of interactions between clusters.

Parameters:
  • adata (sc.AnnData) – AnnData object

  • min_perc (int | float) – Minimum percentage of cells in a cluster that express the respective gene. A value from 0-100.

  • interaction_score (float | int, default 0) – Interaction score must be above this threshold for the interaction to be counted in the graph.

  • interaction_perc (Optional[int | float], default None) – Select interaction scores above or equal to the given percentile. Will overwrite parameter interaction_score. A value from 0-100.

  • save (str, default None) – Output filename. Uses the internal ‘sctoolbox.settings.figure_dir’.

  • title (str, default 'Network') – The plots title.

  • color_min (float, default 0) – Min value for color range.

  • color_max (Optional[float | int], default None) – Max value for color range.

  • cbar_label (str, default 'Interaction count') – Label above the colorbar.

  • show_count (bool, default False) – Show the interaction count in the hairball.

  • restrict_to (Optional[list[str]], default None) – Only show given clusters provided in list.

  • additional_nodes (Optional[list[str]], default None) – List of additional node names displayed in the hairball.

  • hide_edges (Optional[list[Tuple[str, str]]], default None) – List of tuples with node names that should not have an edge shown. Order doesn’t matter. E.g. [(“a”, “b”)] to omit the edge between node a and b.

Returns:

Object containing all plots. As returned by matplotlib.pyplot.subplots

Return type:

npt.ArrayLike

Raises:

ValueError: – If restrict_to contains invalid clusters.

sctoolbox.tools.receptor_ligand.progress_violins(datalist: list[DataFrame], datalabel: list[str], cluster_a: str, cluster_b: str, min_perc: float | int, save: str, figsize: tuple[int | float, int | float] = (12, 6)) str[source]

Show cluster interactions over timepoints.

CURRENTLY NOT FUNCTIONAL!

TODO Implement function

Parameters:
  • datalist (list[pd.DataFrame]) – List of interaction DataFrames. Each DataFrame represents a timepoint.

  • datalabel (list[str]) – List of strings. Used to label the violins.

  • cluster_a (str) – Name of the first interacting cluster.

  • cluster_b (str) – Name of the second interacting cluster.

  • min_perc (float | int) – Minimum percentage of cells in a cluster each gene must be expressed in.

  • save (str) – Path to output file.

  • figsize (Tuple[int, int], default (12, 6)) – Tuple of plot (width, height).

Return type:

str

sctoolbox.tools.receptor_ligand.interaction_progress(datalist: list[AnnData], datalabel: list[str], receptor: str, ligand: str, receptor_cluster: str, ligand_cluster: str, figsize: tuple[int | float, int | float] = (4, 4), dpi: int = 100, save: str | None = None) Axes[source]

Barplot that shows the interaction score of a single interaction between two given clusters over multiple datasets.

TODO add checks & error messages

Parameters:
  • datalist (list[sc.AnnData]) – List of anndata objects.

  • datalabel (list[str]) – List of labels for the given datalist.

  • receptor (str) – Name of the receptor gene.

  • ligand (str) – Name of the ligand gene.

  • receptor_cluster (str) – Name of the receptor cluster.

  • ligand_cluster (str) – Name of the ligand cluster.

  • figsize (Tuple[int | float, int | float], default (4, 4)) – Figure size in inch.

  • dpi (int, default 100) – Dots per inch.

  • save (Optional[str], default None) – Output filename. Uses the internal ‘sctoolbox.settings.figure_dir’.

Returns:

The plotting object.

Return type:

matplotlib.axes.Axes

sctoolbox.tools.receptor_ligand.connectionPlot(adata: AnnData, restrict_to: list[str] | None = None, figsize: tuple[int | float, int | float] = (10, 15), dpi: int = 100, connection_alpha: str | None = 'interaction_score', save: str | None = None, title: str | None = None, receptor_cluster_col: str = 'receptor_cluster', receptor_col: str = 'receptor_gene', receptor_hue: str = 'receptor_score', receptor_size: str = 'receptor_percent', receptor_genes: list[str] | None = None, ligand_cluster_col: str = 'ligand_cluster', ligand_col: str = 'ligand_gene', ligand_hue: str = 'ligand_score', ligand_size: str = 'ligand_percent', ligand_genes: list[str] | None = None, filter: str | None = None, lw_multiplier: int | float = 2, wspace: float = 0.4, line_colors: str | None = 'rainbow') _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes][source]

Show specific receptor-ligand connections between clusters.

Parameters:
  • adata (sc.AnnData) – AnnData object

  • restrict_to (Optional[list[str]], default None) – Restrict plot to given cluster names.

  • figsize (Tuple[int | float, int | float], default (10, 15)) – Figure size

  • dpi (float, default 100) – The resolution of the figure in dots-per-inch.

  • connection_alpha (str, default 'interaction_score') – Name of column that sets alpha value of lines between plots. None to disable.

  • save (Optional[str], default None) – Output filename. Uses the internal ‘sctoolbox.settings.figure_dir’.

  • title (Optional[str], default None) – Title of the plot

  • receptor_cluster_col (str, default 'receptor_cluster') – Name of column containing cluster names of receptors. Shown on x-axis.

  • receptor_col (str, default 'receptor_gene') – Name of column containing gene names of receptors. Shown on y-axis.

  • receptor_hue (str, default 'receptor_score') – Name of column containing receptor scores. Shown as point color.

  • receptor_size (str, default 'receptor_percent') – Name of column containing receptor expression percentage. Shown as point size.

  • receptor_genes (Optional[list[str]], default None) – Restrict receptors to given genes.

  • ligand_cluster_col (str, default 'ligand_cluster') – Name of column containing cluster names of ligands. Shown on x-axis.

  • ligand_col (str, default 'ligand_gene') – Name of column containing gene names of ligands. Shown on y-axis.

  • ligand_hue (str, default 'ligand_score') – Name of column containing ligand scores. Shown as point color.

  • ligand_size (str, default 'ligand_percent') – Name of column containing ligand expression percentage. Shown as point size.

  • ligand_genes (Optional[list[str]], default None) – Restrict ligands to given genes.

  • filter (Optional[str], default None) – Conditions to filter the interaction table on. E.g. ‘column_name > 5 & other_column < 2’. Forwarded to pandas.DataFrame.query.

  • lw_multiplier (int | float, default 2) – Linewidth multiplier.

  • wspace (float, default 0.4) – Width between plots. Fraction of total width.

  • line_colors (Optional[str], default 'rainbow') – Name of colormap used to color lines. All lines are black if None.

Returns:

Object containing all plots. As returned by matplotlib.pyplot.subplots

Return type:

npt.ArrayLike

Raises:

Exception: – If no interactions between clusters are found.

sctoolbox.tools.receptor_ligand.get_interactions(anndata: AnnData, min_perc: int | float | None = None, interaction_score: int | float | None = None, interaction_perc: int | float | None = None, group_a: list[str] | None = None, group_b: list[str] | None = None, save: str | None = None) DataFrame[source]

Get interaction table from anndata and apply filters.

Parameters:
  • anndata (sc.AnnData) – Anndata object to pull interaction table from.

  • min_perc (Optional[float | int], default None) – Minimum percent of cells in a cluster that express the ligand/ receptor gene. Value from 0-100.

  • interaction_score (Optional[float | int], default None) – Filter receptor-ligand interactions below given score. Ignored if interaction_perc is set.

  • interaction_perc (Optional[float | int], default None) – Filter receptor-ligand interactions below the given percentile. Overwrite interaction_score. Value from 0-100.

  • group_a (Optional[list[str]], default None) – List of cluster names that must be present in any given receptor-ligand interaction.

  • group_b (Optional[list[str]], default None) – List of cluster names that must be present in any given receptor-ligand interaction.

  • save (Optional[str], default None) – Output filename. Uses the internal ‘sctoolbox.settings.table_dir’.

Returns:

Table that contains interactions. Columns:

  • receptor_cluster = name of the receptor cluster

  • ligand_cluster = name of the ligand cluster

  • receptor_gene = name of the receptor gene

  • ligand_gene = name of the ligand gene

  • receptor_score = zscore of receptor gene cluster mean expression (scaled by cluster size)

  • ligand_score = zscore of ligand gene cluster mean expression (scaled by cluster size)

  • receptor_percent = percent of cells in cluster expressing receptor gene

  • ligand_percent = percent of cells in cluster expressing ligand gene

  • receptor_cluster_size = number of cells in receptor cluster

  • ligand_cluster_size = number of cells in ligand cluster

  • interaction_score = sum of receptor_score and ligand_score

Return type:

pd.DataFrame
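
A minimal sketch of filtering the interaction table computed above; the thresholds are arbitrary examples:

from sctoolbox.tools import receptor_ligand as rl

# 'adata' as prepared in the receptor-ligand sketch above (placeholder)
# keep interactions where both genes are expressed in >= 10 % of their cluster
# and the interaction score is in the top 10 %
interactions = rl.get_interactions(adata, min_perc=10, interaction_perc=90)
print(interactions.head())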


gene_correlation

Tools for gene-gene correlation.

sctoolbox.tools.gene_correlation.correlate_conditions(adata: AnnData, gene: str, condition_col: str, condition_A: str, condition_B: str) DataFrame[source]

Calculate the correlation of a gene's expression over two conditions and compare the two conditions.

Parameters:
  • adata (sc.AnnData) – Annotated adata object.

  • gene (str) – Gene of interest.

  • condition_col (str) – Column in adata.obs containing conditions.

  • condition_A (str) – Name of the first condition.

  • condition_B (str) – Name of the second condition.

Returns:

Dataframe containing the correlation of a gene expression over two conditions.

Return type:

pd.DataFrame

Raises:

ValueError – If one or both condition columns are not in adata.obs.

sctoolbox.tools.gene_correlation.correlate_ref_vs_all(adata: AnnData, ref_gene: str, correlation_threshold: float = 0.4, save: str | None = None) DataFrame[source]

Calculate the correlation of the reference gene vs all other genes.

Additionally, plots umap highlighting correlating gene expression.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix.

  • ref_gene (str) – Reference gene.

  • correlation_threshold (float, default 0.4) – Threshold for correlation value. Correlating genes with a correlation >= threshold will be plotted if plot parameter is set to True.

  • save (str, default None) – Path to save the figure to.

Returns:

Dataframe containing the correlation of the reference gene to all other genes.

Return type:

pd.DataFrame
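
A usage sketch; the gene name 'CD3E' and the file path are placeholders:

import scanpy as sc
from sctoolbox.tools import gene_correlation

adata = sc.read_h5ad("adata.h5ad")  # placeholder

corr_table = gene_correlation.correlate_ref_vs_all(adata, ref_gene="CD3E",
                                                   correlation_threshold=0.4)
print(corr_table.head())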

sctoolbox.tools.gene_correlation.compare_two_conditons(df_cond_A: DataFrame, df_cond_B: DataFrame, n_cells_A: int, n_cells_B: int) DataFrame[source]

Compare two conditions.

Parameters:
  • df_cond_A (pd.DataFrame) – Dataframe containing correlation analysis from correlate_ref_vs_all

  • df_cond_B (pd.DataFrame) – Dataframe containing correlation analysis from correlate_ref_vs_all

  • n_cells_A (int) – Number of cells within condition A

  • n_cells_B (int) – Number of cells within condition B

Returns:

Dataframe containing single correlation and Fisher Z transformation

Return type:

pd.DataFrame


tobias

Tools for TOBIAS usage.

sctoolbox.tools.tobias.write_TOBIAS_config(out_path: str, bams: list[str] = [], names: list[str] | None = None, fasta: str | None = None, blacklist: str | None = None, gtf: str | None = None, motifs: str | None = None, organism: Literal['human', 'mouse', 'zebrafish'] = 'human', output: str = 'TOBIAS_output', plot_comparison: bool = True, plot_correction: bool = True, plot_venn: bool = True, coverage: bool = True, wilson: bool = True) None[source]

Write a TOBIAS config file from input bams/fasta/blacklist etc.

Parameters:
  • out_path (str) – Path to output yaml file.

  • bams (list[str], default []) – List of paths to bam files.

  • names (Optional[list[str]], default None) – List of names for the bams. If None, the names are set to the bam file names with common prefix and suffix removed.

  • fasta (Optional[str], default None) – Path to fasta file.

  • blacklist (Optional[str], default None) – Path to blacklist file.

  • gtf (Optional[str], default None) – Path to gtf file.

  • motifs (Optional[str], default None) – Path to motif file.

  • organism (Literal["human", "mouse", "zebrafish"], default 'human') – Organism name. TOBIAS supports ‘human’, ‘mouse’ or ‘zebrafish’.

  • output (str, default 'TOBIAS_output') – Output directory of the TOBIAS run.

  • plot_comparison (bool, default True) – Flag for the step of plotting comparison of the TOBIAS run.

  • plot_correction (bool, default True) – Flag for the step of plotting correction of the TOBIAS run.

  • plot_venn (bool, default True) – Flag for the step of plotting Venn diagrams of the TOBIAS run.

  • coverage (bool, default True) – Flag for coverage step of the TOBIAS run.

  • wilson (bool, default True) – Flag for wilson step of the TOBIAS run.
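
A hedged sketch of writing a config for two pseudobulk bams; all paths are placeholders:

from sctoolbox.tools import tobias

tobias.write_TOBIAS_config("tobias_config.yml",
                           bams=["split_cluster1.bam", "split_cluster2.bam"],
                           fasta="genome.fa",
                           blacklist="blacklist.bed",
                           gtf="genes.gtf",
                           motifs="motifs.jaspar",
                           organism="human",
                           output="TOBIAS_output")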