[ ]:
import sctoolbox
from sctoolbox.utilities import bgcolor
03 - Batch effect correction and comparisons
1 - Description
Batch effects are variances in the data that are not intended by the experimental design (e.g. technical variance). They can be introduced through various sources; for example, sequencing samples at different timepoints may introduce batch effects. As batch effects can interfere with downstream analysis, they are typically removed. However, identifying and correcting batch effects can be challenging, as this is highly dependent on the experimental setup of the dataset. DOI: https://doi.org/10.1038/nrg2825
To determine the strength of a batch effect, the Local Inverse Simpson’s Index (LISI) can be used, which measures the heterogeneity of the batches within local neighborhoods. DOI: https://doi.org/10.1038/s41592-019-0619-0
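In essence, LISI asks how many batches are effectively represented in each cell's local neighborhood. Below is a minimal, unweighted sketch of the underlying inverse Simpson's index for illustration only; the published score uses distance-weighted neighborhood probabilities, and the wrapped implementation is applied later in this notebook (section 7).

import numpy as np

def inverse_simpson(batch_labels) -> float:
    """Inverse Simpson's index of a neighborhood's batch composition:
    1.0 if only one batch is present, rising to the number of batches
    when they are perfectly mixed."""
    _, counts = np.unique(batch_labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 / np.sum(proportions ** 2)

inverse_simpson(["s1", "s2", "s1", "s2"])  # 2.0 -> two batches, well mixed
inverse_simpson(["s1", "s1", "s1", "s1"])  # 1.0 -> a single batch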
Another important property of our data is its high dimensionality, which has to be reduced for subsequent steps like embedding and clustering. For this, algorithms like Principal Component Analysis (PCA) can be used. DOI: https://doi.org/10.1038/nmeth.4346
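For orientation, the PCA used throughout this notebook corresponds to the standard scanpy call below (shown here for illustration; the data object is only loaded in section 3). The explained-variance ratio it stores is what the variance plots in section 6 are based on.

import scanpy as sc

sc.pp.pca(adata, n_comps=50)  # PCA on the normalized matrix
adata.uns["pca"]["variance_ratio"][:5]  # fraction of variance per component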
This notebook aims to prepare our data for the subsequent embedding and clustering through normalization and batch correction. To assess the effects of normalization and batch correction, an embedding is calculated for visualization in addition to the LISI score. For the embedding and the following notebooks, a PCA is performed and a subset of components is selected.
2 - Setup
[ ]:
import sctoolbox.tools as tools
import sctoolbox.plotting as pl
import sctoolbox.utils as utils
import scanpy as sc
import episcanpy as epi
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
utils.settings_from_config("config.yaml", key="03")
sc.set_figure_params(vector_friendly=True, dpi_save=600, scanpy=False)
3 - Load anndata
Loads the anndata.h5ad from the last notebook and provides a basic overview.
[ ]:
adata = utils.load_h5ad("anndata_2.h5ad")
with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)
4 - General input
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
# Choose normalization method
norm_methods = ['tfidf', 'total']  # can be 'tfidf' and/or 'total'
# Choose if highly variable features should be used
use_highly_variable = True
# Set number of neighbors
n_neighbors = 15  # default: 15
# UMAP related settings
metacol = 'sample'
# Batch correction: if True, several batch correction methods will be performed;
# you can choose the best one afterwards
batch_column = "sample"
perform_batch_correction = True
batch_methods = ["bbknn", "harmony"]  # further options: "mnn", "scanorama"
threads = 8
[ ]:
# Ensure that the batch column is of type category
adata.obs[batch_column] = adata.obs[batch_column].astype(str).astype("category")
5 - Normalization
This section performs the selected normalization methods and computes the PCA as a first embedding for each applied method.
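The two options differ in their assumptions: 'total' scales every cell to a comparable total count (typical for RNA counts), while 'tfidf' applies TF-IDF weighting as commonly used for peak/fragment counts. The sketch below is conceptual only; the exact implementation inside tools.normalize_adata may differ.

import numpy as np
import scipy.sparse as sparse

def tfidf_sketch(X):
    """Conceptual TF-IDF: per-cell term frequency times per-feature
    inverse document frequency (assumes no all-zero cells)."""
    X = sparse.csr_matrix(X)
    tf = X.multiply(1.0 / X.sum(axis=1))                    # term frequency per cell
    idf = np.log1p(X.shape[0] / (1 + (X > 0).sum(axis=0)))  # inverse document frequency
    return sparse.csr_matrix(tf.multiply(idf))

# 'total' roughly corresponds to scanpy's total-count normalization:
# sc.pp.normalize_total(adata, target_sum=None); sc.pp.log1p(adata)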
[ ]:
normalizations = tools.normalize_adata(adata, norm_methods, use_highly_variable=use_highly_variable)
[ ]:
_ = pl.compare_embeddings(list(normalizations.values()), adata_names=list(normalizations.keys()),
var_list=[batch_column], embedding="pca")
6 - PCA and neighbors for uncorrected data
This section is about the selection of principal components (PCs) for the subsequent analysis. As a lower number of PCs decreases the computing resources required by many upcoming steps, PCs explaining little variance are excluded. However, some PCs that explain high variance may be driven by non-biological or otherwise unwanted factors (e.g. number of active genes, cell cycle, etc.); such PCs should be excluded from the following analysis as well.
The following PCA plot and heatmaps are intended to identify potentially unwanted PCs by showing the PCs in combination with available observations (cell-related metrics) and variables (feature-related metrics). In general, selected PCs should avoid correlations with metrics, but the importance of metrics and the stringency of thresholds depends on the experiment and the underlying questions, and therefore requires careful consideration by the analyst.
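To make the correlation criterion concrete, here is a hypothetical helper sketching the correlation-based part of the selection; tools.propose_pcs used below additionally combines this with a variance-percentile criterion.

from scipy.stats import spearmanr

def pcs_below_corr(adata, metrics, corr_thresh=0.3):
    """Return indices (1-based, matching the plots) of PCs whose maximum
    absolute Spearman correlation with the given .obs metrics stays
    below corr_thresh."""
    pcs = adata.obsm["X_pca"]
    keep = []
    for i in range(pcs.shape[1]):
        rhos = [spearmanr(pcs[:, i], adata.obs[m])[0] for m in metrics]
        if max(abs(r) for r in rhos) < corr_thresh:
            keep.append(i + 1)
    return keep

# e.g. pcs_below_corr(adata, ["n_features", "total_counts"], corr_thresh=0.3)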
[ ]:
%bgcolor PowderBlue
# number of PCs shown within the heatmap
n_pcs_heatmap = 15
6.1 - Plot correlations
[ ]:
# PCA correlations with obs and var variables
for method, adata_norm in normalizations.items():
    _ = pl.plot_pca_correlation(adata_norm,
                                n_components=n_pcs_heatmap,
                                which="obs",
                                title=f"Normalization method = {method}\nCorrelation of .obs columns with PCA loadings",
                                save=f"PCA_{method}_correlation_obs.pdf")
    _ = pl.plot_pca_correlation(adata_norm,
                                n_components=n_pcs_heatmap,
                                which="var",
                                title=f"Normalization method = {method}\nCorrelation of .var columns with PCA loadings",
                                save=f"PCA_{method}_correlation_var.pdf")
6.2 - Choose a subset of PCs
In case the above plots showed undesired correlation this section can be used to subset the PCs.
Parameter | Description | Options
---|---|---
subset_pcs | Whether the PCs should be filtered. | True or False
corr_thresh | Highest absolute correlation that is allowed. Will take the maximum correlation for each PC as shown in the heatmap above. | Expects a value between 0 and 1
perc_thresh | Top percentile of PCs that should be kept. | A value between 0 and 100
filter_methods | The PCs will be filtered based on the given methods. E.g. for “variance” and “correlation” PCs are filtered on values from both methods and the intersection is used as the final subset. | Any combination of 'variance' and 'correlation'
basis | Compute correlation based on observations (cells) or variables (genes). | Either 'obs' or 'var'
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
# Whether PCs should be filtered
subset_pcs = True
corr_thresh = 0.3  # PCs with an absolute correlation above this will be filtered
perc_thresh = 50  # top percentile of PCs that should be kept
filter_methods = ['variance', 'correlation']  # propose PCs based on the provided methods
basis = 'obs'  # base correlation on obs or var
[ ]:
if subset_pcs:
    for key, adata_normed in normalizations.items():
        selected_pcs = tools.propose_pcs(anndata=adata_normed,
                                         how=filter_methods,
                                         corr_thresh=corr_thresh,
                                         perc_thresh=perc_thresh,
                                         corr_kwargs={'method': 'spearmanr', 'which': basis})

        # Plot and select number of PCs
        _, ax = plt.subplots()
        ax.set_title(key, y=1.15)
        _ = pl.plot_pca_variance(adata_normed,
                                 save=key + "_PCA_variance_selected.pdf",
                                 selected=selected_pcs,
                                 n_pcs=50,
                                 n_thresh=max(selected_pcs),
                                 corr_plot='spearmanr',
                                 corr_thresh=corr_thresh,
                                 corr_on=basis,
                                 ax=ax)

        tools.subset_PCA(adata_normed, select=selected_pcs)
        normalizations[key] = adata_normed
6.3 - Calculate neighbors
[ ]:
for adata_norm in normalizations.values():
    sc.pp.neighbors(adata_norm, n_neighbors=n_neighbors, method='umap', metric='euclidean')
6.4 - Wrap UMAP
[ ]:
tools.wrap_umap(normalizations.values())
6.5 - Choose a normalization
[ ]:
# Plot an overview of the normalization methods
_ = pl.anndata_overview(normalizations,
plots=["PCA", "PCA-var", "UMAP"],
color_by=["n_features"],
output=None)
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
# Choose the final normalization method
norm_method = "tfidf"
[ ]:
adata = normalizations[norm_method]
sc.pl.pca_overview(adata, color=['n_features'], show=False)
_ = pl.plot_pca_correlation(adata, which="var",
                            title=f"Normalization method = {norm_method}\nCorrelation of .var columns with PCA loadings",
                            save=f"PCA_{norm_method}_correlation_var.pdf")
7 - Batch correction
Batch correction is performed to remove technically introduced artifacts that would otherwise affect and potentially degrade the biological signal in the data. Several batch correction methods are available, and they may perform differently depending on the dataset. Therefore, an overview is provided to compare batch correction methods and select the best performing one. To help in the decision-making process, several metrics (selectable below) are shown, and a score (LISI) is provided that indicates how well the batches are mixed after applying the correction.
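For reference, two of the wrapped corrections have direct scanpy-external equivalents, shown below as a sketch (they require the bbknn and harmonypy packages; tools.wrap_corrections below performs the actual calls and returns one corrected object per method).

import scanpy as sc

# BBKNN replaces the neighbor graph with a batch-balanced version
sc.external.pp.bbknn(adata, batch_key=batch_column)

# Harmony instead adjusts the PCA embedding itself;
# the result is stored in adata.obsm["X_pca_harmony"]
sc.external.pp.harmony_integrate(adata, key=batch_column)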
[ ]:
if perform_batch_correction:
    batch_corrections = tools.wrap_corrections(adata,
                                               batch_key=batch_column,
                                               methods=batch_methods)
else:
    batch_corrections = {"uncorrected": adata}
7.1 - Plot overview of batch corrections
[ ]:
# Run standard UMAP for all adatas
tools.wrap_umap(batch_corrections.values(), threads=threads)
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
# Should preliminary clustering be performed?
do_clustering = True  # True or False
# Select additional metrics shown in the overview plot below;
# accepts adata.obs column names or genes (adata.var.index)
color_by = ['total_counts',
            'pct_counts_is_ribo',
            'pct_counts_is_mito',
            'n_genes',
            'phase']
[ ]:
# Perform additional clustering if it was chosen
if do_clustering:
    for adata_cor in batch_corrections.values():
        sc.tl.leiden(adata_cor, resolution=0.1)
    color_by.append("leiden")

# Calculate LISI scores for batch
tools.wrap_batch_evaluation(batch_corrections, batch_key=batch_column, threads=threads, inplace=True)
[ ]:
# Plot the overview of batch correction methods
for adata_cor in batch_corrections.values():
    adata_cor.obs[batch_column] = adata_cor.obs[batch_column].astype("category")  # ensure the batch column is categorical

_ = pl.anndata_overview(batch_corrections, color_by=color_by + [batch_column],
                        output=None)
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
#Choose an anndata object to proceed
selected = "bbknn"
[ ]:
if not perform_batch_correction and selected != "uncorrected":
    import warnings
    warnings.warn(f"Selected batch correction '{selected}' but batch correction is disabled. Falling back to 'uncorrected'.")
    selected = "uncorrected"
elif selected not in batch_corrections:
    raise KeyError(f"'{selected}' is not a key in batch_corrections. Choose one of: {list(batch_corrections.keys())}")
[ ]:
adata = batch_corrections[selected]
8 - Saving adata for the next notebook
[ ]:
adata
[ ]:
# Saving the data
adata_output = "anndata_3.h5ad"
utils.save_h5ad(adata, adata_output)
[ ]:
sctoolbox.settings.close_logfile()