[ ]:
import sctoolbox
from sctoolbox.utilities import bgcolor
selected_pcs = None # set as default to prevent error in init cell

03 - Batch effect correction and comparisons


1 - Description

Batch effects are variances in the data that are not intended by the experimental design (e.g. technical variance). They can be introduced through various sources for example sequencing samples at different timepoints may introduce batch effects. As batch effects could interfere with downstream analysis they are typically removed. However, it can be challenging to identify and correct batch effects as this is highly dependent on the experimental setup of the dataset. Therefore, the main goal of this notebook is to aid in the identification and removal of potential batch effects.


2 - Setup

[ ]:
import pandas as pd

import scanpy as sc
import sctoolbox.utilities as utils
import sctoolbox.tools as tools
import sctoolbox.plotting as pl
utils.settings_from_config("config.yaml", key="03")

# Set additional options for figures
sc.set_figure_params(vector_friendly=True, dpi_save=600, scanpy=False)

3 - Load anndata

Loads the anndata.h5ad from the last notebook and provides a basic overview.

[ ]:
adata = utils.load_h5ad("anndata_2.h5ad")

with pd.option_context("display.max.rows", 5, "display.max.columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

4 - General input

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

# Set the species of the data
species = "human"

# Set number of cores to use for multiprocessing
threads = 4

# Options for highly variable genes
min_limit = 1000
max_limit = 5000
min_disp = 0.5
max_disp = float('inf')
min_mean = 0.0125
max_mean = 3

# Set number of neighbors
n_neighbors = 15

# Should preliminary clustering be performed?
do_clustering = True

# Options for batch correction
batch_column = "batch"  # A column in adata.obs containing batch information
perform_batch_correction = True
batch_methods = ["bbknn", "mnn", "harmony", "scanorama"] # , combat (excluded in this example due to slow runtime)

[ ]:
# Ensure that the batch column is of type category
adata.obs[batch_column] = adata.obs[batch_column].astype(str).astype("category")

5 - Plot highly expressed genes

Show the top expressing genes to decide whether they should be included during the normalization (they are not removed from the dataset just excluded during normalization).

[ ]:
sc.pl.highest_expr_genes(adata, show=False)
pl._save_figure("highly_expressed.pdf")

⬐ Fill in input data here ⬎

[ ]:
%bgcolor Powderblue

exclude_highly_expressed = True

6 - Normalization

Normalize the counts for each cell so that all cells have the same number of counts after normalization. This is done to remove the technical variation introduced by sequencing depth.

[ ]:
# Save raw layer before normalization
adata.layers["raw"] = adata.X.copy()
[ ]:
adata = tools.normalize_adata(adata, method="total", target_sum=None, exclude_highly_expressed=exclude_highly_expressed)["total"]

7 - Predict Cell Cycle

Predict the division phase of each cell.

[ ]:
tools.predict_cell_cycle(adata, species=species, s_genes=None, g2m_genes=None, inplace=True)
utils.add_uns_info(adata, "obs_metrics", ["phase"], how="append")

8 - Find highly variable genes

Identify genes that rapidly change expression levels between cells. These genes are considered to be highly informative about the underlying biological structure.

[ ]:
tools.annot_HVG(adata,
                hvg_range=(min_limit, max_limit),
                min_disp=min_disp,
                max_disp=max_disp,
                min_mean=min_mean,
                max_mean=max_mean,
                save="highly_variable.pdf")
[ ]:
# Number of variable genes selected
adata.var["highly_variable"].sum()

9 - PCA and neighbors for uncorrected data


This section computes the principal component analysis (PCA) as a first step of embedding the data. This is followed by computing the nearest neighbor graph. However, some principal components (PCs) of the PCA may be driven by none biological or other unwanted factors (e.g. number of active genes, cell cycle, etc.) as such they should be excluded from the following analysis.

[ ]:
# compute PCA
sc.pp.pca(adata, svd_solver='arpack', use_highly_variable=True)

The following PCA plot and heatmaps are intended to identify potentially unwanted PCs by showing the PCs in combination with available observations (cell-related metrics) and variables (gene-related metrics). In general, selected PCs should avoid correlations with metrics, but the importance of metrics and the stringency of thresholds depends on the experiment and the underlying questions, and therefore requires careful consideration by the analyst.

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

# number of PCs shown within the heatmap
n_pcs_heatmap = 15

[ ]:
# Plot QC variables on the PCA embedding to show potential correlations
sc.pl.pca(adata, color=list(adata.uns["sctoolbox"]["obs_metrics"]) + [batch_column], ncols=3, show=False)
pl._save_figure("PCA_embedding.pdf")
[ ]:
# PCA correlations with obs variables
_ = pl.plot_pca_correlation(adata,
                            n_components=n_pcs_heatmap,
                            which="obs",
                            title="Correlation of .obs columns with PCA loadings",
                            save="PCA_correlation_obs.pdf")
[ ]:
# PCA correlations with var variables
_ = pl.plot_pca_correlation(adata,
                            n_components=n_pcs_heatmap,
                            which="var",
                            title="Correlation of .var columns with PCA loadings",
                            save="PCA_correlation_var.pdf")

9.1 - Choose a subset of PCs (optional)

In case the above plots showed undesired correlation this section can be used to subset the PCs.

Parameter

Description

Options

subset_pcs

Whether the PCs should be filtered.

True or False

corr_thresh

Highest absolute correlation that is allowed. Will take the maximum correlation for each PC as shown in the heatmap above.

Expects a value between 0-1.

perc_thresh

Top percentile of PCs that should be kept.

A value between 0-100%.

filter_methods

The PCs will be filtered based on the given methods. E.g. for “variance” and “correlation” PCs are filtered on values from both methods and the intersection is used as the final subset.

Any combination of ["variance", "cumulative variance", "correlation"]

basis

Compute correlation based on observations (cells) or variables (genes).

Either obs or var.

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

# Whether PCs should be filtered
subset_pcs = True

corr_thresh = 0.3  # PCs with an absolut correlation above this will be filtered
perc_thresh = 50  # Top percentile of PCs that should be kept
filter_methods = ['variance', 'correlation']  # propose PCs based on the provided methods
basis = 'obs'  # base correlation on obs or var

[ ]:
selected_pcs = tools.propose_pcs(anndata=adata,
                                 how=filter_methods,
                                 corr_thresh=corr_thresh,
                                 perc_thresh=perc_thresh,
                                 corr_kwargs={'method': 'spearmanr', 'which': basis})

_ = pl.plot_pca_variance(adata,
                         selected=selected_pcs,
                         save='PCA_variance_proposed_selection.pdf',
                         n_pcs=50,
                         n_thresh=max(selected_pcs),
                         corr_plot='spearmanr',
                         corr_thresh=corr_thresh,
                         corr_on=basis
                        )
[ ]:
f"Proposed principal components: {selected_pcs}"

Create a final PC-selection by changing the blue cell below: - Either copy and adjust the proposed list from directly above - create a custom list of PCs - or accept the proposed list by not changing the cell below.

Note: the selection will only be applied when ``subset_pcs = True``.

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

final_pc_selection = selected_pcs

[ ]:
_ = pl.plot_pca_variance(adata,
                         selected=final_pc_selection if subset_pcs else None,
                         save='PCA_variance_final_selection.pdf',
                         n_pcs=50,
                         n_thresh=max(selected_pcs) if subset_pcs else None,
                         corr_plot='spearmanr',
                         corr_thresh=corr_thresh if subset_pcs else None)
[ ]:
# Subset the number of pcs if chosen in the parameters
if subset_pcs:
    tools.subset_PCA(adata, select=final_pc_selection)

9.2 - Calculate neighbors

[ ]:
sc.pp.neighbors(adata, n_neighbors=n_neighbors)

10 - Batch correction


Batch correction is performed to remove technically introduced artifacts that would affect and potentially degrade the biological results of the data. There are several batch correction methods available, which may perform differently depending on the data set. Therefore, an overview is provided to compare batch correction methods and select the best performing one. To help in the decision making process, several metrics are shown that can be selected below and a score (LISI) is provided that explains whether the batches are well mixed after applying the correction.

[ ]:
if perform_batch_correction:
    batch_corrections = tools.wrap_corrections(adata,
                                              batch_key=batch_column,
                                              methods=batch_methods)
else:
    batch_corrections = {"uncorrected": adata}

10.1 - Plot overview of batch corrections

[ ]:
# Run standard umap for all adatas
tools.wrap_umap(batch_corrections.values(), threads=threads)

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

# select additional metrics shown in the overview plot below
# accepts adata.obs column names or genes (adata.var.index)
color_by = ['total_counts',
            'pct_counts_is_ribo',
            'pct_counts_is_mito',
            'n_genes',
            'phase']

color_by = [batch_column] # + color_by

[ ]:
if do_clustering:
    for adata in batch_corrections.values():
        sc.tl.leiden(adata)
    color_by.append("leiden")
LISI score:
LISI score (stored in adata.obs) indicates the effective number of different categories represented in the local neighborhood of each cell. If the cells are well-mixed, then we expect the LISI score to be closer to n for a data with n batches.

The higher the LISI score is, the better batch correction method worked to normalize the batch effect and mix the cells from different batches.

[ ]:
# Calculate LISI scores for batch
tools.wrap_batch_evaluation(batch_corrections, batch_key=batch_column, threads=threads, inplace=True)
[ ]:
# Plot the overview of batch correction methods
_ = pl.anndata_overview(batch_corrections, color_by=color_by,
                        output="batch_correction_overview.pdf")

10.2 - Select the final object

Select the best suited clustering based on the overview plot above.

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

selected = "harmony"

[ ]:
if not perform_batch_correction and selected != "uncorrected":
    import warnings
    warnings.warn(f"Selected batch correction '{selected}' but batch correction is disabled. Falling back to 'uncorrected'.")

    selected = "uncorrected"
elif selected not in batch_corrections:
    raise KeyError(f"'{selected}' is not a key in batch_corrections. Choose one of: {list(batch_corrections.keys())}")
[ ]:
adata = batch_corrections[selected]

11 - Saving adata for the next notebook

[ ]:
adata
[ ]:
#Saving the data
adata_output = "anndata_3.h5ad"
utils.save_h5ad(adata, adata_output)
[ ]:
sctoolbox.settings.close_logfile()