[ ]:

import sctoolbox
from sctoolbox.utilities import bgcolor

02 - QC and filtering

1 - Description

Quality control

We must ensure that all cellular barcode data correspond to viable cells.

To ensure this quality control (QC) is mandatory and for ATAC-seq data based on 4 key aspects:

The signal-to-noise, either via i) the enrichment of known regions, or ii) the determination of the ratio of fragments in peaks (FRiP).
The total number of unique fragments, also known as library complexity.
The fraction of reads derived from mitochondrial DNA vs. nuclear DNA.
The Fragment Length Distribution.

DOI: 10.1186/s13059-020-1929-3

On the single cell scale additional aspects, such as multiplets have to be taken into account.

DOI: 10.1038/s41467-021-21583-9

Based on QC related columns stored in the .obs we can filter for high quality cells based on all these aspects in this notebook.

Feature Selection

For subsequent processing steps such as dimension reduction, embedding and clustering, the filtering of features and the selection of highly variable features can be conducive.

Therefore this notebooks provides steps to exclude regions from sex chromosomes and mitochondrial chromosomes and to select highly variabele features.

2 - Setup

[ ]:

# sctoolbox modules
import sctoolbox.utils as utils
import sctoolbox.tools as tools
import sctoolbox.plotting as pl
import sctoolbox.tools.qc_filter as qc

import scanpy as sc
import matplotlib.pyplot as plt
import episcanpy as epi
import numpy as np
import pandas as pd
import scrublet as scr

utils.settings_from_config("config.yaml", key="02")

3 - Load anndata

Uses the anndata object written by the previous notebook.

[ ]:

adata = utils.load_h5ad("anndata_1.h5ad")

with pd.option_context("display.max.rows", 5, "display.max.columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

4 - QC and filtering

⬐ Fill in input data here ⬎

[ ]:

%bgcolor PowderBlue

# Set the column in adata.obs containing the biological condition to evaluate
condition_column = "sample"

# Absolute minimum number of features for pre-selection of cells before QC
min_genes = 1

# Choose whether to binarize the X matrix
binarize_mtx = True  # True or False; convert matrix to binary

# Optional: Plot STARsolo quality if a path is given
quant_folder = ""

#----------------------- Doublet removal ------------------------

# Decide whether to remove doublets using scrublet (True) or to skip doublet calculation (False)
filter_doublets = True

# Use native scrublet or the scanpy wrapper (scanpy: optimized for RNA)
use_native_scrublet = True

# Default threshold to apply doublet removal on
doublet_threshold = 0.2

# Apply doublet filtering on each condition separately
use_condition_column = False

# If latter is False choose another column
condition_doublet_removal = None

# Available threads
threads = 2

[ ]:

# Ensure that condition column is a category
adata.obs[condition_column] = adata.obs[condition_column].astype("category")

4.1 - Show STARsolo quality (optional)

If the data was mapped using STARsolo, use the parameter to set the path to the STARsolo runs and plot quality measures across runs. The path must be a folder, e.g. “path/to/starsolo_output”, which contains folders per condition e.g. “cond1”, “cond2”, etc.

[ ]:

if quant_folder != "":
    _ = pl.plot_starsolo_quality(quant_folder, save="starsolo_quality.pdf")
    _ = pl.plot_starsolo_UMI(quant_folder, ncol=3, save="starsolo_cell_selection.pdf")

4.2 - Remove empty cells and features

[ ]:

print('original shape: ')
print(adata.shape)
print('Removing empty features and cells...')

adata = adata[adata.X.sum(axis=1) > 0]
adata = adata[:, adata.X.sum(axis=0) > 0]

print('new shape:')
print(adata.shape)

4.3 - Binarize matrix

[ ]:

# save raw matrix
adata.layers["raw"] = adata.X.copy()
# binarize
if binarize_mtx:
    epi.pp.binarize(adata)

4.4 Calculate and remove doublets

[ ]:

 if filter_doublets:

    if use_condition_column:
        condition_doublet_removal = condition_column

    if use_native_scrublet:
        # TODO: Implement Wrapper function for sctoolbox
        adata.obs['doublet_score'] = float('NaN')
        adata.obs['predicted_doublet'] = None

        sample_dict = {}
        for sample in adata.obs[condition_column].unique():
            print('Run scrublet for condition: ' + sample)
            X = adata.X[adata.obs[condition_column] == sample]
            scrub = scr.Scrublet(X)
            doublet_scores, predicted_doublets = scrub.scrub_doublets()
            adata.obs.loc[adata.obs[condition_column]==sample, 'doublet_score'] = doublet_scores
            adata.obs.loc[adata.obs[condition_column]==sample, 'predicted_doublet'] = predicted_doublets

        adata.obs['predicted_doublet'] = adata.obs['predicted_doublet'].astype(bool)

    else:
        qc.estimate_doublets(adata, groupby=condition_doublet_removal, threads=threads, threshold=doublet_threshold)

    #Remove the duplicates from adata
    tools.filter_cells(adata, "predicted_doublet", remove_bool=True)

[ ]:

# remove empty features
adata = adata[adata.X.sum(axis=1) > 0]
adata = adata[:, adata.X.sum(axis=0) > 0]

4.6 - Cell filtering

[ ]:

# Recalculate standard QC metrics (counts...)
adata = tools.calculate_qc_metrics(adata, var_type='features')

#Remove genes with 0 count
zero_bool = adata.var["n_cells_by_counts"] == 0
adata = adata[:,~zero_bool]

[ ]:

# available obs columns
adata.obs.columns

⬐ Fill in input data here ⬎

[ ]:

%bgcolor PowderBlue

# Decide whether to estimate thresholds individual per condition (False) or globally (True)
global_threshold = False

## Set default filter thresholds

# This will be applied to all samples - the thresholds can be changed manually when plotted
use_default_thresholds = True  # set to False to ignore default_thresholds
default_thresholds = {
                      'n_features': {'min': 100, 'max': 5000},
                      'fld_score_cwt': {'min': 0.01, 'max': 0.6}
                      # add additional columns if needed
                     }

4.6.1 Adapt thresholds

[ ]:

groupby = condition_column if global_threshold is False else None
initial_thresholds = tools.get_thresholds_wrapper(adata, default_thresholds,
                                                  only_automatic_thresholds=False, groupby=groupby)
obs_columns = list(initial_thresholds.keys())
tools.thresholds_as_table(initial_thresholds)

[ ]:

%matplotlib widget

#Plot violins and sliders
obs_figure, obs_slider_dict = pl.quality_violin(adata, columns=obs_columns,
                                                groupby=condition_column,
                                                which="obs",
                                                thresholds=initial_thresholds,
                                                global_threshold=global_threshold,
                                                title="Cell quality control (before)",
                                                save="cell_filtering.png")
obs_figure

4.6.2 - Apply gene filtering

[ ]:

# Get final thresholds
final_thresholds = pl.get_slider_thresholds(obs_slider_dict)
tools.thresholds_as_table(final_thresholds) # show thresholds

[ ]:

# Show pairwise comparisons of column values w/ thresholds (mean values in case thresholds are grouped)
%matplotlib inline
plt.close()  # close previous figure
if len(final_thresholds) > 1:
    mean_thresholds = qc.get_mean_thresholds(final_thresholds)
    _ = pl.pairwise_scatter(adata.obs, obs_columns, thresholds=mean_thresholds, save="cell_filtering_scatter.pdf")

[ ]:

tools.apply_qc_thresholds(adata, final_thresholds, groupby=groupby)

# remove empty features after cell filtering
adata = adata[:, adata.X.sum(axis=0) > 0]

4.6.3 - Show data after filtering

[ ]:

%matplotlib inline

#Plot violins and sliders
figure, slider_dict = pl.quality_violin(adata, columns=obs_columns,
                                        groupby=condition_column,
                                        which="obs", ncols=3,
                                        global_threshold = global_threshold,
                                        title="Cell quality control (after)",
                                        save="cell_filtering_final.pdf")
figure

5 - Feature processing

⬐ Fill in input data here ⬎

[ ]:

%bgcolor PowderBlue

# Removal of gene subsets
filter_chrM = True  # True or False; filtering out chrM
filter_xy = True    # True or False; filtering out chrX and chrY

# Highly Variable Features options
select_highly_variable = True
min_cells = 5 # This one is mandatory
max_cells = None

5.1 - Filter additional marked features

[ ]:

if filter_chrM:
    print("Removing chromosomal features...")
    non_m = [name for name in adata.var_names if not name.startswith('chrM')]  # remove chrM
    adata = adata[:, non_m]

if filter_xy:
    print("Removing gender related features...")
    non_xy = [name for name in adata.var_names if not name.startswith('chrY') | name.startswith('chrX')]
    adata = adata[:, non_xy]

5.2 - Select highly variable features

[ ]:

# update number of cells per feature
adata = tools.calculate_qc_metrics(adata, var_type='features')
if select_highly_variable:
    # get highly variable features
    tools.get_variable_features(adata, max_cells, min_cells)
    #Number of variable genes selected
    adata.var["highly_variable"].sum()
    # plot HVF violin
    pl.violin_HVF_distribution(adata)

6 - Save filtered adata

Store the final results

[ ]:

adata

[ ]:

#Saving the data
adata_output = "anndata_2.h5ad"
utils.save_h5ad(adata, adata_output)

[ ]:

sctoolbox.settings.close_logfile()