[ ]:
from sctoolbox.utilities import bgcolor
from sctoolbox import settings

Cell type annotation and marker list assembly


1 - Description

This Jupyter Notebook is designed for annotating cell types in clustered AnnData objects. It is divided into two main parts:

  • Marker List Assembly: This part is used when no existing marker lists are available. It enables users to assemble custom marker lists using the MarkerRepo.

  • Annotation: This section applies the created or provided marker lists to annotate cell types in AnnData objects.

The parameters are organized in three tables:

  1. The first table contains basic parameters necessary for the annotation process.
  2. The second table lists parameters specific to the Marker List Assembly section.
  3. The third table lists parameters related to the Annotation section.

For a basic analysis, the parameters in the first table should be sufficient. However, for more advanced fine-tuning and detailed control of the analysis, the parameters in the second and third tables become critical.

1.1 - Parameter Overview

1.1.1 - Essential input data

| Parameter | Description | Options |
|---|---|---|
| clustered_adata | Name of the clustered AnnData file to use. | String |
| clustering_column | .obs column used for cell type assignment. | None (select interactively) or String (e.g., "leiden") |
| celltype_column_name | Name for the column with the final cell type annotation. If None, keeps all annotation columns. | None or String (e.g., "pred_celltype") |
| marker_lists | Paths to marker lists. If None, assemble lists using MarkerRepo. | None or String or list of Strings (e.g., "/path/my_markers" or ["/heart_markers/markers", "/human/panglao"]) |
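
Because marker_lists accepts None, a single string, or a list of strings, code that consumes it has to handle all three shapes. The following is a minimal sketch of one way to normalize the value. The helper normalize_marker_lists is hypothetical and exists only for illustration; it is not part of sctoolbox or MarkerRepo.

```python
from typing import List, Optional, Union


def normalize_marker_lists(marker_lists: Optional[Union[str, List[str]]]) -> List[str]:
    """Return marker list paths as a list (hypothetical helper).

    None becomes an empty list, a single string becomes a one-element
    list, and a list is copied unchanged.
    """
    if marker_lists is None:
        return []
    if isinstance(marker_lists, str):
        return [marker_lists]
    return list(marker_lists)


print(normalize_marker_lists(None))
print(normalize_marker_lists("/path/my_markers"))
print(normalize_marker_lists(["/heart_markers/markers", "/human/panglao"]))
```

An empty list is falsy in Python, so a check such as `if not marker_lists:` behaves the same for None and for the normalized empty list.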

1.1.2 - Marker List Assembly: wrap.create_multiple_marker_lists

| Parameter | Description | Options |
|---|---|---|
| organism | Specifies the organism for marker list assembly. | None or String (e.g., "human") |
| column_specific_terms | Search terms for marker list assembly, targeting specific columns. | None or Dictionary (e.g., {"Source": "panglao.se"}) |
| cml_parameters | Additional parameters for marker list assembly. One marker list is created per dictionary. | None or List of dictionaries (e.g., [{"style":"two_column", "file_name":"two_column"}, {"style":"score", "file_name":"score"}]) |
| repo_path | Path to MarkerRepo. | String |
| lists_path | Path to a custom marker lists folder. If None, the lists folder of the repo_path will be used. | None or String (e.g., "/path/my_markers") |
| style | The style of the marker lists. Options include "two_column" and "score". | String |
| file_name | The name of the exported marker lists. | None (enter interactively) or String |
| ensembl | Use Ensembl IDs instead of gene symbols. | Boolean |
| force_homology | Create marker lists via homology even if lists for the organism exist. | Boolean |
| show_lists | Display the marker lists of the query. | Boolean |
| adata | Add marker list IDs to the .uns table of an AnnData object, if provided. | None or AnnData |

If column_specific_terms and cml_parameters are None, you can assemble marker lists interactively.

The following columns are currently available for the MarkerRepo query: "ID", "List name", "Date", "Source", "Organism name", "Taxonomy ID", "Submitter name", "Email", "Tags", "Genotype", "Gender", "Life stage", "Tissue" and more.
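
To make the idea of column-specific search terms concrete, here is a small sketch that filters a toy metadata table by such terms. It assumes case-insensitive substring matching, that a list of terms means "any of these", and that all queried columns must match. These matching rules, the helper names matches and query_lists, and the three example rows are assumptions made for illustration; MarkerRepo's actual query logic may differ.

```python
def matches(value, terms):
    """Case-insensitive substring match of any term against a cell value."""
    if isinstance(terms, str):
        terms = [terms]
    value = str(value).lower()
    return any(term.lower() in value for term in terms)


def query_lists(rows, column_specific_terms):
    """Keep rows where every queried column matches at least one of its terms."""
    return [
        row for row in rows
        if all(matches(row.get(col, ""), terms)
               for col, terms in column_specific_terms.items())
    ]


# Toy metadata rows, invented for this example.
rows = [
    {"ID": "L1", "Source": "panglao.se", "Organism name": "human", "Tissue": "blood"},
    {"ID": "L2", "Source": "panglao.se", "Organism name": "mouse", "Tissue": "skin"},
    {"ID": "L3", "Source": "in-house", "Organism name": "human", "Tissue": "skin"},
]

hits = query_lists(rows, {"Organism name": "human", "Tissue": ["skin", "blood"]})
print([row["ID"] for row in hits])
```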

1.1.3 - Annotation Parameters: wrap.run_annotation

| Parameter | Description | Options/Type |
|---|---|---|
| adata | The AnnData object to annotate. | AnnData object |
| marker_repo | Use MarkerRepo for annotation. | Boolean |
| SCSA | Use SCSA for annotation. | Boolean |
| marker_lists | Paths to marker list files. | String or list of Strings (e.g., "/path/my_markers" or ["/heart_markers/markers", "/human/panglao"]) |
| mr_obs | .obs prefix for MarkerRepo annotation. | String (e.g., "mr") |
| scsa_obs | .obs prefix for SCSA annotation. | String (e.g., "scsa") |
| rank_genes_column | Column of .uns table with rank genes scores. If None, the ranking will be performed on the clustering_column. | None or String |
| clustering_column | .obs column used for cell type assignment. | None (select interactively) or String (e.g., "leiden") |
| reference_obs | A reference annotation in .obs for comparison. | None or String |
| keep_all | If True, keeps all annotation columns. | Boolean |
| verbose | Enables printing of additional information. | Boolean |
| show_ct_tables | Shows additional MarkerRepo annotation tables with the first five top-ranked cell types per cluster. | Boolean |
| show_plots | Displays UMAP plots of the annotation, if available. | Boolean |
| show_comparison | Displays all annotations in one table. | Boolean |
| ignore_overwrite | Overwrites existing files without confirmation if True. | Boolean |
| celltype_column_name | Name for the column with the final cell type annotation. If None, keeps all annotation columns. | None or String (e.g., "pred_celltype") |

For more information, see the MarkerRepo documentation.
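
Several annotation parameters refer to .obs columns by name, so it can help to check them before starting a long annotation run. The sketch below shows one possible pre-flight check. The helper check_annotation_inputs is hypothetical, and the rule that at least one of marker_repo or SCSA should be enabled is an assumption, not documented MarkerRepo behavior.

```python
def check_annotation_inputs(obs_columns, clustering_column,
                            marker_repo=True, SCSA=True, reference_obs=None):
    """Collect problems with annotation settings (hypothetical helper)."""
    problems = []
    # Assumption: at least one annotation method should be switched on.
    if not (marker_repo or SCSA):
        problems.append("neither marker_repo nor SCSA is enabled")
    # None means the column is selected interactively, so only check strings.
    if clustering_column is not None and clustering_column not in obs_columns:
        problems.append(f"clustering_column {clustering_column!r} not found in .obs")
    if reference_obs is not None and reference_obs not in obs_columns:
        problems.append(f"reference_obs {reference_obs!r} not found in .obs")
    return problems


# Toy column names; in the notebook these would come from list(adata.obs.columns).
obs_columns = ["sample", "leiden_0.1"]
print(check_annotation_inputs(obs_columns, "leiden_0.1"))
print(check_annotation_inputs(obs_columns, "louvain", reference_obs="truth"))
```

Returning a list of problems rather than raising on the first one lets all misconfigured settings be reported at once.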


2 - Setup

[ ]:
import sctoolbox.utilities as utils
import pandas as pd
pd.set_option('display.max_columns', None)  # no limit to the number of columns shown
[ ]:
try:
    import markerrepo.wrappers as wrap
    import markerrepo.marker_repo as mr
except ModuleNotFoundError:
    raise ModuleNotFoundError("Please install the latest MarkerRepo version.")

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

# sctoolbox settings
settings.adata_input_dir = "../adatas/"
settings.adata_output_dir = "../adatas/"

clustered_adata = "anndata_4.h5ad"

3 - Loading adata

[ ]:
adata = utils.load_h5ad(clustered_adata)
[ ]:
with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

4 - Essential Input

⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

# Annotation settings
clustering_column = "leiden_0.1"
celltype_column_name = None
marker_lists = None

# Marker list assembly
if not marker_lists:
    organism = "human"
    column_specific_terms = {"Organism name":organism, "Source":"panglao"}

    cml_parameters = [{"file_name":"panglao_two_column", "style":"two_column"},
                      {"file_name":"panglao_score", "style":"score"},
                      {"file_name":"tissues_two_column", "style":"two_column",
                       "column_specific_terms":{"Tissue":["skin", "blood"]}}]

    repo_path = "./test_data/marker_repo/"
    lists_path = "./test_data/marker_repo/marker_lists/"
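
Since one marker list is created per dictionary in cml_parameters, a typo in a style or a repeated file_name is easy to miss. The following sketch checks the entries before assembly. It assumes that "two_column" and "score" are the only valid styles and that duplicate file names are unwanted; both are assumptions, and check_cml_parameters is a hypothetical helper rather than a MarkerRepo function.

```python
# Assumption: these are the only two styles (the ones named in section 1.1.2).
ALLOWED_STYLES = {"two_column", "score"}


def check_cml_parameters(cml_parameters):
    """Return a list of problems found in cml_parameters (empty if none)."""
    problems = []
    seen_names = set()
    for i, params in enumerate(cml_parameters):
        style = params.get("style")
        if style not in ALLOWED_STYLES:
            problems.append(f"entry {i}: unknown style {style!r}")
        # file_name may be None (entered interactively), so only track real names.
        name = params.get("file_name")
        if name in seen_names:
            problems.append(f"entry {i}: duplicate file_name {name!r}")
        elif name is not None:
            seen_names.add(name)
    return problems


example = [
    {"file_name": "panglao_two_column", "style": "two_column"},
    {"file_name": "panglao_score", "style": "score"},
    {"file_name": "tissues_two_column", "style": "two_column",
     "column_specific_terms": {"Tissue": ["skin", "blood"]}},
]
print(check_cml_parameters(example))
```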

5 - Assemble marker lists

The marker list paths are stored in the marker_lists variable. They serve as input for the cell type annotation in the next section.

[ ]:
if not marker_lists:
    marker_lists = wrap.create_multiple_marker_lists(
        cml_parameters=cml_parameters,
        repo_path=repo_path,
        lists_path=lists_path,
        organism=organism,
        ensembl=mr.check_ensembl(adata),
        column_specific_terms=column_specific_terms,
        show_lists=True
    )
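
Before passing marker_lists on to the annotation step, it may be worth confirming that every path points to something on disk. This is a minimal sketch under the assumption that each entry is a path to an existing file or folder; missing_marker_lists is a hypothetical helper, and the temporary files exist only for the demonstration.

```python
import tempfile
from pathlib import Path


def missing_marker_lists(marker_lists):
    """Return the marker list paths that do not exist (hypothetical helper)."""
    if isinstance(marker_lists, str):
        marker_lists = [marker_lists]
    # None is treated as "no lists provided".
    return [path for path in (marker_lists or []) if not Path(path).exists()]


# Demonstration with a temporary folder: one file is created, one is not.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "panglao_two_column"
    existing.write_text("toy content\n")
    absent = Path(tmp) / "not_there"
    result = missing_marker_lists([str(existing), str(absent)])
    print(result)
```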

6 - Annotate adata


⬐ Fill in input data here ⬎

[ ]:
%bgcolor PowderBlue

marker_repo = True
SCSA = True
mr_obs = "MR"
scsa_obs = "SCSA"
rank_genes_column = None
reference_obs = None
show_comparison = True
ignore_overwrite = True
show_plots = True

[ ]:
wrap.run_annotation(
    adata,
    marker_repo=marker_repo,
    SCSA=SCSA,
    marker_lists=marker_lists,
    mr_obs=mr_obs,
    scsa_obs=scsa_obs,
    rank_genes_column=rank_genes_column,
    clustering_column=clustering_column,
    reference_obs=reference_obs,
    show_comparison=show_comparison,
    ignore_overwrite=ignore_overwrite,
    show_plots=show_plots,
    celltype_column_name=celltype_column_name
)

6.1 - Show annotated .obs table

[ ]:
display(adata.obs)

7 - Save adata

[ ]:
utils.save_h5ad(adata, "anndata_annotated.h5ad")