[ ]:

from sctoolbox.utilities import bgcolor
from sctoolbox import settings

Preparing adata for cellxgene / MaMPlan creation

1 - Description

1.1 - Preparing for cellxgene

This notebook prepares the anndata object for cellxgene. The preparation includes:

- Removing unnecessary data to keep the resulting h5ad file as small as possible
- Renaming columns for a nicer presentation in cellxgene
- Converting unsupported datatypes to supported datatypes
- Additional fixes for bugs between scanpy, anndata and cellxgene

1.2 - MaMPlan creation

Additionally, a MaMPlan can be created which is needed to deploy the dataset to the BCU repository using mampok or the BCU repository overlay.
A MaMPlan acts as the config file for each specific dataset. It holds a variety of different parameters needed by mampok and the BCU repository.
To simplify the creation process, only the important parameters need to be set. The other parameters receive a default value, which is often required.

See the mampok wiki for more detailed information about each parameter.
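
The idea of setting only a few parameters while the rest fall back to defaults can be sketched as follows. This is a minimal illustration and not mampok's implementation: the `DEFAULTS` values, the `REQUIRED` tuple and the `build_plan` helper are hypothetical, and the real defaults are documented in the mampok wiki.

```python
# Hypothetical defaults for illustration only; mampok defines its own.
DEFAULTS = {
    "label": None,
    "user": None,
    "cpu_limit": None,
    "mem_limit": None,
    "check_online": True,
}

# Assumption: these parameters have no sensible default and must be given.
REQUIRED = ("project_id", "tool", "cluster", "organization", "owner")


def build_plan(**params):
    """Return a config dict in which explicitly set values override defaults."""
    missing = [name for name in REQUIRED if params.get(name) is None]
    if missing:
        raise ValueError(f"Missing required parameters: {missing}")
    # In a dict merge the later dict wins, so user values replace defaults.
    return {**DEFAULTS, **params}


plan = build_plan(
    project_id="Test-ID",
    tool="cellxgene-fix",
    cluster="BN",
    organization=["AG-nerds"],
    owner="Test-owner",
    mem_limit=8,
)
print(plan["mem_limit"])  # 8 (set explicitly)
print(plan["cpu_limit"])  # None (taken from the hypothetical defaults)
```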

1.2.1 - Parameters

Parameter	Description	Options
project_id	Project ID, e.g. ‘ext123’, ‘dst123’	str
tool	Select the cellxgene docker container.	‘cellxgene-new’, ‘cellxgene-fix’, ‘cellxgene-vip-latest’
cluster	Select the kubernetes cluster.	‘BN’, ‘GI’, ‘GWDG’, ‘GWDGmanagt’, ‘BN_public’
organization	Select organizations related to the project. Every user in one of the organizations will be able to access the dataset via the BCU repository.	Options
label	Set label shown in the browser tab.	str
user	List of users who, in addition to the organization, get access to the dataset via the BCU repository.	List of LDAP user IDs
owner	Owner / responsible person of the dataset. Set to public for public datasets.	LDAP user ID or public
analyst	Analyst of the dataset. If None, the current user is set as analyst.	LDAP user ID, list of LDAP user IDs, or None
pubmedid	PubMed ID of public datasets.	PubMed ID
citation	Citation of public dataset.	str
cpu_limit	Set the limit of CPU cores that can be used by the deployment.	int
mem_limit	Set the limit (in GB) of memory that can be used by the deployment.	int
cpu_request	Set the requested number of CPU cores that can be used by the deployment.	int
mem_request	Set the requested amount (in GB) of memory that can be used by the deployment.	int
check_online	If True, validate certain parameters using an online database.	bool
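
Before creating the MaMPlan, it can help to check the values against the options listed in the table. The sketch below is a hypothetical local pre-check, not part of mampok. The allowed tools and clusters are copied from the table above. The rule that a request must not exceed its limit follows the usual Kubernetes convention and is assumed to apply here as well.

```python
# Options copied from the parameter table above.
ALLOWED_TOOLS = {"cellxgene-new", "cellxgene-fix", "cellxgene-vip-latest"}
ALLOWED_CLUSTERS = {"BN", "GI", "GWDG", "GWDGmanagt", "BN_public"}


def check_parameters(tool, cluster, cpu_request=None, cpu_limit=None,
                     mem_request=None, mem_limit=None):
    """Return a list of human-readable problems; the list is empty if all is fine."""
    problems = []
    if tool not in ALLOWED_TOOLS:
        problems.append(f"Unknown tool: {tool!r}")
    if cluster not in ALLOWED_CLUSTERS:
        problems.append(f"Unknown cluster: {cluster!r}")
    # Assumption: as in Kubernetes, a request must not exceed its limit.
    pairs = (("cpu", cpu_request, cpu_limit), ("mem", mem_request, mem_limit))
    for name, request, limit in pairs:
        if request is not None and limit is not None and request > limit:
            problems.append(
                f"{name}_request ({request}) exceeds {name}_limit ({limit})"
            )
    return problems


print(check_parameters("cellxgene-fix", "BN"))  # []
for problem in check_parameters("cellxgene", "BN", cpu_request=4, cpu_limit=2):
    print(problem)
```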

2 - Setup

[ ]:

import sctoolbox.utilities as utils
from packaging import version
import pandas as pd

3 - General Input

⬐ Fill in input data here ⬎

[ ]:

%bgcolor PowderBlue

# sctoolbox settings
settings.adata_input_dir = "../adatas/"
settings.adata_output_dir = "../adatas/cellxgene/"
settings.log_file = "../logs/prepare_for_cellxgene_log.txt"
last_notebook_adata = "anndata_4.h5ad"
datatype = "scRNA"

# MaMPlan options

check_online = True

## Project options
project_id = "Test-ID"
tool = "cellxgene-fix"  # or e.g. "cellxgene-vip-latest"
cluster = "BN"
organization = ["AG-nerds"]
label = None
user = None
owner = "Test-owner"
analyst = None

## Options for public datasets
pubmedid = None
citation = None

## Options for computational resource management

### Limit
cpu_limit = None
mem_limit = None
### Requested
cpu_request = None
mem_request = None

mamplan_filename = f"{project_id}_MaMPlan.yaml"

4 - Load anndata

[ ]:

adata = utils.load_h5ad(last_notebook_adata)
display(adata)

5 - Prepare adata for cellxgene

The cellxgene preparation removes all data from the anndata object that is not required for the cellxgene deployment.

This saves memory on the cluster and decreases runtime.

In addition, every invalid or problematic datatype is checked for and cast to a fitting datatype if possible.

Note: Keep in mind that the resulting adata object should not be used for further analysis.

[ ]:

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

⬐ Fill in input data here ⬎

[ ]:

%bgcolor PowderBlue

# Keep columns in adata.obs (Cell metadata)
keep_obs = [
    "sample",
    "batch",
    "celltype",
    "pct_counts_is_mito",
    "pct_counts_is_ribo",
    "phase",
    "clustering",
    "SCSA_pred_celltype",
    "marker_pred_celltype"
]

# Rename columns in adata.obs
rename_obs = {
    "sample": "Sample",
    "batch": "Batch",
    "celltype": "Celltype",
    "pct_counts_is_mito": "Mitochondrial content (%)",
    "pct_counts_is_ribo": "Ribosomal content (%)",
    "phase": "Phase",
    "clustering": "Final Clustering",
    "SCSA_pred_celltype": "Predicted Celltype (SCSA)",
    "marker_pred_celltype": "Predicted Celltype (Marker)"
}

# Keep columns in adata.var (Gene metadata)
# An empty list removes all columns
keep_var = []
rename_var = {}
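
A quick consistency check of these settings can catch typos before the cleanup runs. The sketch below is a hypothetical helper, not part of sctoolbox, and it makes no assumption about how `prepare_for_cellxgene` treats missing columns. The column names are example values; in the notebook, `list(adata.obs.columns)`, `keep_obs` and `rename_obs` would be passed instead.

```python
def check_column_config(available, keep, rename):
    """Report kept columns that do not exist and rename keys that are not kept."""
    missing = [col for col in keep if col not in available]
    not_kept = [col for col in rename if col not in keep]
    return missing, not_kept


# Example values for illustration only.
available = ["sample", "batch", "phase", "n_genes"]
keep = ["sample", "batch", "celltype"]
rename = {"sample": "Sample", "phase": "Phase"}

missing, not_kept = check_column_config(available, keep, rename)
print(missing)   # ['celltype']: requested but absent from the data
print(not_kept)  # ['phase']: renamed but not in the keep list
```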

5.1 - Add leiden columns

[ ]:

leiden_cols = [col for col in adata.obs.columns if col.startswith("leiden")]
keep_obs += leiden_cols
rename_obs |= {c: c.replace("_", " ").capitalize() for c in leiden_cols}
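
The effect of the renaming rule can be checked on a plain list of column names. The names below are made up for illustration. Note that `str.capitalize()` lowercases everything after the first character, and that the `|=` dict update used above requires Python 3.9 or newer.

```python
# Made-up column names standing in for adata.obs.columns.
columns = ["sample", "leiden_0.5", "leiden_res_1.0", "louvain_0.5"]

# Same selection and renaming rule as in the cell above.
leiden_cols = [col for col in columns if col.startswith("leiden")]
labels = {c: c.replace("_", " ").capitalize() for c in leiden_cols}

print(labels)
# {'leiden_0.5': 'Leiden 0.5', 'leiden_res_1.0': 'Leiden res 1.0'}
```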

5.2 - Clean up adata

[ ]:

utils.prepare_for_cellxgene(adata,
                           keep_obs=keep_obs,
                           keep_var=keep_var,
                           rename_obs=rename_obs,
                           rename_var=rename_var,
                           inplace=True)

[ ]:

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

5.3 - Save adata

[ ]:

# Save the data
adata_output = f"{project_id}_cellxgene.h5ad"
utils.save_h5ad(adata, adata_output)

6 - Write MaMPlan

[ ]:

try:
    import mampok
    import mampok.mamplan_creator as mc
except ModuleNotFoundError as err:
    raise ModuleNotFoundError("Please install the latest mampok version.") from err

# This notebook requires mampok >= 2.0.9
if version.parse(mampok.__version__) < version.parse("2.0.9"):
    raise ModuleNotFoundError("Please install the latest mampok version (>= 2.0.9).")
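
The version check above goes through `version.parse` rather than comparing the version strings directly. The sketch below shows why, using only the standard library. `version_tuple` is a hypothetical stand-in that handles purely numeric dotted versions; `packaging.version` also covers pre-releases and other suffixes.

```python
def version_tuple(text):
    """Turn a purely numeric dotted version such as '2.0.10' into (2, 0, 10)."""
    return tuple(int(part) for part in text.split("."))


# Strings compare character by character, so "1" < "9" decides the result.
print("2.0.10" < "2.0.9")  # True, although 2.0.10 is the newer release

# Tuples of integers compare numerically, component by component.
print(version_tuple("2.0.10") < version_tuple("2.0.9"))  # False
```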

[ ]:

mamplan = mc.SimpleMamplan(
    exp_id = project_id,
    files = adata_output,
    tool = tool,
    analyst = analyst if analyst else utils.get_user(),
    datatype = datatype,
    cluster = cluster,
    label = label,
    organization = organization,
    user = user,
    owner = owner,
    pubmedid = pubmedid,
    citation = citation,
    cpu_limit = cpu_limit,
    mem_limit = mem_limit,
    cpu_request = cpu_request,
    mem_request = mem_request,
    check_online = check_online
)

[ ]:

mamplan.save(f"{settings.adata_output_dir}/{project_id}")
