[ ]:
import sctoolbox
from sctoolbox.utils import bgcolor
01 - Assembling or loading anndata object
1 - Description
This notebook is dedicated to loading or creating an anndata object suitable for the subsequent analysis pipeline and optionally adding quality control-related metrics. The anndata object is prepared and finally stored as a .h5ad
file. Based on the available data files there are multiple options to create the anndata object. To satisfy all and especially ATAC-related functionalities indexes are prepared to hold barcodes and feature coordinates.
1.1 Sources to create the anndata object:
- ``.h5ad`` file:Choose this option if you have a
.h5ad
file. The file could be provided by a preprocessing pipeline, a public dataset or a preceding analysis. - Convert from R object:This option should be used if the data was processed using R. This can either be a
.rds
or.robj
file.
2.2 ATAC specific quality control metrics
- Nucleosome signal score:This ATAC specific quality control metric scores the fragment length distribution pattern of individual cells. This step utilizes PEAKQC for the score calculation. Choose if this score should be calculated and provide a file containing the fragments of the features. As input a bamfile with the raw reads or a bedfile containing fragments are suitable.
- Overlap with defined regions:Check for overlaps of the fragments with regions defined in a
.gtf
file. This could be promoters or transcription start sites. Choose if the overlap should be performed and provide the file holding the desired regions.
2 - Setup
[ ]:
# sctoolbox modules
import sctoolbox.calc_overlap_pct as overlap
import peakqc.fld_scoring as fld
from sctoolbox.qc_filter import *
from sctoolbox.atac_utils import *
import sctoolbox.utils as utils
utils.settings_from_config("config.yaml", key="01")
3 - Read in data
⬐ Fill in input data here ⬎
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
# Choose one option
# For option 1: The path to an existing .h5ad file
path_h5ad = "test_data/scatac_pbmc.h5ad"
# For option 2: This is the path to the Seurat (.rds, .robj) file
path_rds = ""
[ ]:
if sum(map(lambda x: x != "", [path_h5ad, path_rds])) != 1:
del path_h5ad, path_quant, path_mtx, path_rds
raise ValueError("Please set only one of the above variables. Adjust the cell above and re-run.")
3.1 - Option 1: Read from h5ad
[ ]:
if path_h5ad:
adata = utils.load_h5ad(path_h5ad)
3.2 - Option 2: Convert from Seurat to anndata object
[ ]:
# Converting from Seurat to anndata object
if path_rds:
adata = converter.convertToAdata(file=path_rds)
4 - Prepare anndata
Rename or remove .obs
and .var
columns as needed and format their indices. After this step the index of .var
holds the feature coordinates and .obs
the cell barcodes.
[ ]:
import pandas as pd
with pd.option_context('display.max_rows', 5,'display.max_columns', None):
display(adata.obs)
display(adata.var)
⬐ Fill in input data here ⬎
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
## 1. Modify existing columns
# .obs column names that should be deleted
drop_obs = []
# .obs column names that should be changed. E.g. "old_name": "New Name"
rename_obs = {}
# .var column names that should be deleted
drop_var = []
# .var column names that should be changed. E.g. "old_name": "New Name"
rename_var = {}
## 2. ATAC specific anndata properties
# columns where peak location data is stored (['chr', 'start', 'end'])
coordinate_cols = ['chr', 'start', 'end'] # (list:str)
# should the adata.var index be formatted, that it matches chr:start-stop
set_index = True # (boolean)
# should the .var index be generated from a certain column. Otherwise this is None
index_from = None # (str)
4.1 - Rename and delete columns
[ ]:
# change obs
obs = adata.obs.copy()
obs.drop(columns=drop_obs, inplace=True)
obs.rename(columns=rename_obs, errors='raise', inplace=True)
# change var
var = adata.var.copy()
var.drop(columns=drop_var, inplace=True)
var.rename(columns=rename_var, errors='raise', inplace=True)
# apply changes to adata
adata.obs = obs
adata.var = var
4.2 - Format anndata indices
[ ]:
adata = utils.prepare_atac_anndata(adata,
coordinate_cols=coordinate_cols,
set_index=set_index,
index_from=index_from)
5 - Add ATAC specific metrices
Add ATAC specific QC-metrics to the .obs
table.
⬐ Fill in input data here ⬎
⬐ Fill in input data here ⬎
[ ]:
%bgcolor PowderBlue
## 1. Source of fragments
# Either provide a bamfile or a bedfile containing fragments
fragments_file = 'test_data/scatac_pbmc_fragments.bed'
# If a bamfile was provided, which column contains the barcode information
barcode_tag = 'CB'
## 2. Choose actions to be done
# 2.1 calculate fragment length distribution score
calculate_fld_score = True
# 2.2 calculate overlap between fragments and regions
calculate_overlap = True
# Additional settings for the overlap
region_name = 'promoters'
regions_file = 'test_data/homo_sapiens.104.promoters2000.gtf'
# Number of threads available
threads = 8
5.1 - Check barcode tag
If a bamfile is provided this checks if the barcodes are available in the anndata object
[ ]:
use_bam = fragments_file.endswith("bam")
if use_bam:
check_barcode_tag(adata, fragments_file, barcode_tag)
5.2 - Score fragment length distributions
[ ]:
if calculate_fld_score:
fld.add_fld_metrics(adata=adata,
fragments=fragments_file,
barcode_col=None,
barcode_tag=barcode_tag,
chunk_size_bam=1000000,
regions=None,
peaks_thr=10,
wavelength=150,
sigma=0.4,
plot=False,
save_density=None,
save_overview=None,
sample=0)
adata.obs
5.3 - Calculate an overlap
[ ]:
if calculate_overlap:
if use_bam:
fc_fragments_in_regions(adata=adata,
regions_file=regions_file,
bam_file=fragments_file,
cb_col=None,
cb_tag=barcode_tag,
regions_name=region_name,
threads=threads,
temp_dir=None)
else:
fc_fragments_in_regions(adata=adata,
regions_file=regions_file,
fragments_file=fragments_file,
cb_col=None,
regions_name=region_name,
threads=threads,
temp_dir=None)
adata.obs
[ ]:
adata.obs
6 - Saving the anndata object
[ ]:
# Overview of loaded adata
display(adata)
[ ]:
# Saving the data
adata_output = "anndata_1.h5ad"
utils.save_h5ad(adata, adata_output)
[ ]:
sctoolbox.settings.close_logfile()