Step 1: Peak pretrain (clustering)#

In this tutorial, we will introduce the peak pretraining step before applying the bridge integration, using the 10X Human PBMC data as an example. This function can also be used as a dimension reduction function independently.

[1]:
# !pip install --upgrade epipackpy
[2]:
import scanpy as sc
import numpy as np
import pandas as pd
[3]:
import epipackpy as epk
[4]:
epk.__version__
[4]:
'0.1.0dev5'

Read data and preprocessing#

First, we need to read the cell-by-peak matrix into Anndata structure. Since the dimension reduction is based on the binary matrix, the raw data file needs to be converted to the binary version by applying make_binary function. Here, we can directly read the data matrix .h5 file generated by 10X genomics Cellranger.

[5]:
peak_1 = epk.dt.read_10x_h5("data/pbmc_batch/pbmc_10k_nextgem/atac_pbmc_10k_nextgem_filtered_peak_bc_matrix.h5")
[6]:
peak_1 = epk.dt.make_binary(peak_1)

The raw count matrix will also be stored in the peak anndata at layers['raw'] variable.

[7]:
peak_1
[7]:
AnnData object with n_obs × n_vars = 9688 × 144023
    var: 'gene_ids', 'feature_types', 'genome'
    layers: 'raw'

Before training the binary autoencoder, we follow the same peak filtering process as PeakVI to filter out rarely detected peaks, to make the training faster.

[8]:
print("# regions before filtering:", peak_1.shape[-1])

# compute the threshold: 5% of the cells
min_cells = int(peak_1.shape[0] * 0.05)
# in-place filtering of regions
sc.pp.filter_genes(peak_1, min_cells=min_cells)

print("# regions after filtering:", peak_1.shape[-1])
# regions before filtering: 144023
# regions after filtering: 36728

Model training#

Next, we will start training the binary autoencoder model to obtain batch-specific latent embedding. If the peak matrix is in dense matrix format, we need to input anndata.X.todense() to the model.

[9]:
# model initialization
pretrain_peak_model = epk.ml.Peak_Model(count_enhancer=peak_1.X.todense())
- Model initializing...
- [Auto-detection]: GPU detetced. Model will be initialized in GPU mode.
- Model intialization completed.
[10]:
# model training
pretrain_peak_model.train_model(nepochs=200)
Epochs: 100%|██████████| 200/200 [05:03<00:00,  1.52s/it, loss_val=4.07e+3]

Visualizing the latent space#

We can visualize the latent space before integration or any other downstream analysis. We first extract the latent embedding.

[11]:
z = pretrain_peak_model.get_z()
peak_1.obsm['Z_epipack'] = z.detach().cpu().numpy()

Then we can cluster and visualize it by using Scanpy.

[12]:
# dimension reduction
sc.pp.neighbors(peak_1, use_rep='Z_epipack')
sc.tl.umap(peak_1)

# clustering
sc.tl.leiden(peak_1)
/nethome/ycheng430/miniconda3/envs/test/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[13]:
sc.pl.umap(peak_1, color=['leiden'])
/nethome/ycheng430/miniconda3/envs/test/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
/nethome/ycheng430/miniconda3/envs/test/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/clustering_tutorial_20_1.png
[ ]: