crispyx Tutorial

Genome-wide CRISPR screens routinely produce datasets with hundreds of thousands of cells and tens of thousands of genes. Standard single-cell toolkits (Scanpy, Pertpy) load the entire count matrix into memory, which can require 30–100+ GB of RAM. crispyx streams data directly from on-disk AnnData .h5ad files so that QC, normalisation, pseudo-bulk aggregation, and differential expression all run without materialising the full matrix.

This tutorial walks through a complete crispyx workflow on the included Adamson subset dataset, covering:

  1. Data loading and inspection

  2. Data preparation utilities

  3. Quality control

  4. Streaming normalisation

  5. Dimension reduction (PCA, KNN, UMAP)

  6. Pseudo-bulk aggregation

  7. Differential expression (Wilcoxon and NB-GLM)

  8. LFC shrinkage

  9. Plotting

For the full API reference, see the documentation.

Setup

Install the project in editable mode (with the optional test extras) to make the package importable:

pip install -e .[test]

1. Load the dataset

This tutorial uses the included Adamson subset stored at data/Adamson_subset.h5ad. cx.read_h5ad_ondisk returns a read-only wrapper — the expression matrix stays on disk.

[1]:
import sys
from pathlib import Path
import pandas as pd

ROOT = Path('..').resolve()
sys.path.insert(0, str(ROOT))
sys.path.insert(0, str(ROOT / 'src'))
DATA_PATH = ROOT / 'data' / 'Adamson_subset.h5ad'
OUTPUT_DIR = ROOT / 'docs' / 'tutorial_outputs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

import crispyx as cx

if not DATA_PATH.exists():
    raise FileNotFoundError(f'Expected dataset at {DATA_PATH}')

Inspect metadata

The obs and var accessors expose head and load helpers so you can preview metadata without materialising the entire object.

[2]:
adata_ro = cx.read_h5ad_ondisk(DATA_PATH)
adata_ro
AnnData object with n_obs × n_vars = 1716 × 11630 backed at '/Users/dujinhong/Library/CloudStorage/OneDrive-TheUniversityOfHongKong/Streamlining-CRISPR-Screen-Analysis/Streamlining-CRISPR-Screen-Analysis/data/Adamson_subset.h5ad'
    obs: 'perturbation', 'n_genes', 'dataset', 'celltype'
    var: 'Ensembl_ID'
First obs rows:
                 perturbation  n_genes  dataset celltype
Cell_barcodes
AAACATACAAGATG-1      control     2914  Adamson     K562
AAACATTGCATTGG-1      control     3999  Adamson     K562
AAACGGCTAATGCC-1       AMIGO3     3947  Adamson     K562
AAAGACGAACTCAG-1      control     2713  Adamson     K562
AAAGAGACTCGCCT-1      control     3678  Adamson     K562
First var rows:
                  Ensembl_ID
Gene_symbol
LINC00115    ENSG00000225880
NOC2L        ENSG00000188976
KLHL17       ENSG00000187961
HES4         ENSG00000188290
ISG15        ENSG00000187608
[2]:
AnnData(path=/Users/dujinhong/Library/CloudStorage/OneDrive-TheUniversityOfHongKong/Streamlining-CRISPR-Screen-Analysis/Streamlining-CRISPR-Screen-Analysis/data/Adamson_subset.h5ad, mode='r')
[3]:
adata_ro.obs.head()
[3]:
perturbation n_genes dataset celltype
Cell_barcodes
AAACATACAAGATG-1 control 2914 Adamson K562
AAACATTGCATTGG-1 control 3999 Adamson K562
AAACGGCTAATGCC-1 AMIGO3 3947 Adamson K562
AAAGACGAACTCAG-1 control 2713 Adamson K562
AAAGAGACTCGCCT-1 control 3678 Adamson K562
[4]:
adata_ro.var.head()
[4]:
Ensembl_ID
Gene_symbol
LINC00115 ENSG00000225880
NOC2L ENSG00000188976
KLHL17 ENSG00000187961
HES4 ENSG00000188290
ISG15 ENSG00000187608

2. Data Preparation Utilities

crispyx provides helpers to inspect and standardise metadata in backed h5ad files without loading the expression matrix.

[5]:
# Feature 1: load obs/var without reading X
obs = cx.load_obs(DATA_PATH)
print('obs columns:', list(obs.columns))
print('obs shape:  ', obs.shape)

var = cx.load_var(DATA_PATH)
print('var columns:', list(var.columns))
print('var shape:  ', var.shape)
obs columns: ['perturbation', 'n_genes', 'dataset', 'celltype']
obs shape:   (1716, 4)
var columns: ['Ensembl_ID']
var shape:   (11630, 1)
[6]:
# Feature 4: auto-detect perturbation and gene columns
cols = cx.infer_columns(DATA_PATH)
print('Detected columns:', cols)
Detected columns: {'perturbation_column': 'perturbation', 'gene_name_column': None}
[7]:
# Feature 3: normalise perturbation labels (inplace=False for preview)
normalised = cx.normalise_perturbation_labels(
    DATA_PATH,
    column='perturbation',
    canonical_control='NTC',
    inplace=False,
)
print(normalised.value_counts().head(10))
perturbation
AMIGO3    750
NTC       500
AARS      466
Name: count, dtype: int64
[8]:
# Feature 5: overlap analysis between gene sets from different perturbations
# (Using toy sets for illustration; in practice, fill with DE-significant genes)
de_sets = {
    'AARS hits':  {'ATF4', 'DDIT3', 'ASNS', 'EIF2AK4', 'SLC7A5'},
    'AMIGO3 hits': {'ATF4', 'DDIT3', 'CDK4', 'CDKN1A', 'EGFR'},
    'Random set': {'BRCA1', 'TP53', 'KRAS', 'EGFR', 'MYC'},
}

result = cx.tl.compute_overlap(de_sets)
print('Jaccard matrix:')
print(result.jaccard_matrix.round(3))
print('\nSet sizes:', result.set_sizes.to_dict())

ax = cx.pl.overlap_heatmap(result, metric='jaccard', title='Gene-set overlap (Jaccard)')
Jaccard matrix:
             AARS hits  AMIGO3 hits  Random set
AARS hits         1.00        0.250       0.000
AMIGO3 hits       0.25        1.000       0.111
Random set        0.00        0.111       1.000

Set sizes: {'AARS hits': 5, 'AMIGO3 hits': 5, 'Random set': 5}
_images/crispyx_tutorial_12_1.png

3. Quality Control

The cx.pp preprocessing namespace mirrors Scanpy’s API. Each function returns a new cx.AnnData object backed by an .h5ad file on disk.

[9]:
qc_result = cx.quality_control_summary(
    adata_ro,
    perturbation_column='perturbation',
    min_genes=5,
    min_cells_per_perturbation=5,
    output_dir=OUTPUT_DIR,
    data_name='tutorial',
)
qc_preview = pd.DataFrame(
    {
        'perturbation': adata_ro.obs['perturbation'],
        'qc_pass': qc_result.cell_mask,
    }
)
adata_ro = qc_result.filtered
print(f"QC result: {qc_result.cell_mask.sum()} / {len(qc_result.cell_mask)} cells passed")
qc_preview.head(5)
QC result: 1716 / 1716 cells passed
[9]:
perturbation qc_pass
Cell_barcodes
AAACATACAAGATG-1 control True
AAACATTGCATTGG-1 control True
AAACGGCTAATGCC-1 AMIGO3 True
AAAGACGAACTCAG-1 control True
AAAGAGACTCGCCT-1 control True

4. Streaming Normalisation

For large datasets that would OOM during normalisation, use the streaming preprocessing function. This is the equivalent of scanpy.pp.normalize_total() + scanpy.pp.log1p() but processes data in chunks.

[10]:
# Create a normalized + log1p version of the QC-filtered data (streaming)
# This works on huge datasets without loading into memory
adata_norm = cx.pp.normalize_total_log1p(
    adata_ro,  # Pass the AnnData object directly (same as other cx.pp functions)
    output_dir=OUTPUT_DIR,
    data_name='tutorial',
    target_sum=1e4,  # Same as scanpy default
    verbose=True,
)

# Verify the output - adata_norm is a read-only AnnData wrapper
print(f"Shape: {adata_norm.shape}")
print(f"First 3 cells, first 3 genes:\n{adata_norm.to_memory().X[:3, :3].toarray()}")
Generating preprocessed dataset (streaming, normalize+log1p): /Users/dujinhong/Library/CloudStorage/OneDrive-TheUniversityOfHongKong/Streamlining-CRISPR-Screen-Analysis/Streamlining-CRISPR-Screen-Analysis/docs/tutorial_outputs/crispyx_tutorial_normalized_log1p.h5ad
  ✓ Preprocessed dataset written: 1716 cells × 9238 genes
Shape: (1716, 9238)
First 3 cells, first 3 genes:
[[0.75838053 0.         0.        ]
 [0.49522132 0.         0.        ]
 [1.4766263  0.         0.5161892 ]]

5. Dimension Reduction

crispyx provides streaming PCA and KNN graph construction that works on backed (on-disk) AnnData objects.

  • Streaming PCA: Automatically selects the optimal method based on gene count (sparse_cov for ≤15K genes, incremental for larger).

  • KNN neighbors: Builds k-nearest neighbor graph from PCA embeddings for downstream clustering/UMAP.

Results are written directly to the h5ad file using a close-write-reopen pattern.

[11]:
# Streaming PCA on the normalized data
# Results are written directly to the h5ad file (close-write-reopen pattern)
cx.pp.pca(adata_norm, n_comps=20, show_progress=True)

# Check PCA results (data is read from the h5ad file)
print(f"PCA embeddings shape: {adata_norm.obsm['X_pca'].shape}")
print(f"Variance explained: {adata_norm.uns['pca']['variance_ratio'][:5]}")  # First 5 components

# Build KNN graph from PCA embeddings (also writes to h5ad)
cx.pp.neighbors(adata_norm, n_neighbors=10, method='sklearn', show_progress=True)

print(f"KNN distances shape: {adata_norm.obsp['distances'].shape}")
print(f"Number of connections: {adata_norm.obsp['connectivities'].nnz}")
PCA embeddings shape: (1716, 20)
Variance explained: [0.02978331 0.01700597 0.01289528 0.01072035 0.00821617]
KNN distances shape: (1716, 1716)
Number of connections: 25228

PCA Visualization

The cx.pl namespace provides Scanpy-style plotting wrappers:

  • cx.pl.pca(): Scatter plot of PCA embeddings

  • cx.pl.pca_variance_ratio(): Explained variance per component

  • cx.pl.pca_loadings(): Gene loadings for top components

[12]:
# Plot PCA variance explained
cx.pl.pca_variance_ratio(adata_norm, n_pcs=15)

# Plot PCA scatter colored by perturbation
cx.pl.pca(adata_norm, color='perturbation', components='1,2')

# Plot gene loadings for first 3 components
cx.pl.pca_loadings(adata_norm, components=[1, 2, 3])
_images/crispyx_tutorial_20_0.png
_images/crispyx_tutorial_20_1.png
_images/crispyx_tutorial_20_2.png

UMAP Visualization

After computing PCA and building the neighbor graph, we can compute UMAP embeddings for visualization. cx.tl.umap() uses the pre-computed neighbor graph from cx.pp.neighbors(), so it only loads the neighbor matrix (not the full expression matrix) into memory.

This is memory-efficient: approximately 0.75 MB per 1000 cells for 15 neighbors.

[13]:
# Compute UMAP from neighbor graph (memory-efficient)
cx.tl.umap(adata_norm, min_dist=0.5, spread=1.0)

# Check UMAP results
print(f"UMAP embedding shape: {adata_norm.obsm['X_umap'].shape}")

# Plot UMAP colored by perturbation
cx.pl.umap(adata_norm, color='perturbation')
WARNING: .obsp["connectivities"] have not been computed using umap
UMAP embedding shape: (1716, 2)
_images/crispyx_tutorial_22_1.png

Pseudobulk aggregation with cx.pb

Aggregate cells per perturbation on disk. The returned object can be inspected lazily, similar to the original dataset.

[14]:
adata_pb_ro = cx.pb.average_log_expression(
    adata_ro,
    perturbation_column='perturbation',
    output_dir=OUTPUT_DIR,
    data_name='tutorial',
)
print(f"Pseudobulk shape: {adata_pb_ro.shape}")
adata_pb_ro.var.head(5)
Pseudobulk shape: (2, 9238)
[14]:
Gene_symbol
NOC2L
KLHL17
HES4
ISG15
AGRN

Differential expression with cx.tl

Run a Wilcoxon test that mirrors scanpy.tl.rank_genes_groups. cx.AnnData.uns entries also provide preview and load helpers.

[15]:
adata_ro = cx.tl.rank_genes_groups(
    adata_norm,  # Wilcoxon expects log-normalized data
    perturbation_column='perturbation',
    method='wilcoxon',
    output_dir=OUTPUT_DIR,
    data_name='tutorial',
)

# Preview a tidy DE table using plotting helper (no scanpy_format needed)
group = adata_ro.obs.load()['perturbation'].astype(str).unique()[0]
cx.pl.rank_genes_groups_df(adata_ro, group=group, n_genes=10).head()

[15]:
names scores logfoldchanges pvals pvals_adj pts pts_rest group
0 SET -10.899647 -0.506634 1.157049e-27 8.419143e-24 0.990667 0.988 AMIGO3
1 MT-CO2 10.858218 0.402062 1.822720e-27 8.419143e-24 0.998667 0.998 AMIGO3
2 RPS27A 10.551140 0.267033 5.018421e-26 1.022991e-22 1.000000 1.000 AMIGO3
3 EIF1 10.539785 0.288907 5.662821e-26 1.022991e-22 1.000000 1.000 AMIGO3
4 MT-CYB 10.534347 0.327073 5.999814e-26 1.022991e-22 0.998667 0.998 AMIGO3

Use load to materialise the complete differential expression tables for downstream analysis.

[16]:
group = adata_ro.obs.load()['perturbation'].astype(str).unique()[0]
top_df = cx.pl.rank_genes_groups_df(adata_ro, group=group, n_genes=5)
print(f"Top 5 DE genes for perturbation '{group}':")
top_df[['names', 'logfoldchanges', 'pvals_adj']]

Top 5 DE genes for perturbation 'AMIGO3':
[16]:
names logfoldchanges pvals_adj
0 SET -0.506634 8.419143e-24
1 MT-CO2 0.402062 8.419143e-24
2 RPS27A 0.267033 1.022991e-22
3 EIF1 0.288907 1.022991e-22
4 MT-CYB 0.327073 1.022991e-22

The cx.AnnData handles close themselves when their Python objects go out of scope, so no explicit cleanup is required.

NB-GLM Differential Expression

For more sophisticated differential expression analysis, crispyx provides a negative binomial GLM method that models count data directly. This is especially useful when you need to incorporate covariates or want more accurate statistical modeling.

Note: NB-GLM operates on raw counts (not log-normalized data), so we use the QC-filtered dataset directly.

[17]:
# Run NB-GLM test on raw counts
# First, reload the QC-filtered dataset (not log-normalized)
qc_path = OUTPUT_DIR / 'crispyx_tutorial_filtered.h5ad'
adata_qc = cx.read_h5ad_ondisk(qc_path)

adata_nb = cx.tl.rank_genes_groups(
    adata_qc,
    perturbation_column='perturbation',
    method='nb_glm',
    output_dir=OUTPUT_DIR,
    data_name='tutorial_nb',
)
group = adata_nb.obs.load()['perturbation'].astype(str).unique()[0]
cx.pl.rank_genes_groups_df(adata_nb, group=group, n_genes=5).head()

AnnData object with n_obs × n_vars = 1716 × 9238 backed at '/Users/dujinhong/Library/CloudStorage/OneDrive-TheUniversityOfHongKong/Streamlining-CRISPR-Screen-Analysis/Streamlining-CRISPR-Screen-Analysis/docs/tutorial_outputs/crispyx_tutorial_filtered.h5ad'
    obs: 'perturbation', 'n_genes', 'dataset', 'celltype'
    var: 'Ensembl_ID', 'gene_symbols'
First obs rows:
                 perturbation  n_genes  dataset celltype
Cell_barcodes
AAACATACAAGATG-1      control     2914  Adamson     K562
AAACATTGCATTGG-1      control     3999  Adamson     K562
AAACGGCTAATGCC-1       AMIGO3     3947  Adamson     K562
AAAGACGAACTCAG-1      control     2713  Adamson     K562
AAAGAGACTCGCCT-1      control     3678  Adamson     K562
First var rows:
                  Ensembl_ID gene_symbols
Gene_symbol
NOC2L        ENSG00000188976        NOC2L
KLHL17       ENSG00000187961       KLHL17
HES4         ENSG00000188290         HES4
ISG15        ENSG00000187608        ISG15
AGRN         ENSG00000188157         AGRN
[17]:
names scores logfoldchanges pvals pvals_adj pts pts_rest group
0 MT-CO2 10.753535 0.447715 5.703282e-27 5.268692e-23 0.998667 0.998 AMIGO3
1 HIST1H1E -10.207953 -0.838292 1.826781e-24 8.437901e-21 0.512000 0.738 AMIGO3
2 MT-CYB 10.039458 0.402310 1.022347e-23 3.148146e-20 0.998667 0.998 AMIGO3
3 HIST1H1D -9.473744 -0.889809 2.699881e-21 5.309861e-18 0.436000 0.656 AMIGO3
4 UQCRB 9.467219 0.537892 2.873923e-21 5.309861e-18 0.952000 0.918 AMIGO3

LFC Shrinkage with apeGLM

Apply adaptive shrinkage to log-fold changes using the apeGLM method. This shrinks noisy/uncertain LFCs toward zero while preserving large effects, following DESeq2/PyDESeq2 best practices.

The two-step workflow (NB-GLM → shrink_lfc) gives you control over when shrinkage is applied and allows you to compare shrunk vs unshrunk results.

[18]:
# Apply LFC shrinkage to NB-GLM results
nb_result_path = adata_nb.path

shrunk = cx.tl.shrink_lfc(
    nb_result_path,
    prior_scale_mode='global',  # Use global prior across all perturbations
)

# Compare shrunk vs raw LFC for a perturbation
import anndata
shrunk_adata = anndata.read_h5ad(nb_result_path)
print("Shrunk vs Raw LFC (first 5 genes, first perturbation):")
print(f"  Shrunk LFC: {shrunk_adata.X[0, :5]}")
print(f"  Raw LFC:    {shrunk_adata.layers['logfoldchange_raw'][0, :5]}")

Shrunk vs Raw LFC (first 5 genes, first perturbation):
  Shrunk LFC: [ 0.00091058 -0.02613842  0.0247642   0.0387339  -0.07999716]
  Raw LFC:    [ 0.00091058 -0.02613842  0.0247642   0.0387339  -0.07999716]

The shrunk LFC values are typically smaller in magnitude for genes with high uncertainty, while large, well-supported effects are preserved. This improves downstream ranking and visualization.

Plotting (Scanpy-style)

crispyx provides cx.pl helpers for plotting on-disk results. The examples below mirror Scanpy style while keeping the expression matrix on disk.

[19]:
# QC plots
cx.pl.qc_perturbation_counts(
    data=adata_qc,
    perturbation_column='perturbation',
    cell_mask=qc_result.cell_mask,
)
cx.pl.qc_summary(qc_result, min_genes=100, min_cells_per_gene=10)

# Rank genes groups plot
cx.pl.rank_genes_groups(adata_ro, n_genes=20, sharey=False)

# Volcano / top genes
plot_group = adata_ro.obs.load()['perturbation'].astype(str).unique()[0]
df = cx.pl.rank_genes_groups_df(adata_ro, group=plot_group, n_genes=200)
cx.pl.volcano(de_df=df, group=plot_group)
cx.pl.top_genes_bar(de_df=df, group=plot_group, topn=15)

# MA plot (raw counts from QC-filtered data)
try:
    control_label = adata_ro.uns['control_label'].load()
except KeyError:
    control_label = 'control'
cx.pl.ma(
    data=adata_qc,
    de_result=adata_ro,
    group=plot_group,
    reference=control_label,
    perturbation_column='perturbation',
    mean_mode='raw',
)

_images/crispyx_tutorial_36_0.png
_images/crispyx_tutorial_36_1.png
_images/crispyx_tutorial_36_2.png
[19]:
<Axes: title={'center': 'MA plot: AMIGO3 vs control'}, xlabel='log1p(mean expression)', ylabel='log2FC'>
_images/crispyx_tutorial_36_4.png
_images/crispyx_tutorial_36_5.png
_images/crispyx_tutorial_36_6.png