Usage Guide
crispyx provides a Scanpy-style API for streaming CRISPR screen analysis.
Each operation reads data from disk so large .h5ad files can be processed
on commodity hardware without loading the full count matrix into memory.
The typical workflow is:
Load – open a dataset on disk
QC – filter cells, perturbations, and genes
Preprocess – normalise and log-transform (streaming)
Dimension reduction – PCA and KNN graph construction
Pseudo-bulk – aggregate per perturbation
Differential expression – t-test, Wilcoxon, or NB-GLM
Plot – visualise results with Scanpy-style helpers
Quick start
import crispyx as cx
adata_ro = cx.read_h5ad_ondisk("data/demo_benchmark.h5ad")
adata_ro = cx.pp.qc_summary(
adata_ro,
perturbation_column="perturbation",
min_genes=100,
min_cells_per_perturbation=15,
min_cells_per_gene=10,
)
adata_pb = cx.pb.average_log_expression(
adata_ro,
perturbation_column="perturbation",
)
adata_ro = cx.tl.rank_genes_groups(
adata_ro,
perturbation_column="perturbation",
method="wilcoxon",
)
print(adata_ro.uns["rank_genes_groups"]) # preview without loading everything
de_results = adata_ro.uns["rank_genes_groups"].load()
de_full = de_results["full"]
var_table = adata_ro.var.load()
Setting up
Install the project in editable mode with optional dependencies:
pip install -e .[test]
Loading data
Use crispyx.read_h5ad_ondisk() to open a dataset without
materialising the expression matrix and print the first few rows of metadata.
A synthetic demo dataset can be generated via
python benchmarking/generate_demo_dataset.py and stored at
data/demo_benchmark.h5ad. The returned cx.AnnData object keeps a
backed AnnData handle alive lazily and automatically closes it when the wrapper
is garbage collected. Explicitly call adata.close() to release the file as
soon as you finish the preview.
Quality control
Call crispyx.pp.qc_summary() to filter cells, perturbations,
and genes. The function writes a filtered AnnData file to disk and returns a
new cx.AnnData view pointing at the result so the next step can reuse
the same handle without reopening the path. When control_label is omitted,
perturbations containing strings such as ctrl or nontarget are chosen
automatically and logged for reproducibility. Likewise, omitting
gene_name_column falls back to adata.var_names with a helpful message.
Individual helpers such as crispyx.pp.filter_cells() are also
available for customised pipelines.
Effect size estimation
Two complementary estimators are exposed through crispyx.pb:
crispyx.pb.average_log_expression()crispyx.pb.pseudobulk()
Each operates on the filtered dataset and produces an AnnData artifact
containing the effect sizes per perturbation. The control label inference and
gene name fallbacks described above apply here as well, so these functions can
operate with minimal boilerplate on well-annotated datasets. Passing a
cx.AnnData instance from the QC step avoids reopening the file path
manually, and the returned wrappers expose .obs/.var tables with
.load() helpers for materialising the full metadata only when requested.
Dimension Reduction
For visualization and clustering, CRISPYx provides streaming PCA and KNN graph construction that works with on-disk data:
# Streaming PCA (auto-selects optimal method based on gene count)
cx.pp.pca(adata_norm, n_comps=50)
# Build KNN graph from PCA embeddings
cx.pp.neighbors(adata_norm, n_neighbors=15)
The PCA implementation uses a hybrid approach:
Sparse covariance (
method='sparse_cov'): ~5× faster for datasets with ≤15K genes. Exploits sparsity in the Xᵀ @ X computation.IncrementalPCA (
method='incremental'): Lower memory for datasets with >15K genes. Uses sklearn’s streaming PCA with partial_fit().Automatic selection (
method='auto', default): Chooses the optimal method based on gene count and available memory.
PCA results are stored in:
adata.obsm['X_pca']: Cell embeddings (n_cells × n_comps)adata.varm['PCs']: Gene loadings (n_genes × n_comps)adata.uns['pca']: Variance info and method metadata
KNN results are stored in:
adata.obsp['distances']: Sparse distance matrixadata.obsp['connectivities']: Sparse connectivity matrix (UMAP-style)adata.uns['neighbors']: Parameters dict
Close-Write-Reopen Pattern: When using cx.read_h5ad_ondisk() to load
backed data, PCA and neighbors results are written directly to the h5ad file.
This keeps .X on disk while persisting embeddings, loadings, and neighbor
graphs for later use. No copy=True is needed in typical workflows.
CSC preprocessing for Wilcoxon
For large datasets, convert the preprocessed CSR file to CSC format before
running Wilcoxon DE. CSR storage forces a full scan of all data and
indices arrays for each gene chunk (O(total_nnz) per chunk); CSC storage
makes each chunk access O(nnz_in_chunk), so total I/O drops from
n_chunks × file_size to file_size. This gives approximately 18×
speedup on large screens (Feng-gwsf: 3.35 h → ~11 min):
# Convert normalized CSR file to CSC (streaming, no full-matrix load)
adata_csc = cx.pp.convert_to_csc(
adata_norm,
output_dir="results/",
)
# adata_csc is returned immediately and unchanged if input is already CSC.
# Use adata_csc as input to rank_genes_groups() for fast Wilcoxon.
The function auto-detects whether the source file is already CSC and returns
it unchanged with no I/O. In the benchmark pipeline, CSC conversion is
bundled inside crispyx_de_wilcoxon (via run_wilcoxon_with_csc) so
that the reported wall-time includes both conversion and DE — giving a single
honest total cost rather than a split accounting that would make Wilcoxon
appear faster than it is. Benchmark results include sub-columns
csc_conversion_seconds, wilcoxon_seconds, and was_already_csc
for fine-grained breakdown.
Note
For small datasets (files < ~12.5 GB), the CSC conversion overhead is negligible (< 1 s on fast NVMe) and the total wilcoxon time is dominated by process startup (2–3 s). For large screens (Feng-gwsf 15 GB, Feng-gwsnf 27 GB) the CSC conversion itself takes ~60–120 s but eliminates the ~18× repeated full-file scans that CSR imposes.
CSR preprocessing for NB-GLM
NB-GLM operations (size factors, control matrix loading, per-perturbation slicing) are all row-wise. CSC or dense storage makes each row access O(total_nnz) per slice, causing severe slowdowns or hangs. Convert to CSR before running NB-GLM:
# Convert standardised file to CSR (streaming, no full-matrix load)
adata_csr = cx.pp.convert_to_csr(
adata,
output_dir="results/",
)
# Returns immediately if already CSR.
# NB-GLM on CSR file
result = cx.nb_glm_test(
adata_csr,
perturbation_column="perturbation",
)
The function uses format-aware streaming: CSC sources are read in
column-chunks (axis=1) and scattered into CSR buffers; dense sources are
read in row-chunks (axis=0). For large datasets, the streaming control
statistics function automatically activates — fitting the intercept-only
model in chunks of 4,096 control cells instead of densifying the full
control matrix. This keeps peak memory at O(chunk_size × n_genes) rather
than O(n_control × n_genes). When freeze_control is auto-enabled,
streaming is used unconditionally. DESeq2-style size factors
(size_factor_method="deseq2") also stream automatically when the
intermediate counts array would exceed 4 GB, computing geometric means
and per-cell median ratios in chunks. After each streaming phase,
drop_file_cache() evicts file data from the kernel page cache so that
cgroup-limited environments (e.g. SLURM) do not count cached pages toward
the memory limit.
Note
If nb_glm_test() detects CSC storage, it emits a UserWarning
advising conversion to CSR. In the benchmark pipeline, CSR conversion
is handled automatically via the crispyx_standardize_csr step.
Differential expression
Invoke crispyx.tl.rank_genes_groups() to compare perturbations
against the control population while matching the familiar
scanpy.tl.rank_genes_groups() interface. Choose method="wilcoxon" (the
default) for a Mann-Whitney U test, method="wald" for the streaming Wald
test, or method="nb_glm" to fit the negative binomial GLM that supports
covariates. The helper reuses the automatic control inference so a missing
control_label triggers the same adaptive search used earlier, and the
returned cx.AnnData wrapper stores previews of the results in
.uns so printing adata.uns["rank_genes_groups"] shows the top genes
per perturbation while .load() retrieves the full tables on demand.
Scanpy-compatible parameter aliases
All three DE functions (crispyx.tl.t_test(), crispyx.tl.wilcoxon_test(),
crispyx.tl.nb_glm_test()) and crispyx.tl.rank_genes_groups() accept
the Scanpy-style aliases groupby (for perturbation_column) and
reference (for control_label), making it easier to port code from
Scanpy workflows:
# Scanpy-style call (aliases)
result = cx.tl.rank_genes_groups(
adata_ro,
groupby="perturbation", # alias for perturbation_column
reference="ctrl", # alias for control_label
method="wilcoxon",
)
# Equivalent canonical call
result = cx.tl.rank_genes_groups(
adata_ro,
perturbation_column="perturbation",
control_label="ctrl",
method="wilcoxon",
)
The canonical names (perturbation_column and control_label) remain the
primary names and are not deprecated. Supplying both a canonical name and its
alias at the same time raises TypeError.
crispyx provides Scanpy-style plotting helpers under cx.pl that work with
on-disk results. The plotting functions materialise only the metadata needed
for plotting, keeping the expression matrix on disk.
PCA Visualization
# Run PCA first (if not already done)
cx.pp.pca(adata_norm, n_comps=50)
# Plot variance explained per component
cx.pl.pca_variance_ratio(adata_norm, n_pcs=20)
# PCA scatter colored by perturbation
cx.pl.pca(adata_norm, color='perturbation', components='1,2')
# Gene loadings for top components
cx.pl.pca_loadings(adata_norm, components=[1, 2, 3])
Differential Expression Visualization
# Rank genes groups plot (Scanpy-style)
cx.pl.rank_genes_groups(adata_de, n_genes=20, sharey=False)
# Convert DE results into a tidy DataFrame for custom plots
df = cx.pl.rank_genes_groups_df(adata_de, group="perturbation_A", n_genes=200)
# Volcano and top-genes plots
cx.pl.volcano(de_df=df, group="perturbation_A")
cx.pl.top_genes_bar(de_df=df, group="perturbation_A", topn=15)
# MA plot using raw counts or normalized log1p means
cx.pl.ma(
data=adata_ro, # raw counts
de_result=adata_de,
group="perturbation_A",
reference="control",
perturbation_column="perturbation",
mean_mode="raw", # or "log1p"
)
# QC plotting (composition + summary distributions)
qc = cx.pp.qc_summary(
adata_ro,
perturbation_column="perturbation",
min_genes=100,
min_cells_per_perturbation=15,
min_cells_per_gene=10,
)
cx.pl.qc_perturbation_counts(
data=adata_ro,
perturbation_column="perturbation",
cell_mask=qc.cell_mask,
)
cx.pl.qc_summary(qc, min_genes=100, min_cells_per_gene=10)
NB-GLM options
The negative binomial GLM (method="nb_glm") supports several options:
Adaptive chunk size (default): When
chunk_size=None(the default), crispyx automatically calculates an optimal chunk size based on dataset dimensions andmemory_limit_gb. Small/medium datasets use the maximum chunk size (256) for speed, while large memory-constrained datasets use smaller chunks to avoid OOM errors. You can still setchunk_sizeexplicitly to override automatic selection.Frozen control mode (
freeze_control): For datasets with large control populations (>100K cells), the control matrix can consume 30+ GB of memory. Whenfreeze_control=True, control statistics are pre-computed once and shared across workers via memory-mapped files, reducing per-worker memory from ~32 GB to <1 GB. This enables full parallelization on large datasets.Auto-detection (default): When
freeze_control=None, crispyx automatically enables frozen control mode when:Control matrix exceeds 10 GB (control_n × n_genes × 8 bytes > 10 GB)
Standard mode would limit parallelization to <4 workers
This means large datasets like Feng (110K control cells) automatically use frozen control mode without user intervention, while smaller datasets maintain full flexibility.
LFC shrinkage: For improved accuracy, apply adaptive Cauchy prior shrinkage to log-fold changes using
shrink_lfc()after runningnb_glm_test(). This preserves large effects while shrinking small/uncertain effects toward zero.Dispersion sharing (
share_dispersion=True): Estimate dispersion once using all cells (similar to PyDESeq2’s approach). This provides more stable estimates when sample sizes are small or when you expect homogeneous dispersion across perturbations.Memory limit (
memory_limit_gb): Specify the maximum memory available for the analysis. Fornb_glm_test(), this affects chunk size calculation and worker count estimation. Forwilcoxon_test(), it controls whether the streaming path is used for large datasets (>30% of budget triggers streaming). Fort_test(), it controls automatic cell chunk size calculation. Forshrink_lfc(), it limits parallel workers inmethod="full". For HPC environments with fixed allocations, set this to your SLURM--memvalue (e.g.,memory_limit_gb=128).Scanpy format (
scanpy_format=True): Write Scanpy-compatibleuns["rank_genes_groups"]structure for interoperability withsc.get.rank_genes_groups_df()and similar utilities. This option is available fort_test(),wilcoxon_test(), andnb_glm_test(). Default isFalsefor performance.
# Basic NB-GLM (faster, per-perturbation dispersion)
adata_de = cx.tl.rank_genes_groups(
adata_ro,
perturbation_column="perturbation",
method="nb_glm",
)
# NB-GLM with shared dispersion
adata_de = cx.tl.rank_genes_groups(
adata_ro,
perturbation_column="perturbation",
method="nb_glm",
share_dispersion=True,
)
LFC shrinkage
For more accurate log-fold change estimates, apply apeGLM shrinkage after running the NB-GLM test. This two-step workflow matches DESeq2/PyDESeq2 best practices:
# Step 1: Run NB-GLM test
result = cx.nb_glm_test(
adata_ro,
perturbation_column="perturbation",
)
# Step 2: Apply LFC shrinkage
shrunk = cx.shrink_lfc(
result.result_path,
prior_scale_mode="global", # or "per_comparison"
)
The prior_scale_mode parameter controls how the shrinkage prior is estimated:
"global"(default): Estimate a single prior scale from all comparisons. Recommended for CRISPR screens with many perturbations."per_comparison": Estimate prior scale separately for each perturbation. May be more accurate when effect sizes vary substantially across perturbations.
You can also use cx.tl.shrink_lfc() for API consistency with other tools:
shrunk = cx.tl.shrink_lfc(
result.result_path,
prior_scale_mode="global",
)
Auto-reload and re-run control
All three DE functions (wilcoxon_test, t_test, nb_glm_test) write
their results to an .h5ad file and remember the output path. On a
subsequent call with the same parameters, if the output file already exists
the function loads and returns the saved result instead of rerunning the
analysis — making notebook and script reruns instant:
# First call — runs the full analysis and writes crispyx_wilcoxon.h5ad
result = cx.wilcoxon_test(
"data.h5ad",
perturbation_column="perturbation",
verbose=True,
)
# Second call — reloads from disk in milliseconds
result = cx.wilcoxon_test(
"data.h5ad",
perturbation_column="perturbation",
verbose=True,
)
# [crispyx] Loading existing result: data/crispyx_wilcoxon.h5ad
# [crispyx] Pass force=True to rerun the analysis.
To rerun unconditionally (e.g. after changing min_pct_ctrl,
memory_limit_gb, or any other parameter), pass force=True:
result = cx.wilcoxon_test(
"data.h5ad",
perturbation_column="perturbation",
min_pct_ctrl=0.05,
min_pct_pert=0.05,
force=True, # overwrite the existing file
)
Note
The auto-reload check is based purely on file existence, not on a
comparison of the parameters used to produce it. Always pass force=True
when you intentionally run with different settings.
Resume and checkpointing
Long-running differential expression analyses can be resumed after interruption
using the resume and checkpoint_interval parameters:
# Enable checkpointing during long runs
result = cx.nb_glm_test(
adata_ro,
perturbation_column="perturbation",
resume=True,
checkpoint_interval=10, # Save progress every 10 perturbations
)
If interrupted, simply re-run the same command - completed perturbations will
be skipped automatically. The checkpoint file <output>_progress.json is
written atomically to prevent corruption.
Benchmarking
The benchmarking directory ships with a reusable script that measures time
and memory usage across the main methods:
cd benchmarking
./run_benchmark.sh config/Adamson.yaml
See Benchmarking for configuration options and output structure.
Data Preparation Utilities
crispyx 0.7.5 adds five utility families for cleaning heterogeneous datasets before running QC or differential expression.
Editing backed metadata without loading X
Load only the obs or var table from a backed file, edit it in Python,
and write it back — without ever reading the expression matrix:
# Load obs metadata (X is never read)
obs = cx.load_obs("data/counts.h5ad")
obs["batch"] = obs["batch"].str.upper()
# Write back (must have the same number of rows)
cx.write_obs("data/counts.h5ad", obs)
# Same for var
var = cx.load_var("data/counts.h5ad")
var["gene_symbols"] = var["gene_symbols"].str.upper()
cx.write_var("data/counts.h5ad", var)
Gene name standardisation
Normalise Ensembl version suffixes, mitochondrial prefixes, and optionally
map IDs to HGNC symbols via mygene (pip install mygene):
# Strip version suffix + mt normalisation (in-place)
cx.standardise_gene_names(
"data/counts.h5ad",
column="ensembl_id", # var column; None uses var_names
strip_version=True, # ENSG00000123.4 → ENSG00000123
normalise_mt_prefix=True, # mt-ND1 → MT-ND1
)
# Online Ensembl → symbol lookup (returns Series without modifying file)
symbols = cx.standardise_gene_names(
"data/counts.h5ad",
column="ensembl_id",
lookup_symbols=True,
species="human",
unmapped_action="warn",
inplace=False,
)
Perturbation label normalisation
Strip guide prefixes/suffixes and unify diverse control labels:
cx.normalise_perturbation_labels(
"data/counts.h5ad",
column="perturbation",
strip_prefixes=["sg-", "sg"],
strip_suffixes=["_KO", "_KD", "_P1P2"],
canonical_control="NTC", # maps ctrl/scramble/NTC/non-targeting → NTC
)
# Or return normalised labels without writing
labels = cx.normalise_perturbation_labels(
"data/counts.h5ad",
column="perturbation",
inplace=False,
)
Auto-detecting metadata columns
Let crispyx infer which obs/var columns hold perturbation labels and gene symbols:
# Detect individually
pert_col = cx.detect_perturbation_column("data/counts.h5ad")
gene_col = cx.detect_gene_symbol_column("data/counts.h5ad")
# Or in one call
cols = cx.infer_columns("data/counts.h5ad")
print(cols)
# {"perturbation_column": "perturbation", "gene_name_column": "gene_symbols"}
# Pass detected columns directly to downstream functions
adata = cx.tl.rank_genes_groups(
"data/counts.h5ad",
perturbation_column=cols["perturbation_column"],
gene_name_column=cols["gene_name_column"],
method="wilcoxon",
)
Overlap analysis
Compare sets of genes or perturbations across datasets:
result = cx.tl.compute_overlap({
"Adamson": set(adamson_genes),
"Replogle": set(replogle_genes),
"Nadig": set(nadig_genes),
})
print(result.jaccard_matrix)
print(result.count_matrix)
print(result.set_sizes)
# Plot as a heatmap
ax = cx.pl.overlap_heatmap(result, metric="jaccard", cmap="Blues")
ax = cx.pl.overlap_heatmap(result, metric="count", annot=True, fmt="d")