crispyx

Genome-wide CRISPR screens routinely produce datasets with hundreds of thousands of cells and tens of thousands of genes. Standard single-cell toolkits load the entire count matrix into memory, which can require 30–100+ GB of RAM. crispyx streams data directly from on-disk AnnData .h5ad files so that quality control, normalisation, pseudo-bulk aggregation, and differential expression all run without materialising the full matrix — even the largest screens can be processed with modest resources.

The API mirrors Scanpy (cx.pp, cx.pb, cx.tl, cx.pl) so existing workflows can migrate with minimal changes. See the tutorial for an end-to-end walkthrough.

Key features

Streaming QC & preprocessing — filter and normalise without loading the full matrix
Pseudo-bulk aggregation — average log expression and pseudo-bulk count matrices
Differential expression — t-test, Wilcoxon, NB-GLM with apeGLM LFC shrinkage
Dimension reduction — memory-efficient PCA and KNN on backed data
Scanpy-compatible API — familiar namespaces and plotting helpers
HPC-ready — resume/checkpoint, configurable memory limits, Docker and Singularity

Reference

Development