FAQ & Troubleshooting

Common issues

MemoryError or out-of-memory kills

Set memory_limit_gb to match your available RAM:

result = cx.tl.rank_genes_groups(
    adata,
    perturbation_column="perturbation",
    method="nb_glm",
    memory_limit_gb=32,  # set to your SLURM --mem value
)

For very large datasets, consider:

  • Converting to CSC before Wilcoxon (see Usage Guide).

  • Converting to CSR before NB-GLM.

  • Using freeze_control=True for datasets with >100K control cells.

When should I use CSC vs CSR format?

  • CSC (Compressed Sparse Column): Use for Wilcoxon rank-sum tests, which iterate over gene columns. Convert with crispyx.convert_to_csc().

  • CSR (Compressed Sparse Row): Use for NB-GLM, size factors, and operations that iterate over cells (rows). Convert with crispyx.convert_to_csr().

The benchmark pipeline handles this automatically, but manual workflows should convert before calling DE functions.

tomllib / tomli import errors

If building docs on Python 3.10, install the backport:

pip install tomli

Python 3.11+ includes tomllib in the standard library.

Control label not detected

crispyx auto-detects control labels (ctrl, NTC, scramble, etc.). If your dataset uses a non-standard label, pass it explicitly:

adata = cx.pp.qc_summary(
    adata,
    perturbation_column="perturbation",
    control_label="my_control_name",
)

Or use crispyx.normalise_perturbation_labels() to canonicalise labels before analysis.

UserWarning: CSC storage detected during NB-GLM

NB-GLM requires CSR format. Convert first:

adata_csr = cx.pp.convert_to_csr(adata, output_dir="results/")
result = cx.nb_glm_test(adata_csr, perturbation_column="perturbation")

HPC / SLURM tips

  • Set memory_limit_gb to your SLURM --mem allocation.

  • Use resume=True and checkpoint_interval=10 for long jobs that may be preempted.

  • drop_file_cache() is called automatically to prevent cgroup-cached pages from counting toward memory limits.

  • See benchmarking/singularity/ for SLURM submission scripts.

My DE result is loaded instantly on the second call — is that expected?

Yes. Since v0.0.3 all three DE functions auto-reload an existing result file instead of rerunning the analysis. When verbose=True a notice is printed:

[crispyx] Loading existing result: data/crispyx_wilcoxon.h5ad
[crispyx] Pass force=True to rerun the analysis.

If you changed a parameter (e.g. min_pct_ctrl, min_pct_pert, a covariate list, or dispersion_scope) and want the result to reflect the new settings, pass force=True to the DE function. The existing output file will be overwritten.

Can I pickle / serialise a RankGenesGroupsResult?

Yes, since v0.0.3. The RecursionError that occurred when calling pickle.dumps on a result object is fixed. The on-disk HDF5 handle is excluded from the pickle payload and reopened lazily after unpickling:

import pickle
result = cx.wilcoxon_test("data.h5ad", perturbation_column="perturbation")

data = pickle.dumps(result)        # no RecursionError
restored = pickle.loads(data)      # works
# restored.result is None — no open file handle after unpickling.
# Access restored["KO1"].pvalue etc. normally.

Note that restored.result is None after unpickling. If you need the backed AnnData reference (e.g. to call result.result_path), re-open it:

from crispyx.data import AnnData
restored.result = AnnData(original_output_path)

Performance tips

  • Pre-convert matrix formats before DE: CSC for Wilcoxon, CSR for NB-GLM. This avoids O(total_nnz × n_chunks) scans.

  • Use ``freeze_control=True`` for datasets with >100K control cells to reduce per-worker memory from ~32 GB to <1 GB.

  • Increase ``n_jobs`` for multi-core NB-GLM on machines with sufficient RAM.

  • Use adaptive chunk sizes (the default): let crispyx calculate optimal chunk sizes based on your memory_limit_gb.

Comparison questions

When should I use crispyx instead of Scanpy?

Use crispyx when your dataset does not fit in RAM, when you are running on an HPC system with a memory limit, or when you want a streaming on-disk pipeline. crispyx produces results identical to Scanpy for t-test and Wilcoxon DE (Pearson r > 0.9999). For datasets that fit in RAM and where you need Scanpy’s broader ecosystem, use Scanpy.

Can I use crispyx instead of Pertpy or PyDESeq2 for NB-GLM?

Yes. crispyx implements a negative binomial GLM that is approximately 2× faster than Pertpy/PyDESeq2 and uses far less memory on genome-wide datasets. Results agree with PyDESeq2 (Pearson r > 0.97 for LFC estimates). crispyx does not implement the full PyDESeq2 feature set (custom design matrices, Cook’s outlier filtering, etc.). For large genome-wide screens where PyDESeq2 runs out of memory, crispyx is currently the only practical Python option.

Does crispyx replace the full Pertpy workflow?

No. crispyx focuses on QC, normalization, pseudobulk, and differential expression for CRISPR screens. Pertpy provides many additional perturbation analysis methods (Augur, Mixscape, CINEMA-OT, etc.) that are outside the scope of crispyx. For large screens, you can use crispyx for the memory-intensive DE steps and Pertpy for downstream perturbation-specific analyses.

See Comparison with other tools for a full side-by-side comparison.