Benchmarking
The benchmarking toolkit automates comparisons between the streaming
implementations in this project. Supply your own data (or generate a synthetic
demo dataset with python benchmarking/generate_demo_dataset.py) and run the
benchmark suite against any compatible .h5ad file that exposes
perturbation and gene_symbols columns.
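A quick pre-flight check can catch a missing column before a long benchmark run starts. The following is a minimal sketch, not part of the toolkit: `missing_columns` is a hypothetical helper, and it assumes that `perturbation` is a per-cell (`obs`) column and `gene_symbols` a per-gene (`var`) column, which the text above does not state explicitly.

```python
# Assumption: perturbation labels are per-cell, gene symbols are per-gene.
REQUIRED_OBS = ("perturbation",)
REQUIRED_VAR = ("gene_symbols",)


def missing_columns(obs_columns, var_columns):
    """Return the required column names that are absent from the dataset."""
    obs_columns, var_columns = set(obs_columns), set(var_columns)
    missing = [f"obs['{c}']" for c in REQUIRED_OBS if c not in obs_columns]
    missing += [f"var['{c}']" for c in REQUIRED_VAR if c not in var_columns]
    return missing


# With anndata installed, the column names could be read from a file like so
# (the file name is a placeholder):
#   import anndata as ad
#   adata = ad.read_h5ad("my_dataset.h5ad", backed="r")
#   problems = missing_columns(adata.obs.columns, adata.var.columns)
problems = missing_columns(["perturbation", "n_counts"], ["gene_ids"])
print(problems)  # lists every required column that was not found
```

An empty list means the dataset should be compatible under the stated assumption.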
Running benchmarks
cd benchmarking
./run_benchmark.sh config/Adamson.yaml # single dataset
./run_benchmark.sh config/*.yaml # all datasets
Outputs
Resource usage measurements stored in benchmarking/results/benchmark_results.csv.
A GitHub-friendly table at benchmarking/results/benchmark_results.md.
Intermediate AnnData files written directly to benchmarking/results (or any directory provided via --output-dir).
Comprehensive differential expression parity metrics: max absolute differences, Pearson/Spearman correlations, top-k overlaps (k=50 by default), and AUROC values when ground-truth labels are present. These are captured in both the CSV and Markdown summaries so ranking agreement is easy to audit alongside effect size deviations.
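To make the parity metrics concrete, here is a small pure-Python sketch of three of them. It illustrates the general definitions only; the exact formulas used by the benchmark scripts are not shown in this document, and ranking genes by absolute score for the top-k overlap is an assumption. The function names and toy values are hypothetical.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two result vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def top_k_overlap(genes, scores_a, scores_b, k=50):
    """Fraction of genes shared by both top-k lists (assumed: ranked by |score|)."""
    def top(scores):
        ranked = sorted(zip(genes, scores), key=lambda gs: abs(gs[1]), reverse=True)
        return {gene for gene, _ in ranked[:k]}

    return len(top(scores_a) & top(scores_b)) / min(k, len(genes))


# Toy effect sizes from a reference method and a streaming method.
genes = ["A", "B", "C", "D"]
reference = [2.0, -1.5, 0.1, 0.3]
streaming = [1.9, -1.6, 0.4, 0.2]

print(max_abs_diff(reference, streaming))
print(pearson(reference, streaming))
print(top_k_overlap(genes, reference, streaming, k=2))
```

Max absolute difference and correlation describe effect size agreement, while the top-k overlap describes ranking agreement, which is why the summaries report them side by side.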
The script accepts additional options to benchmark a subset of methods or to
redirect outputs to a different directory. Refer to benchmarking/README.md
for further details.
Available benchmark methods
crispyx: crispyx_qc_filtered, crispyx_preprocess,
crispyx_pb_avg_log, crispyx_pb_pseudobulk, crispyx_de_t_test,
crispyx_de_wilcoxon, crispyx_de_nb_glm
Reference: scanpy_qc_filtered, scanpy_de_t_test,
scanpy_de_wilcoxon, edger_de_glm, pertpy_de_pydeseq2
The crispyx_preprocess step normalizes the QC-filtered output with streaming
normalize_total_log1p() and is a prerequisite for all t-test and Wilcoxon DE
methods (both crispyx and Scanpy). The execution order is enforced via
depends_on: QC → preprocess → DE.
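The ordering that depends_on implies can be illustrated with a topological sort. This is a sketch using Python's standard graphlib module; the dependency mapping below is written out by hand for illustration and is an assumption about how the methods relate, not a copy of the project's configuration or scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each method lists the methods it depends on.
depends_on = {
    "crispyx_qc_filtered": [],
    "crispyx_preprocess": ["crispyx_qc_filtered"],
    "crispyx_de_t_test": ["crispyx_preprocess"],
    "crispyx_de_wilcoxon": ["crispyx_preprocess"],
}

# static_order() yields every method after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Any valid order places QC first, preprocessing second, and the DE methods afterwards; the relative order of the two DE methods is not constrained.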
Available NB-GLM benchmark methods
The benchmark suite includes the following NB-GLM variants:
crispyx_de_nb_glm: Standard NB-GLM differential expression
crispyx_de_nb_glm_shrunk: NB-GLM with apeGLM LFC shrinkage (recommended)
The shrinkage variant applies adaptive Cauchy prior shrinkage to log-fold changes, which improves accuracy by preserving large effects while shrinking uncertain estimates toward zero.
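The behaviour described above can be seen in a one-dimensional toy model. The sketch below is not the crispyx or apeGLM implementation: it replaces the negative binomial likelihood with a normal approximation around the unshrunk estimate, uses a fixed prior scale instead of an adaptively estimated one, and finds the posterior mode by grid search. The function name `shrink_lfc` and all numeric values are hypothetical.

```python
import math


def shrink_lfc(lfc_hat, se, prior_scale=1.0, step=0.001):
    """Posterior mode under a normal likelihood and a Cauchy(0, prior_scale) prior."""

    def neg_log_posterior(beta):
        likelihood = (beta - lfc_hat) ** 2 / (2 * se ** 2)
        prior = math.log(1 + (beta / prior_scale) ** 2)
        return likelihood + prior

    # The mode lies between 0 and lfc_hat, so this grid is wide enough.
    bound = abs(lfc_hat) + 1.0
    n_steps = int(2 * bound / step)
    grid = (-bound + i * step for i in range(n_steps + 1))
    return min(grid, key=neg_log_posterior)


# A large, precisely estimated effect is barely changed ...
print(shrink_lfc(lfc_hat=5.0, se=0.5))
# ... while a small, noisy estimate is pulled strongly toward zero.
print(shrink_lfc(lfc_hat=0.5, se=2.0))
```

The heavy tails of the Cauchy prior are what allow large, well-supported effects to escape shrinkage, in contrast to a normal prior, which would shrink all estimates proportionally.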
Results summary
See benchmarking/benchmark_summary.md for a human-readable summary of performance and accuracy results across all 12 benchmark datasets, including runtime tables, memory tables, and accuracy metrics.
For a comparison of crispyx against Scanpy, Pertpy/PyDESeq2, and edgeR, see Comparison with other tools.