API Reference 

compute_overlap(sets_dict, *, metric='both')[source]: Compute pairwise overlap statistics. See crispyx.compute_overlap().

Plotting (`cx.pl`)

class crispyx._namespaces._PlottingNamespace[source]

Scanpy-style plotting entry points (cx.pl).

rank_genes_groups(data, **kwargs)[source]

rank_genes_groups_df(data, group, **kwargs)[source]

volcano(**kwargs)[source]

ma(**kwargs)[source]

top_genes_bar(**kwargs)[source]

qc_perturbation_counts(**kwargs)[source]

qc_summary(qc_result, **kwargs)[source]

materialize_rank_genes_groups(data, **kwargs)[source]

pca(data, **kwargs)[source]: Plot PCA scatter.

pca_variance_ratio(data, **kwargs)[source]: Plot PCA variance ratio.

pca_loadings(data, **kwargs)[source]: Plot PCA loadings.

umap(data, **kwargs)[source]: Plot UMAP embedding.

overlap_heatmap(result, **kwargs)[source]: Plot pairwise overlap heatmap. See crispyx.plot_overlap_heatmap().

Module Reference

The sections below document the underlying modules used by the namespace API. These are useful for advanced usage or for understanding parameter details.

Data loading and utilities

Helpers for working with AnnData .h5ad files in a streaming friendly way.

crispyx.data.drop_file_cache(path)[source]

Advise the kernel to drop page cache for path (Linux only).

On cgroup-limited systems (SLURM), page-cache pages count toward the memory limit. Calling this after a streaming read prevents the cached file data from consuming the cgroup budget.

The call is a no-op on non-Linux platforms or when the file cannot be opened.

Parameters:: path (str | Path)
Return type:: None

crispyx.data.is_dense_storage(path)[source]

Check if h5ad file stores X matrix as dense array.

Parameters:: path (str | Path) – Path to h5ad file.
Returns:: True if X is stored as dense array, False if sparse (CSR/CSC).
Return type:: bool

crispyx.data.get_matrix_storage_format(path)[source]

Detect matrix storage format in h5ad file.

Parameters:: path (str | Path) – Path to h5ad file.
Returns:: One of: ‘csr’, ‘csc’, ‘dense’
Return type:: str

class crispyx.data.AnnData(path, *, mode='r')[source]

Bases: object

Thin wrapper around a backed anndata.AnnData handle.

Parameters:

path (str | Path)
mode (str)

property filename: str: Return the underlying filename for compatibility with Scanpy.

property backed: AnnData: Return the lazily opened backed AnnData handle.

close()[source]

Close the underlying file handle if it is open.

Return type:: None

to_memory()[source]

Materialise the backed AnnData into memory.

Return type:: AnnData

property obs: _LazyFrameAccessor: Lazy accessor for observation (cell) metadata.

property var: _LazyFrameAccessor: Lazy accessor for variable (gene) metadata.

property uns: _LazyUnsMapping: Lazy accessor for unstructured annotations.

crispyx.data.read_backed(path)[source]

Open an .h5ad file in backed mode for low-memory access.

Parameters:: path (str | Path)
Return type:: AnnData

crispyx.data.write_obsm_to_h5ad(path, key, data)[source]

Write a dense array to obsm/{key} in an h5ad file.

Parameters:

path (str | Path) – Path to the h5ad file.
key (str) – Key under obsm (e.g., ‘X_pca’).
data (ndarray) – Dense numpy array of shape (n_obs, n_dims).

Return type:

None

crispyx.data.write_varm_to_h5ad(path, key, data)[source]

Write a dense array to varm/{key} in an h5ad file.

Parameters:

path (str | Path) – Path to the h5ad file.
key (str) – Key under varm (e.g., ‘PCs’).
data (ndarray) – Dense numpy array of shape (n_vars, n_dims).

Return type:

None

crispyx.data.write_uns_dict_to_h5ad(path, key, data)[source]

Write a dict to uns/{key} in an h5ad file.

Handles scalar values, numpy arrays, and nested dicts. Uses AnnData-compatible encoding for proper round-trip compatibility.

Parameters:

path (str | Path) – Path to the h5ad file.
key (str) – Key under uns (e.g., ‘pca’).
data (dict) – Dictionary with string keys and scalar/array values.

Return type:

None

crispyx.data.write_obsp_to_h5ad(path, key, data)[source]

Write a sparse matrix to obsp/{key} in an h5ad file.

Stores in CSR format following AnnData conventions.

Parameters:

path (str | Path) – Path to the h5ad file.
key (str) – Key under obsp (e.g., ‘distances’, ‘connectivities’).
data (spmatrix) – Sparse matrix of shape (n_obs, n_obs).

Return type:

None

crispyx.data.resolve_data_path(data, *, require_exists=True)[source]

Resolve the on-disk path for a backed AnnData object or path-like input.

This utility supports flexible input types for crispyx functions, allowing users to pass either a file path or an AnnData object.

Parameters:

data (str | Path | 'AnnData' | ad.AnnData) – One of: - A string or Path to an h5ad file - A crispyx.AnnData wrapper (has .path attribute) - A backed anndata.AnnData object (has .filename attribute)
require_exists (bool) – If True (default), verify the resolved path exists.

Returns:

The resolved file path.

Return type:

Path

Raises:

TypeError – If data is an in-memory (non-backed) AnnData or unsupported type.
FileNotFoundError – If require_exists is True and the path does not exist.

Examples

>>> from crispyx.data import resolve_data_path
>>> path = resolve_data_path("data/counts.h5ad")
>>> path = resolve_data_path(adata_wrapper)
>>> path = resolve_data_path(backed_adata)

crispyx.data.resolve_output_path(input_path, *, suffix, data_name=None, output_path=None, output_dir=None)[source]

Construct an informative output path for an intermediate .h5ad file.

Output names follow the pattern {stem}_cx_{suffix}.h5ad where stem is the input file’s stem (or data_name when provided). The _cx_ token sits between the input identity and the processing step so that:

filenames sort naturally by input first, then by operation;
the producing tool is unambiguously marked without cluttering the start of the filename.

When output_path is provided it is returned directly, bypassing all name computation. This lets callers enforce an exact output filename.

Deprecated since version ``output_dir``: is kept for backward compatibility and will be removed in the next major version. Use output_path instead.

Parameters:

input_path (str | Path)
suffix (str)
data_name (str | None)
output_path (str | Path | None)
output_dir (str | Path | None)

Return type:

Path

crispyx.data.ensure_gene_symbol_column(adata, gene_name_column)[source]

Return a vector of gene symbols and verify they look like symbols, not Ensembl IDs.

Parameters:

adata (ad.AnnData | ad._core.anndata.AnnDataMixin)
gene_name_column (str | None)

Return type:

pd.Index

crispyx.data.resolve_control_label(labels, control_label, *, verbose=True)[source]

Return an explicit control label, inferring one when necessary.

Parameters:

labels (Sequence[str])
control_label (str | None)
verbose (bool)

Return type:

str

crispyx.data.read_h5ad_ondisk(path, *, n_obs=5, n_vars=5)[source]

Open an .h5ad file on disk, print a preview, and return a read-only view.

Parameters:

path (str | Path)
n_obs (int)
n_vars (int)

Return type:

AnnData

crispyx.data.iter_matrix_chunks(adata, *, axis=0, chunk_size=1024, convert_to_dense=True)[source]

Yield chunks of the expression matrix.

Parameters:

adata (ad.AnnData | ad._core.anndata.AnnDataMixin)
axis (int)
chunk_size (int)
convert_to_dense (bool)

Return type:

Iterator[tuple[slice, np.ndarray | sp.spmatrix]]

crispyx.data.normalize_total_block(block, *, library_size=None, target_sum=10000.0, dtype=<class 'numpy.float64'>)[source]

Return a library-size normalised dense view of block.

Parameters:

block (ndarray | spmatrix) – A slice of the expression matrix with shape (n_cells, n_genes).
library_size (ndarray | None) – Optional precomputed library sizes for the cells in block. When None the library size is computed from block directly.
target_sum (float) – Target total counts per cell after normalisation, matching the default used by scanpy.pp.normalize_total().
dtype (dtype | type) – Data type for the output dense array. Defaults to float64.

Returns:

A tuple (normalised, library_size) where normalised is a dense array containing the normalised counts and library_size contains the per-cell library sizes that were used.

Return type:

tuple

crispyx.data.write_filtered_subset(source_path, *, cell_mask, gene_mask, output_path, chunk_size=4096, var_assignments=None, row_nnz=None, total_nnz=None, data_dtype=None, chunk_cache=None)[source]

Stream a filtered AnnData view to disk without materialising X.

Parameters:

source_path (str | Path) – Path to source h5ad file.
cell_mask (ndarray) – Boolean mask for cells to include.
gene_mask (ndarray) – Boolean mask for genes to include.
output_path (str | Path) – Path for output h5ad file.
chunk_size (int) – Number of cells to process per chunk.
var_assignments (dict[str, Sequence] | None) – Optional dict of column assignments for var DataFrame.
row_nnz (ndarray | None) – Optional pre-computed non-zero counts per row. When provided along with total_nnz and data_dtype, skips the counting pass.
total_nnz (int | None) – Optional pre-computed total non-zero count.
data_dtype (dtype | None) – Optional pre-computed data type for the sparse matrix.
chunk_cache (Any | None) – Optional _ChunkCache object from qc module. When provided, reads CSR data from cache instead of re-reading the source matrix.

Return type:

None

crispyx.data.normalize_total_log1p(data, output_path=None, *, normalize=True, log1p=True, target_sum=10000.0, chunk_size=4096, output_dir=None, data_name=None, format_mismatch_policy='warn', verbose=True)[source]

Stream normalize and/or log-transform an h5ad file without loading it fully into memory.

This function processes the source file in chunks, optionally applying: 1. Total-count normalization (scanpy.pp.normalize_total equivalent) 2. Log1p transformation (scanpy.pp.log1p equivalent)

The output is written as a sparse CSR matrix. This is the streaming equivalent of calling scanpy.pp.normalize_total followed by scanpy.pp.log1p.

Parameters:

data (str | Path | 'AnnData' | ad.AnnData) – Path to source h5ad file, or a backed AnnData object.
output_path (str | Path | None) – Path for output h5ad file. If None, uses output_dir/data_name pattern.
normalize (bool) – Whether to apply total-count normalization. Default True.
log1p (bool) – Whether to apply log1p transformation. Default True.
target_sum (float) – Target total counts per cell after normalization. Default 1e4. Only used if normalize=True.
chunk_size (int) – Number of cells to process per chunk. Default 4096.
output_dir (str | Path | None) – Directory for output file. Defaults to input file’s directory.
data_name (str | None) – Custom name for output file. If None, uses “normalized” suffix.
format_mismatch_policy (Literal['warn', 'convert', 'off']) –
How to handle a source stored as CSC, whose row-(cell-)streaming here is O(total_nnz) per chunk and can be ~100x slower than CSR: - "warn" (default): proceed but log a single actionable warning. - "convert": transparently convert the source to CSR in a temporary

file (bounded-memory two-pass streaming) and stream from that; the temporary file is removed before returning.
- "off": proceed silently with no warning.
verbose (bool) – Print progress information.

Returns:

Read-only AnnData wrapper pointing to the output file.

Return type:

AnnData

Examples

>>> # Full normalization + log1p (default)
>>> adata_norm = cx.pp.normalize_total_log1p(adata_ro, output_dir=OUTPUT_DIR, data_name="normalized")

>>> # Only log1p (no normalization)
>>> adata_log = cx.pp.normalize_total_log1p(adata_ro, normalize=False, output_dir=OUTPUT_DIR)

>>> # Only normalization (no log1p)
>>> adata_norm = cx.pp.normalize_total_log1p(adata_ro, log1p=False, output_dir=OUTPUT_DIR)

>>> # Use explicit output path
>>> adata_norm = cx.pp.normalize_total_log1p(adata_ro, "results/normalized.h5ad")

>>> # Auto-convert a CSC source to CSR for fast cell-streaming
>>> adata_norm = cx.pp.normalize_total_log1p(csc_path, format_mismatch_policy="convert")

crispyx.data.convert_to_csc(data, *, output_path=None, chunk_size=None, output_dir=None, data_name=None, verbose=True)[source]

Convert a backed h5ad file’s matrix from CSR (or dense) to CSC format.

CSC format allows O(nnz_in_chunk) column-slicing instead of O(total_nnz) that CSR requires. This is required for efficient Wilcoxon rank-sum testing, which iterates over gene chunks with axis=1 access patterns.

The conversion is done in two streaming passes over the source file so the peak memory is bounded by total_nnz × sizeof(float32 + row_dtype) bytes (the output buffers) plus one row-chunk working buffer. The result is written to disk in a single sequential write which is as fast as possible on HDD/SSD.

If the input file is already CSC, no file is written; the function returns a backed AnnData pointing to the original file.

Parameters:

data (str | Path | 'AnnData' | ad.AnnData) – Path to source h5ad file (CSR or dense), or a backed AnnData.
output_path (str | Path | None) – Explicit path for the output file. If None, a path is derived from output_dir/data_name with "_csc" appended to the stem.
chunk_size (int | None) – Number of rows (cells) to read at a time during both passes. Default 4096.
output_dir (str | Path | None) – Directory for the output file. Defaults to the source file’s directory.
data_name (str | None) – Custom name used when building the output filename.
verbose (bool) – Print progress messages.

Returns:

Backed (read-only) AnnData pointing to the written CSC h5ad file, or to the source file if it was already CSC.

Return type:

AnnData

Examples

>>> adata_csc = cx.pp.convert_to_csc(preprocessed_path, output_dir=OUTPUT_DIR)

crispyx.data.convert_to_csr(data, *, output_path=None, chunk_size=None, output_dir=None, data_name=None, verbose=True)[source]

Convert a backed h5ad file’s matrix from CSC (or dense) to CSR format.

CSR format allows O(nnz_in_chunk) row-slicing instead of O(total_nnz) that CSC requires. This is needed for efficient NB-GLM, size factor computation, and any operation that iterates over cell (row) chunks.

The conversion mirrors convert_to_csc(): two streaming passes over the source file so peak memory is bounded by total_nnz × (sizeof(float32) + sizeof(col_dtype)) bytes (the output buffers) plus one chunk working buffer.

If the input file is already CSR, no file is written; the function returns a backed AnnData pointing to the original file.

Parameters:

data (str | Path | 'AnnData' | ad.AnnData) – Path to source h5ad file (CSC or dense), or a backed AnnData.
output_path (str | Path | None) – Explicit path for the output file. If None, a path is derived from output_dir/data_name with "_csr" appended to the stem.
chunk_size (int | None) – Number of rows (cells) to read at a time during both passes. Default is calculated automatically.
output_dir (str | Path | None) – Directory for the output file. Defaults to the source file’s directory.
data_name (str | None) – Custom name used when building the output filename.
verbose (bool) – Print progress messages.

Returns:

Backed (read-only) AnnData pointing to the written CSR h5ad file, or to the source file if it was already CSR.

Return type:

AnnData

crispyx.data.calculate_optimal_chunk_size(n_obs, n_vars, available_memory_gb=None, safety_factor=8.0, min_chunk=512, max_chunk=4096)[source]

Calculate optimal chunk size based on dataset dimensions and available memory.

Parameters:

n_obs (int) – Number of observations (cells) in the dataset.
n_vars (int) – Number of variables (genes) in the dataset.
available_memory_gb (float | None) – Available memory in gigabytes. If None, auto-detects using psutil.
safety_factor (float) – Safety multiplier to account for overhead (default 8.0 for backed operations).
min_chunk (int) – Minimum chunk size to return (default 512).
max_chunk (int) – Maximum chunk size to return (default 4096).

Returns:

Recommended chunk size, clamped to [min_chunk, max_chunk].

Return type:

Examples

>>> calculate_optimal_chunk_size(100000, 20000, available_memory_gb=32)
2000

crispyx.data.calculate_optimal_gene_chunk_size(n_obs, n_vars, n_groups=None, available_memory_gb=None, safety_factor=8.0, memory_fraction=0.5, min_chunk=32, max_chunk=512)[source]

Calculate optimal gene chunk size for column-wise operations.

For operations that iterate over genes (columns), such as Wilcoxon tests, each chunk loads all cells for a subset of genes. Memory usage is dominated by n_obs × chunk_size rather than chunk_size × n_vars.

Enhanced to account for the number of perturbation groups, which significantly impacts memory usage due to output array allocation.

Parameters:

n_obs (int) – Number of observations (cells) in the dataset.
n_vars (int) – Number of variables (genes) in the dataset.
n_groups (int | None) – Number of perturbation groups. If provided, used to estimate memory for output arrays. Large group counts require smaller chunks.
available_memory_gb (float | None) – Available memory in gigabytes. If None, auto-detects using psutil.
safety_factor (float) – Safety multiplier to account for overhead (default 8.0).
memory_fraction (float) – Fraction of available memory to use (default 0.5). Leave headroom for memory-mapped arrays and system overhead.
min_chunk (int) – Minimum chunk size to return (default 32).
max_chunk (int) – Maximum chunk size to return (default 512).

Returns:

Recommended gene chunk size, clamped to [min_chunk, max_chunk].

Return type:

Examples

>>> calculate_optimal_gene_chunk_size(100000, 20000, available_memory_gb=32)
512
>>> calculate_optimal_gene_chunk_size(4000000, 38000, n_groups=18000, available_memory_gb=128)
64

crispyx.data.calculate_wilcoxon_chunk_size(n_obs, n_vars, *, available_memory_gb=None, min_chunk=32, max_chunk=4096)[source]

Calculate optimal gene chunk size for Wilcoxon rank-sum tests.

Unlike calculate_optimal_gene_chunk_size(), this function has no n_groups cap. Wilcoxon writes all output arrays (effect, pvalue, z-score, etc.) to on-disk memory-mapped files immediately, so peak RAM per chunk is independent of the number of perturbation groups. The only effective cap is a per-chunk transient-memory budget:

transient ≈ chunk_size × (n_obs × 4 + n_ctrl × 8 + n_pert × 4)
≈ chunk_size × n_obs × 12 bytes

The budget is set to 15 % of available_memory_gb so that a single chunk never exceeds ~1/7th of the node RAM.

Parameters:

n_obs (int) – Number of cells in the dataset.
n_vars (int) – Number of genes in the dataset (used only for logging).
available_memory_gb (float | None) – Available memory in GB. When None, detected via psutil. On HPC nodes, pass the SLURM --mem value so the cap reflects the actual job allocation rather than system-wide free memory.
min_chunk (int) – Floor for the returned chunk size (default 32).
max_chunk (int) – Ceiling for the returned chunk size (default 4096). The cell-budget cap is applied before this ceiling, so max_chunk is only active for small/sparse datasets where the cell cap would be very large.

Returns:

Recommended gene chunk size, clamped to [min_chunk, max_chunk].

Return type:

Examples

>>> # Feng-gwsnf: 396K cells, 128 GB → 4067
>>> calculate_wilcoxon_chunk_size(396458, 32373, available_memory_gb=128)
4067
>>> # Feng-ts: 1.16M cells, 128 GB → 1378
>>> calculate_wilcoxon_chunk_size(1161864, 33165, available_memory_gb=128)
1378

crispyx.data.calculate_nb_glm_chunk_size(n_obs, n_vars, n_groups=None, available_memory_gb=None, memory_limit_gb=None, safety_factor=8.0, memory_fraction=0.5, min_chunk=32, max_chunk=256)[source]

Calculate optimal gene chunk size for NB-GLM operations.

NB-GLM iterates over genes (columns) and for each chunk: - Loads dense count data: n_obs × chunk_size - Computes dispersion: requires cell-level statistics - Fits GLM: design matrix operations

This function is specifically tuned for NB-GLM memory patterns, which differ from Wilcoxon (ranking) and t-test (simple statistics).

Parameters:

n_obs (int) – Number of observations (cells) in the dataset.
n_vars (int) – Number of variables (genes) in the dataset.
n_groups (int | None) – Number of perturbation groups. If provided, used to estimate memory for output arrays and design matrix overhead.
available_memory_gb (float | None) – Available memory in gigabytes. If None, auto-detects using psutil.
memory_limit_gb (float | None) – Optional hard memory limit in GB. If provided, uses the minimum of available memory and this limit.
safety_factor (float) – Safety multiplier to account for overhead (default 8.0).
memory_fraction (float) – Fraction of available memory to use (default 0.5).
min_chunk (int) – Minimum chunk size to return (default 32).
max_chunk (int) – Maximum chunk size to return (default 256).

Returns:

Recommended gene chunk size, clamped to [min_chunk, max_chunk]. For datasets where memory is sufficient, returns max_chunk (256). Only reduces chunk size when memory would be exceeded.

Return type:

Examples

>>> calculate_nb_glm_chunk_size(100000, 20000, n_groups=100, available_memory_gb=128)
256
>>> calculate_nb_glm_chunk_size(1200000, 36000, n_groups=500, available_memory_gb=128)
143

crispyx.data.calculate_pca_chunk_size(n_obs, n_vars, n_comps=50, available_memory_gb=None, method='auto', memory_fraction=0.5, min_chunk=256, max_chunk=4096)[source]

Calculate optimal chunk size for streaming PCA.

PCA memory usage depends on the method: - sparse_cov: O(genes²) for covariance matrix, fast for small gene counts - incremental: O(chunk × genes) for data chunks, better for large gene counts

Parameters:

n_obs (int) – Number of observations (cells) in the dataset.
n_vars (int) – Number of variables (genes) in the dataset.
n_comps (int) – Number of principal components to compute. Default 50.
available_memory_gb (float | None) – Available memory in gigabytes. If None, auto-detects using psutil.
method (str) – PCA method: ‘auto’, ‘sparse_cov’, or ‘incremental’. ‘auto’ selects based on gene count and available memory.
memory_fraction (float) – Fraction of available memory to use (default 0.5).
min_chunk (int) – Minimum chunk size to return (default 256).
max_chunk (int) – Maximum chunk size to return (default 4096).

Returns:

(chunk_size, selected_method) where selected_method is ‘sparse_cov’ or ‘incremental’.

Return type:

tuple[int, str]

Examples

>>> calculate_pca_chunk_size(100000, 8000, available_memory_gb=32)
(2048, 'sparse_cov')
>>> calculate_pca_chunk_size(100000, 50000, available_memory_gb=32)
(1024, 'incremental')

crispyx.data.calculate_adaptive_qc_thresholds(adata, perturbation_column, mode='conservative', sample_size=10000, chunk_size=None)[source]

Calculate adaptive QC thresholds based on data distribution.

Uses percentile-based approach to determine appropriate QC parameters that retain most of the data while filtering outliers.

Parameters:

adata (AnnData) – AnnData object (can be backed).
perturbation_column (str) – Column in adata.obs containing perturbation labels.
mode (str) – ‘conservative’ (10th percentile, retains ~90%) or ‘aggressive’ (5th percentile, retains ~95%).
sample_size (int) – Maximum number of cells to sample for gene expression analysis.
chunk_size (int | None) – Optional fixed chunk size to use. If None, calculated adaptively.

Returns:

Dictionary with keys: min_genes, min_cells_per_perturbation, min_cells_per_gene, chunk_size.

Return type:

dict

Examples

>>> adata = ad.read_h5ad("data.h5ad", backed='r')
>>> thresholds = calculate_adaptive_qc_thresholds(adata, "perturbation")
>>> adata.file.close()

crispyx.data.standardize_dataset(dataset_path, perturbation_column, control_label, gene_name_column, output_dir, force=False)[source]

Standardize dataset column names and control labels with caching.

Creates a standardized copy of the dataset with: - perturbation_column renamed to ‘perturbation’ - control labels standardized to ‘control’ - gene_name_column set as var.index if specified

This function uses a streaming approach to avoid loading the X matrix into memory, making it suitable for very large datasets (>1M cells).

Standardized files are cached in {output_dir}/.cache/ and reused unless force=True.

Parameters:

dataset_path (Path | str) – Path to original dataset (.h5ad file).
perturbation_column (str) – Name of perturbation column in original dataset.
control_label (str | None) – Control label to standardize. If None, auto-detects.
gene_name_column (str | None) – Gene name column to use as var.index. If None, uses existing var.index.
output_dir (Path | str) – Directory for cached standardized files.
force (bool) – If True, regenerate standardized file even if cache exists.

Returns:

Path to standardized dataset (either cached or newly created).

Return type:

Path

Examples

>>> standardized_path = standardize_dataset(
...     "data/original.h5ad",
...     perturbation_column="gene",
...     control_label=None,
...     gene_name_column="gene_symbols",
...     output_dir="results",
...     force=False
... )

crispyx.data.needs_sorting_for_nbglm(path, perturbation_column='perturbation', *, min_cells=360000, min_perturbations=100, contiguity_threshold=0.5)[source]

Check if a dataset would benefit from sorting by perturbation for NB-GLM.

Large datasets with scattered cells benefit from having cells sorted by perturbation label, as this enables contiguous I/O reads instead of random access. This is especially important when the data is stored on HDD (rotational disk).

The default thresholds are based on I/O overhead analysis: - At ~100 IOPS (typical HDD), 360K cells = 1 hour of random I/O overhead - min_perturbations=100 ensures sufficient parallel workload to benefit - contiguity_threshold=0.5 catches scattered datasets

Parameters:

path (str | Path) – Path to h5ad file.
perturbation_column (str) – Column in obs containing perturbation labels.
min_cells (int) – Minimum number of cells for sorting to be recommended. Default: 360,000 (~1 hour of random I/O on HDD at 100 IOPS).
min_perturbations (int) – Minimum number of perturbations for sorting to be recommended.
contiguity_threshold (float) – If average contiguity is below this threshold, sorting is recommended. Contiguity is the fraction of a perturbation’s cells that are in a contiguous block (1.0 = perfectly contiguous, 0.0 = completely scattered).

Returns:

True if sorting is recommended, False otherwise.

Return type:

bool

crispyx.data.sort_by_perturbation(path, perturbation_column='perturbation', control_label=None, *, output_path=None, chunk_size=4096, force=False)[source]

Sort cells by perturbation label for contiguous I/O access.

Creates a new h5ad file with cells reordered so that all cells from each perturbation are contiguous. Control cells are placed first, followed by each perturbation group in alphabetical order. This enables efficient sequential reads when processing perturbations in parallel.

The function works in streaming mode to handle datasets larger than memory. Sorting information is stored in uns[‘sorting_metadata’].

Parameters:

path (str | Path) – Path to input h5ad file.
perturbation_column (str) – Column in obs containing perturbation labels.
control_label (str | None) – Label for control cells. If None, auto-detected from common patterns.
output_path (str | Path | None) – Path for output sorted file. If None, appends ‘_sorted’ to input name.
chunk_size (int) – Number of cells to process per chunk during streaming write.
force (bool) – If True, recreate sorted file even if it already exists.

Returns:

Path to the sorted h5ad file.

Return type:

Path

Examples

>>> sorted_path = sort_by_perturbation(
...     "data/large_dataset.h5ad",
...     perturbation_column="perturbation",
...     control_label="control",
... )
>>> # sorted_path is now "data/large_dataset_sorted.h5ad"

Notes

The sorted file contains additional metadata in uns[‘sorting_metadata’]: - original_path: Path to the original unsorted file - sort_order: Array mapping new indices to original indices - perturbation_boundaries: Dict mapping perturbation labels to (start, end) indices - sorted_by: The column used for sorting - timestamp: When the sorting was performed

For sparse inputs the output is always written in CSR format, since sorting benefits row-wise (per-perturbation) access patterns used by NB-GLM. CSC files used by Wilcoxon do not need perturbation sorting.

crispyx.data.get_perturbation_slice(adata_or_path, perturbation_label, perturbation_column='perturbation')[source]

Get slice for a perturbation’s cells, checking if file is sorted.

If the file is sorted by perturbation, returns a contiguous slice. Otherwise, returns None for the slice (caller should use boolean mask).

Parameters:

adata_or_path (str | Path | AnnData) – Path to h5ad file, or an already-opened AnnData object.
perturbation_label (str) – Label of the perturbation to get slice for.
perturbation_column (str) – Column containing perturbation labels.

Returns:

(slice object or None, is_sorted flag). If is_sorted is True, slice is valid for contiguous access. If is_sorted is False, slice is None and caller should use mask.

Return type:

tuple[slice | None, bool]

crispyx.data.load_obs(path)[source]

Load the obs metadata table from an h5ad file without reading X.

Parameters:: path (str | Path) – Path to the h5ad file.
Returns:: Full obs DataFrame in memory.
Return type:: pd.DataFrame

crispyx.data.load_var(path)[source]

Load the var metadata table from an h5ad file without reading X.

Parameters:: path (str | Path) – Path to the h5ad file.
Returns:: Full var DataFrame in memory.
Return type:: pd.DataFrame

crispyx.data.write_obs(path, df)[source]

Overwrite the obs metadata table in an h5ad file without touching X.

Parameters:

path (str | Path) – Path to the h5ad file (modified in-place).
df (DataFrame) – New obs DataFrame. Must have the same number of rows as the existing obs table. Index values are written as cell barcodes.

Raises:

ValueError – If the DataFrame length does not match the existing n_obs.

Return type:

None

crispyx.data.write_var(path, df)[source]

Overwrite the var metadata table in an h5ad file without touching X.

Parameters:

path (str | Path) – Path to the h5ad file (modified in-place).
df (DataFrame) – New var DataFrame. Must have the same number of rows as the existing var table.

Raises:

ValueError – If the DataFrame length does not match the existing n_vars.

Return type:

None

crispyx.data.standardise_gene_names(path, *, column=None, strip_version=True, normalise_mt_prefix=True, lookup_symbols=False, species='human', unmapped_action='warn', inplace=True)[source]

Standardise gene identifiers in the var metadata table.

Applies a deterministic normalisation pipeline:

Strip Ensembl version suffixes (ENSG00000123.4 → ENSG00000123).
Normalise mt- prefix to MT- (human mitochondrial convention).
Optionally resolve Ensembl IDs to HGNC symbols via mygene (requires pip install mygene). A tqdm progress bar is shown during batched lookups.

Parameters:

path (str | Path) – Path to the h5ad file.
column (str | None) – var column to normalise. None normalises the index (var_names).
strip_version (bool) – Strip ".N" Ensembl version suffixes.
normalise_mt_prefix (bool) – Convert lower-case mt- prefix to MT-.
lookup_symbols (bool) – If True, query mygene to map Ensembl IDs → gene symbols.
species (str) – Species string passed to mygene (default "human").
unmapped_action (Literal['keep', 'error', 'warn']) – What to do for IDs not found by mygene: "keep" leaves them unchanged, "warn" emits a warning, "error" raises.
inplace (bool) – If True, write the result back to the file and return None. If False, return a Series without modifying the file.

Returns:

Normalised gene names when inplace=False, else None.

Return type:

pd.Series or None

crispyx.data.normalise_perturbation_labels(path, column, *, strip_prefixes=None, strip_suffixes=None, strip_suffix_regex=None, control_aliases=None, canonical_control='NTC', inplace=True)[source]

Normalise perturbation labels stored in an obs column.

Applies transformations in order:

Strip specified prefixes via vectorised pd.Series.str.replace.
Strip specified suffixes via vectorised pd.Series.str.replace.
Apply a custom regex substitution (strip_suffix_regex).
Map known control aliases to canonical_control.

Parameters:

path (str | Path) – Path to the h5ad file.
column (str) – obs column containing perturbation labels.
strip_prefixes (list[str] | None) – List of prefix strings to remove (e.g. ["sg-", "sg"]).
strip_suffixes (list[str] | None) – List of suffix strings to remove (e.g. ["_KO", "_KD", "_P1P2"]).
strip_suffix_regex (str | None) – A Python regex applied via pd.Series.str.replace after prefix/suffix stripping.
control_aliases (list[str] | None) – Additional strings (case-insensitive) treated as control labels. The built-in aliases (ntc, ctrl, scramble, …) are always included.
canonical_control (str) – Canonical control label substituted for all matched aliases.
inplace (bool) – If True, write result back and return None. If False, return a Series without modifying the file.

Returns:

Normalised labels when inplace=False, else None.

Return type:

pd.Series or None

crispyx.data.detect_perturbation_column(adata, *, control_label=None, min_unique=2, verbose=True)[source]

Heuristically identify the obs column containing perturbation labels.

Scoring:

+3 if column name matches known aliases (perturbation, gene_target, …).
+2 if dtype is categorical or object.
+1 if unique-value count is in [min_unique, 5000].
+2 if at least one value matches a known control synonym or control_label when provided.

Parameters:

adata (str | Path | AnnData | AnnData) – Backed AnnData, AnnData, or path to h5ad file.
control_label (str | None) – Known control label; boosts the score for columns containing it.
min_unique (int) – Minimum number of unique values required for the +1 bonus.
verbose (bool) – Log the detected column name.

Returns:

Column name with the highest score, or None if no column scores above zero.

Return type:

str or None

crispyx.data.detect_gene_symbol_column(adata, *, verbose=True)[source]

Heuristically identify the var column containing gene symbols.

Scoring:

+3 if column name matches known aliases (gene_symbols, symbol, …).
+2 if values pass _validate_gene_symbols() without error.
+1 if values do not start with Ensembl prefixes.

Returns None when no column qualifies, which signals that var_names should be used as a fallback.

Parameters:

adata (str | Path | AnnData | AnnData) – Backed AnnData, AnnData, or path to h5ad file.
verbose (bool) – Log the detected column name.

Return type:

str or None

crispyx.data.infer_columns(adata, *, control_label=None, verbose=True)[source]

Detect perturbation and gene-symbol columns in a single call.

Parameters:

adata (str | Path | AnnData | AnnData) – Backed AnnData, AnnData, or path to h5ad file.
control_label (str | None) – Known control label forwarded to detect_perturbation_column().
verbose (bool) – Log detected column names.

Returns:

{"perturbation_column": ..., "gene_name_column": ...} where each value is the detected column name or None.

Return type:

dict

class crispyx.data.OverlapResult(count_matrix, jaccard_matrix, set_sizes)[source]

Bases: object

Pairwise overlap statistics between named sets.

Parameters:

count_matrix (DataFrame)
jaccard_matrix (DataFrame)
set_sizes (Series)

count_matrix

(n_sets × n_sets) DataFrame of pairwise intersection sizes.

Type:: pandas.core.frame.DataFrame

jaccard_matrix

(n_sets × n_sets) DataFrame of Jaccard similarity coefficients.

Type:: pandas.core.frame.DataFrame

set_sizes

Series of sizes for each input set.

Type:: pandas.core.series.Series

count_matrix: DataFrame

jaccard_matrix: DataFrame

set_sizes: Series

crispyx.data.compute_overlap(sets_dict, *, metric='both')[source]

Compute pairwise overlap statistics between named sets.

Parameters:

sets_dict (dict[str, set | list]) – Mapping of name → set (or list, converted to set).
metric (Literal['count', 'jaccard', 'both']) – Which matrices to populate: "count", "jaccard", or "both" (default).

Returns:

Object with count_matrix, jaccard_matrix, and set_sizes.

Return type:

OverlapResult

Examples

>>> result = cx.tl.compute_overlap({
...     "dataset_A": {"BRCA1", "TP53", "EGFR"},
...     "dataset_B": {"TP53", "KRAS"},
... })
>>> result.jaccard_matrix

Quality control

Quality control utilities for large .h5ad datasets.

class crispyx.qc.QualityControlResult(cell_mask, gene_mask, perturbation_keep, filtered, cell_gene_counts, gene_cell_counts)[source]

Bases: object

Result of quality control filtering.

Parameters:

cell_mask (ndarray)
gene_mask (ndarray)
perturbation_keep (Dict[str, bool])
filtered (AnnData | None)
cell_gene_counts (ndarray)
gene_cell_counts (ndarray)

cell_mask: ndarray

gene_mask: ndarray

perturbation_keep: Dict[str, bool]

filtered: AnnData | None

cell_gene_counts: ndarray

gene_cell_counts: ndarray

property filtered_path: Path | None

Compatibility accessor exposing the on-disk filename.

Returns None if no output file was written (output_dir was None).

crispyx.qc.filter_cells_by_gene_count(data, *, min_genes=100, gene_name_column=None, chunk_size=2048, return_counts=False, return_full_result=False)[source]

Return a boolean mask selecting cells with at least min_genes expressed genes.

This function can optionally compute both genes-per-cell (row nnz) AND cells-per-gene (column nnz) in a single matrix pass, avoiding a separate gene counting pass later.

Parameters:

data (str | Path | AnnData | AnnData) – Path to h5ad file, or a crispyx/anndata AnnData object.
min_genes (int) – Minimum number of expressed genes per cell.
gene_name_column (str | None) – Column in var containing gene names.
chunk_size (int) – Number of cells to process per chunk.
return_counts (bool) – If True, return (mask, counts) tuple instead of just mask. Ignored if return_full_result is True.
return_full_result (bool) – If True, return a _CellFilterResult containing cell_mask, gene_counts_per_cell, and gene_cell_counts_all (cells per gene for all cells, before any perturbation filtering).

Returns:

Boolean mask, optionally with counts, or full result dataclass.

Return type:

mask or (mask, counts) or _CellFilterResult

crispyx.qc.filter_perturbations_by_cell_count(data, *, perturbation_column, control_label=None, min_cells=50, base_mask=None, return_counts=False)[source]

Return a mask keeping cells whose perturbation has sufficient representation.

Parameters:

data (str | Path | AnnData | AnnData) – Path to h5ad file, or a crispyx/anndata AnnData object.
perturbation_column (str) – Column in obs containing perturbation labels.
control_label (str | None) – Label identifying control cells. If None, auto-detected.
min_cells (int) – Minimum number of cells required per perturbation.
base_mask (ndarray | None) – Optional mask for cells to consider (e.g., from prior filtering).
return_counts (bool) – If True, return (mask, cell_counts_per_perturbation) tuple.

Returns:

Boolean mask, optionally with cell counts per perturbation label.

Return type:

mask or (mask, counts)

crispyx.qc.filter_genes_by_cell_count(data, *, min_cells=100, cell_mask=None, gene_name_column=None, chunk_size=2048, return_counts=False)[source]

Return a boolean mask selecting genes expressed in at least min_cells cells.

Parameters:

data (str | Path | AnnData | AnnData) – Path to h5ad file, or a crispyx/anndata AnnData object.
min_cells (int) – Minimum number of cells expressing each gene.
cell_mask (ndarray | None) – Optional mask for cells to consider.
gene_name_column (str | None) – Column in var containing gene names.
chunk_size (int) – Number of cells to process per chunk.
return_counts (bool) – If True, return (mask, counts) tuple instead of just mask.

Returns:

Boolean mask, optionally with the raw cell counts per gene.

Return type:

mask or (mask, counts)

crispyx.qc.quality_control_summary(data, *, min_genes=100, min_cells_per_perturbation=50, min_cells_per_gene=100, perturbation_column, control_label=None, gene_name_column=None, chunk_size=None, memory_limit_gb=None, data_name=None, output_path=None, output_dir=None, cache_mode='memmap', delta_threshold=0.3, force_streaming=False, verbose=False)[source]

Run QC with automatic strategy selection for optimal performance.

This function automatically selects the best QC strategy based on: 1. Small data (any format): In-memory processing (fastest) 2. Large CSC data: Column-oriented streaming (memory efficient for CSC) 3. Large CSR/dense data: Row-oriented streaming (current behavior)

Parameters:

data (str | Path | AnnData | AnnData) – Path to h5ad file, or a crispyx/anndata AnnData object.
min_genes (int) – Minimum number of expressed genes per cell.
min_cells_per_perturbation (int) – Minimum number of cells required per perturbation.
min_cells_per_gene (int) – Minimum number of cells expressing each gene.
perturbation_column (str) – Column in obs containing perturbation labels.
control_label (str | None) – Label identifying control cells. If None, auto-detected.
gene_name_column (str | None) – Column in var containing gene names.
chunk_size (int | None) – Number of cells to process per chunk. If None, automatically calculated based on available memory.
memory_limit_gb (float | None) – Optional memory limit in GB for strategy selection and chunk size. If None, auto-detected from system memory.
data_name (str | None) – Base name for output files.
output_path (str | Path | None) – Exact path for the output h5ad file. When provided, output_dir and data_name are ignored.
output_dir (str | Path | None) – Directory for output files. If None, returns QC masks without writing a filtered h5ad file (QualityControlResult.filtered will be None). Deprecated; use output_path instead. Will be removed in the next major version.
cache_mode (Literal['memory', 'memmap', 'none']) – Cache strategy for row-oriented streaming: ‘memory’ (fast, high RAM), ‘memmap’ (low RAM, disk-based), or ‘none’ (no caching, requires re-reading source during write). Default is ‘memmap’ for better memory efficiency.
delta_threshold (float) – Threshold for delta adjustment in row-oriented streaming. Default 0.3 (30%).
force_streaming (bool) – If True, always use streaming path regardless of data size. Useful for testing or memory-constrained environments.
verbose (int | bool)

Returns:

Dataclass containing masks, filtered AnnData (or None if output_dir is None), and QC statistics.

Return type:

QualityControlResult

Pseudo-bulk aggregation

Pseudo-bulk effect size estimators operating directly on .h5ad files.

crispyx.pseudobulk.compute_average_log_expression(data, *, perturbation_column, control_label=None, gene_name_column=None, perturbations=None, batch_column=None, chunk_size=None, memory_limit_gb=None, data_name=None, output_path=None, output_dir=None, verbose=False)[source]

Compute average log-normalised expression per perturbation relative to control.

For each perturbation group, computes the per-gene mean of log1p-normalised expression and stores the difference relative to the control group as the effect size.

Parameters:

data (str | Path | AnnData | AnnData) – Path to an h5ad file, or a backed/in-memory AnnData object.
perturbation_column (str) – Column in adata.obs that identifies perturbation groups.
control_label (str | None) – Label of the control group. If None, inferred from common patterns ('non-targeting', 'control', etc.).
gene_name_column (str | None) – Column in adata.var with gene symbols. If None, uses adata.var_names.
perturbations (Iterable[str] | None) – Subset of perturbation labels to include. If None, all non-control groups are processed.
batch_column (str | None) – Column in adata.obs identifying the batch of each cell. When provided, effects are computed within each batch and combined with harmonic-count weights (w_b = n_pert_b * n_ctrl_b / (n_pert_b + n_ctrl_b)), removing batch-driven confounding. The per-(perturbation, batch) accumulator is spilled to a disk-backed np.memmap so peak memory stays bounded regardless of the number of batches. When None (default), a single pooled effect is computed.
chunk_size (int | None) – Number of cells to process per chunk. If None (default), auto-determined from the dataset shape and the available memory budget (see memory_limit_gb).
memory_limit_gb (float | None) – Soft memory budget in gigabytes used to size the streaming cell chunk. When None (default), the available system memory is auto-detected via psutil. Passing a value (e.g. memory_limit_gb=128) caps the budget for SLURM / cgroup-constrained environments. Ignored when an explicit chunk_size is given. Only the chunk size is affected; computed values are identical regardless of the budget.
data_name (str | None) – Custom stem for the output filename. If None, the input file stem is used with a _cx_avg_log_effects suffix.
output_path (str | Path | None) – Exact path for the output h5ad file. When provided, output_dir and data_name are ignored.
output_dir (str | Path | None) – Directory for the output file. Defaults to the input file’s directory. Deprecated – use output_path instead. Will be removed in the next major version.
verbose (int | bool) – Verbosity level. 0 / False is silent; 1 / True prints a summary line.

Returns:

On-disk AnnData where X contains the effect-size matrix (perturbation mean minus control mean in log-normalised space), layers['perturbation_mean'] contains per-perturbation means, and uns['control_mean'] contains the control mean vector. When batch_column is set, X holds the batch-corrected effect (harmonic-count weighted average of within-batch differences), layers['perturbation_mean'] holds the batch-corrected per-perturbation mean, layers['control_mean_matched'] holds the per-perturbation weight-matched control reference (so X = perturbation_mean - control_mean_matched), uns['control_mean'] retains the pooled control mean, and uns['batch_column'] / uns['batch_ids'] record the batch column name and the batch labels encountered.

Return type:

AnnData

crispyx.pseudobulk.compute_pseudobulk_expression(data, *, perturbation_column, control_label=None, gene_name_column=None, perturbations=None, batch_column=None, baseline_count=1.0, chunk_size=None, memory_limit_gb=None, data_name=None, output_path=None, output_dir=None, verbose=False)[source]

Compute pseudo-bulk log-fold changes relative to control.

Aggregates normalised counts per perturbation group into a pseudo-bulk profile (sum divided by cell count), applies log1p scaling with a baseline_count offset, and stores the difference relative to the control group as the log-fold change effect size.

Parameters:

data (str | Path | AnnData | AnnData) – Path to an h5ad file, or a backed/in-memory AnnData object.
perturbation_column (str) – Column in adata.obs that identifies perturbation groups.
control_label (str | None) – Label of the control group. If None, inferred from common patterns ('non-targeting', 'control', etc.).
gene_name_column (str | None) – Column in adata.var with gene symbols. If None, uses adata.var_names.
perturbations (Iterable[str] | None) – Subset of perturbation labels to include. If None, all non-control groups are processed.
batch_column (str | None) – Column in adata.obs identifying the batch of each cell. When provided, log-fold changes are computed within each batch and combined with harmonic-count weights (w_b = n_pert_b * n_ctrl_b / (n_pert_b + n_ctrl_b)), removing batch-driven confounding. The per-(perturbation, batch) accumulator is spilled to a disk-backed np.memmap so peak memory stays bounded regardless of the number of batches. When None (default), a single pooled log-fold change is computed.
baseline_count (float) – Pseudo-count added before log transformation (log1p(baseline_count * mean_counts)). Default 1.0.
chunk_size (int | None) – Number of cells to process per chunk. If None (default), auto-determined from the dataset shape and the available memory budget (see memory_limit_gb).
memory_limit_gb (float | None) – Soft memory budget in gigabytes used to size the streaming cell chunk. When None (default), the available system memory is auto-detected via psutil. Passing a value (e.g. memory_limit_gb=128) caps the budget for SLURM / cgroup-constrained environments. Ignored when an explicit chunk_size is given. Only the chunk size is affected; computed values are identical regardless of the budget.
data_name (str | None) – Custom stem for the output filename. If None, the input file stem is used with a _cx_pseudobulk_effects suffix.
output_path (str | Path | None) – Exact path for the output h5ad file. When provided, output_dir and data_name are ignored.
output_dir (str | Path | None) – Directory for the output file. Defaults to the input file’s directory. Deprecated – use output_path instead. Will be removed in the next major version.
verbose (int | bool) – Verbosity level. 0 / False is silent; 1 / True prints a summary line.

Returns:

On-disk AnnData where X contains the pseudo-bulk log-fold change matrix (perturbation pseudo-bulk minus control pseudo-bulk), layers['perturbation_bulk'] contains per-perturbation pseudo-bulk vectors, uns['control_bulk'] the control pseudo-bulk vector, and uns['baseline_count'] the scaling offset used. When batch_column is set, X holds the batch-corrected log-fold change (harmonic-count weighted average of within-batch differences), layers['perturbation_bulk'] holds the batch-corrected per-perturbation pseudo-bulk, layers['control_bulk_matched'] holds the per-perturbation weight-matched control reference (so X = perturbation_bulk - control_bulk_matched), uns['control_bulk'] retains the pooled control pseudo-bulk, and uns['batch_column'] / uns['batch_ids'] record the batch column name and the batch labels encountered.

Return type:

AnnData

Differential expression

All three DE functions accept a force: bool = False parameter (v0.0.3+). When the expected output .h5ad file already exists on disk, the function reloads and returns the saved result instead of rerunning. Pass force=True to overwrite. See Auto-reload and re-run control in the usage guide.

Differential expression testing utilities.

class crispyx.de.DifferentialExpressionResult(genes: 'pd.Index', effect_size: 'np.ndarray', statistic: 'np.ndarray', pvalue: 'np.ndarray', method: 'str', perturbation: 'str', pvalue_adj: 'np.ndarray | None' = None, result: 'AnnData | None' = None)[source]

Bases: object

Parameters:

genes (Index)
effect_size (ndarray)
statistic (ndarray)
pvalue (ndarray)
method (str)
perturbation (str)
pvalue_adj (ndarray | None)
result (AnnData | None)

genes: Index

effect_size: ndarray

statistic: ndarray

pvalue: ndarray

method: str

perturbation: str

pvalue_adj: ndarray | None = None

result: AnnData | None = None

property result_path: Path

Path to the on-disk result file.

Returns:: Absolute path to the .h5ad file backing the result.
Return type:: Path
Raises:: AttributeError – If the result AnnData has not been initialised.

class crispyx.de.RankGenesGroupsResult(genes: 'pd.Index', groups: 'list[str]', statistics: 'np.ndarray', pvalues: 'np.ndarray', pvalues_adj: 'np.ndarray', logfoldchanges: 'np.ndarray', effect_size: 'np.ndarray', u_statistics: 'np.ndarray', pts: 'np.ndarray', pts_rest: 'np.ndarray', order: 'np.ndarray', groupby: 'str', method: 'str', control_label: 'str', tie_correct: 'bool', pvalue_correction: "Literal['benjamini-hochberg', 'bonferroni']", result: 'AnnData | None' = None)[source]

Bases: Mapping[str, DifferentialExpressionResult]

Parameters:

genes (Index)
groups (list[str])
statistics (ndarray)
pvalues (ndarray)
pvalues_adj (ndarray)
logfoldchanges (ndarray)
effect_size (ndarray)
u_statistics (ndarray)
pts (ndarray)
pts_rest (ndarray)
order (ndarray)
groupby (str)
method (str)
control_label (str)
tie_correct (bool)
pvalue_correction (Literal['benjamini-hochberg', 'bonferroni'])
result (AnnData | None)

genes: Index

groups: list[str]

statistics: ndarray

pvalues: ndarray

pvalues_adj: ndarray

logfoldchanges: ndarray

effect_size: ndarray

u_statistics: ndarray

pts: ndarray

pts_rest: ndarray

order: ndarray

groupby: str

method: str

control_label: str

tie_correct: bool

pvalue_correction: Literal['benjamini-hochberg', 'bonferroni']

result: AnnData | None = None

property result_path: Path

items()[source]: Return (group_name, DifferentialExpressionResult) pairs.

to_rank_genes_groups_dict()[source]

Convert results to Scanpy-compatible rank_genes_groups dict.

Returns:: Dictionary with keys 'params', 'names', 'scores', 'logfoldchanges', 'pvals', 'pvals_adj', 'pts', 'pts_rest', 'auc', 'u_stat', and 'full'. Each value (except 'params' and 'full') is a NumPy record array sorted by descending absolute score.
Return type:: dict

to_full_order_dict()[source]

Return unsorted matrices keyed by statistic name.

Unlike to_rank_genes_groups_dict(), values are in the original gene order (not sorted by score).

Returns:: Keys: 'scores', 'pvals', 'pvals_adj', 'logfoldchanges', 'auc', 'u_stat', 'pts', 'pts_rest'. Each value is a 2-D (n_groups, n_genes) NumPy array (copy).
Return type:: dict

crispyx.de.t_test(data, *, perturbation_column=None, groupby=None, control_label=None, reference=None, gene_name_column=None, perturbations=None, min_cells_expressed=0, min_pct_ctrl=0.01, min_pct_pert=0.002, min_pct_both=None, min_mean_ctrl=0.05, min_mean_pert=0.005, cell_chunk_size=None, data_name=None, output_path=None, output_dir=None, n_jobs=None, verbose=False, resume=False, checkpoint_interval=None, scanpy_format=False, memory_limit_gb=None, force=False)[source]

Perform a t-test comparing log-expression means for each perturbation.

Returns a RankGenesGroupsResult containing differential expression statistics. Results are stored in an h5ad file with layers containing the statistics.

The RankGenesGroupsResult implements the Mapping interface, so it can be used like a dict: result[perturbation_label] returns a DifferentialExpressionResult for that perturbation.

Input data should already be normalized and log-transformed (for example using scanpy.pp.normalize_total followed by scanpy.pp.log1p). To maintain backward compatibility with Scanpy-style workflows, count-like inputs are automatically normalized and log-transformed in streaming fashion, with a warning to encourage explicit preprocessing upstream.

Parameters:

data (str | Path | AnnData | AnnData) – Path to an h5ad file, or a crispyx/anndata AnnData object containing log-transformed expression data.
perturbation_column (str | None) – Column in adata.obs indicating perturbation labels.
groupby (str | None) – Alias for perturbation_column (Scanpy-compatible). Mutually exclusive with perturbation_column.
control_label (str | None) – Label for the control/reference group. If None, infers from common patterns.
reference (str | None) – Alias for control_label (Scanpy-compatible). Mutually exclusive with control_label.
gene_name_column (str | None) – Column in adata.var with gene symbols. If None, uses adata.var_names.
perturbations (Iterable[str] | None) – Specific perturbations to test. If None, tests all non-control groups.
min_cells_expressed (int) – Minimum total cells (control + perturbation) expressing a gene for testing.
min_pct_ctrl (float) – Minimum fraction of expressing cells for the control side. A gene is excluded only when both the control side and the perturbed side are jointly low. Default 0.01.
min_pct_pert (float) – Minimum fraction of expressing cells for the perturbed side. Default 0.002 (lower than ctrl; induction from near-zero baseline is biologically valid). Set to 0.0 to disable the pct check on pert.
min_pct_both (float | None) – If not None, overrides both min_pct_ctrl and min_pct_pert with the same value.
min_mean_ctrl (float) – Minimum mean expression (log1p units) for the control side. Default 0.05. Excluded genes are written as NaN in score / pvalue / logfoldchanges / effect_size; pts and pts_rest remain populated.
min_mean_pert (float) – Minimum mean expression for the perturbed side. Default 0.005. Together with min_pct_pert this forms a dual condition that is more robust to doublet / ambient-RNA artefacts than pct alone.
cell_chunk_size (int | None) – Number of cells to process per chunk (memory vs. speed tradeoff). This controls streaming along the cell axis and is distinct from any future perturbation_chunk_size option that would batch perturbations. Data must already be normalized/log-transformed before chunking.
data_name (str | None) – Custom name for output file. If None, uses “t_test” suffix.
output_path (str | Path | None) – Exact path for the output h5ad file. When provided, output_dir and data_name are ignored.
output_dir (str | Path | None) – Directory for output h5ad file. Defaults to input file’s directory. Deprecated; use output_path instead. Will be removed in the next major version.
n_jobs (int | None) – Number of parallel workers for computing statistics across perturbations. If None, uses all available cores. If 1, runs sequentially. If -1, uses all available cores.
verbose (int | bool) – If True, show a progress bar for perturbation processing. Requires tqdm.
resume (bool) – If True, attempt to resume from a previous interrupted run using checkpoint.
checkpoint_interval (int | None) – Number of perturbations between checkpoint saves. Auto-determined if None.
scanpy_format (bool) – If True, write Scanpy-compatible uns['rank_genes_groups'] structure in addition to the layer-based storage. Adds ~2-6 seconds of I/O overhead for large datasets. Default False for performance.
memory_limit_gb (float | None) – Maximum memory budget in GB. Used for automatic chunk size calculation. When None (default), detects available system memory via psutil. For HPC environments, set this to your SLURM --mem value (e.g., memory_limit_gb=128).
force (bool) – If True, rerun the analysis even when the output h5ad file already exists. If False (default), load and return the existing result instead of rerunning.

Returns:

Differential expression results. Access results via dict-like interface: result[label].effect_size, result[label].pvalue, etc. The h5ad file path is available at result.result_path.

Return type:

crispyx.de.nb_glm_test(data, *, perturbation_column=None, groupby=None, control_label=None, reference=None, covariates=None, gene_name_column=None, perturbations=None, size_factors=None, size_factor_method='sparse', size_factor_scope='global', scale_size_factors=True, dispersion=None, dispersion_method='cox-reid', dispersion_scope='global', share_dispersion=False, use_map_dispersion=True, shrink_dispersion=True, optimization_method='lbfgsb', max_iter=25, tol=1e-06, min_mu=0.5, poisson_init_iter=5, chunk_size=None, irls_batch_size=128, min_cells_expressed=0, min_pct_ctrl=0.01, min_pct_pert=0.002, min_pct_both=None, min_mean_ctrl=0.05, min_mean_pert=0.005, min_total_count=1.0, cook_filter=False, lfc_shrinkage_type='none', lfc_base='log2', corr_method='benjamini-hochberg', se_method='sandwich', data_name=None, output_path=None, output_dir=None, scanpy_format=False, verbose=False, profiling=False, resume=False, checkpoint_interval=None, memory_limit_gb=None, max_dense_fraction=0.3, n_jobs=None, use_control_cache=True, freeze_control=None, force=False)[source]

Perform negative binomial GLM differential expression test.

Returns a RankGenesGroupsResult containing differential expression statistics. Uses a negative binomial GLM framework that can incorporate covariates. Results are stored in an h5ad file with layers containing the statistics.

The RankGenesGroupsResult implements the Mapping interface, so it can be used like a dict: result[perturbation_label] returns a DifferentialExpressionResult for that perturbation.

Parameters:

data (str | Path | AnnData | AnnData) – Path to an h5ad file, or a crispyx/anndata AnnData object containing raw count data.
perturbation_column (str | None) – Column in adata.obs indicating perturbation labels.
groupby (str | None) – Alias for perturbation_column (Scanpy-compatible). Mutually exclusive with perturbation_column.
control_label (str | None) – Label for the control/reference group. If None, infers from common patterns.
reference (str | None) – Alias for control_label (Scanpy-compatible). Mutually exclusive with control_label.
covariates (Iterable[str] | None) – Additional columns in adata.obs to include as covariates in the GLM.
gene_name_column (str | None) – Column in adata.var with gene symbols. If None, uses adata.var_names.
perturbations (Iterable[str] | None) – Specific perturbations to test. If None, tests all non-control groups.
size_factors (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None) – Optional array of per-cell size factors. If None, computes size factors using the method specified by size_factor_method.
size_factor_method (Literal['sparse', 'deseq2']) –
Method for computing size factors when size_factors is None.
- "sparse": Sparse-aware median-of-ratios (default). Computes geometric means using only non-zero values, suitable for sparse single-cell data.
- "deseq2": Classic DESeq2/PyDESeq2 style. Uses only genes expressed in ALL cells (typically ~50-100 genes). Provides better numerical alignment with PyDESeq2 results but may be less robust for very sparse data.
size_factor_scope (Literal['global', 'per_comparison']) –
Scope for size factor computation.
- "global" (default): Compute size factors once on the full dataset. Recommended for CRISPR screens where all cells come from the same experiment and share a common sequencing depth distribution. Faster when combined with use_control_cache=True. Note: produces different results from PyDESeq2 (rho ~ 0.7-0.8) which uses per-comparison normalization.
- "per_comparison": Compute size factors separately for each control + perturbation comparison. This matches PyDESeq2’s behavior exactly, leading to near-perfect LFC, statistic, and p-value concordance (rho > 0.97 on Tian-crispra, rho > 0.99 on Adamson_subset). Use this when PyDESeq2 compatibility is required or for bulk RNA-seq style analysis.
scale_size_factors (bool) – If True (default), scale size factors so their geometric mean equals 1. This is the standard DESeq2/crispyx behavior. If False, use raw median-of-ratios size factors without rescaling, which matches PyDESeq2’s default behavior and can improve numerical alignment.
dispersion (float | None) – Fixed dispersion parameter for negative binomial. If None, estimates per gene.
dispersion_method (Literal['moments', 'cox-reid']) –
Method for estimating dispersion when dispersion is None.
- "moments": Method-of-moments (fast but less accurate).
- "cox-reid": Cox-Reid adjusted profile likelihood (slower but more accurate, similar to DESeq2). This is the default.
dispersion_scope (Literal['global', 'per_comparison']) –
Scope for dispersion estimation.
- "global" (default): Precompute dispersion once using all cells (control + all perturbations). This is ~10x faster for multi-perturbation datasets since MAP dispersion estimation is done once instead of per-comparison. Recommended when perturbation effects are expected to be small relative to baseline expression (typical for CRISPR screens).
- "per_comparison": Estimate dispersion separately for each control + perturbation comparison. More accurate when perturbations cause large changes in gene expression variance, but significantly slower.
share_dispersion (bool) – If True, estimate dispersion once using all cells, then use the same dispersion values for all Wald tests. If False (default), estimate dispersion separately for each perturbation comparison.
use_map_dispersion (bool) – If True (default), use MAP dispersion estimation with mean-dispersion trend. If False, use MLE dispersion without trend-based shrinkage.
shrink_dispersion (bool) – If True, fit a mean-dispersion trend and shrink gene-wise dispersions toward the trend using an empirical Bayes prior.
optimization_method (Literal['irls', 'lbfgsb']) –
Method for coefficient optimization.
- "lbfgsb": L-BFGS-B optimization (PyDESeq2 style, default). Directly optimizes the negative binomial log-likelihood.
- "irls": Iteratively Reweighted Least Squares (Fisher scoring). The classic GLM fitting approach.
max_iter (int) – Maximum iterations for GLM fitting.
tol (float) – Convergence tolerance for GLM fitting.
min_mu (float) – Minimum mean threshold for IRLS numerical stability. Predicted means are clamped to max(min_mu, predicted) during fitting to prevent numerical instability from very small means. Default: 0.5, matching PyDESeq2’s default. Set to 0.0 to disable clamping.
poisson_init_iter (int) – Initial Poisson iterations before switching to negative binomial.
chunk_size (int | None) – Number of genes to process per chunk (memory vs. speed tradeoff). Smaller values stream more, reducing peak memory at the cost of additional I/O.
irls_batch_size (int | None) – Maximum number of genes to densify per IRLS step. Keep this small to limit per-iteration memory when working with large sparse matrices. Set to None to process each chunk without additional batching.
min_cells_expressed (int) – Minimum total cells (control + perturbation) expressing a gene for testing.
min_total_count (float) – Minimum total count across all cells for a gene to be tested.
min_pct_ctrl (float) – Minimum fraction of expressing cells for the control side. A gene is excluded only when both sides are jointly low. Default 0.01.
min_pct_pert (float) – Minimum fraction of expressing cells for the perturbed side. Default 0.002. Combined with min_mean_pert this forms a dual condition that is more robust to doublet / ambient-RNA artefacts.
min_pct_both (float | None) – If not None, overrides both min_pct_ctrl and min_pct_pert with the same value.
min_mean_ctrl (float) – Minimum mean (size-factor-normalised) expression for the control side. Default 0.05. Excluded genes appear as NaN in pvalue / effect / logfc / se; pts and mean remain populated.
min_mean_pert (float) – Minimum mean expression for the perturbed side. Default 0.005.
cook_filter (bool) – Whether to apply Cook’s distance outlier filtering when available.
lfc_shrinkage_type (Literal['apeglm', 'none']) –
Type of log-fold change shrinkage to apply.
- "none": No shrinkage (default).
- "apeglm": Adaptive shrinkage using Cauchy prior (PyDESeq2-compatible). Preserves large effects while shrinking small/uncertain effects toward zero. Also updates standard errors to reflect posterior uncertainty.
lfc_base (Literal['log2', 'ln']) –
Log base for fold change output.
- "log2" (default): Output log2 fold change, matching PyDESeq2/edgeR.
- "ln": Output natural log fold change (raw GLM coefficients).
Standard error is also converted to match the selected log base. Wald statistics remain unchanged since both LFC and SE are scaled equally.
corr_method (Literal['benjamini-hochberg', 'bonferroni']) – Method for p-value correction: "benjamini-hochberg" or "bonferroni".
se_method (Literal['sandwich', 'fisher']) –
Method for computing standard errors.
- "sandwich" (default): Sandwich estimator SE = sqrt(c’ @ H @ M @ H @ c). More robust to model misspecification.
- "fisher": Standard Fisher information SE = sqrt(diag(inv(X’WX + ridge*I))). Matches PyDESeq2’s approach for better p-value parity.
data_name (str | None) – Custom name for output file. If None, uses “nb_glm” suffix.
output_path (str | Path | None) – Exact path for the output h5ad file. When provided, output_dir and data_name are ignored.
output_dir (str | Path | None) – Directory for output h5ad file. Defaults to input file’s directory. Deprecated; use output_path instead. Will be removed in the next major version.
scanpy_format (bool) – If True, write Scanpy-compatible uns['rank_genes_groups'] structure in addition to the layer-based storage. Adds ~2-6 seconds of I/O overhead for large datasets. Default False for performance.
verbose (int | bool) – If True, show a progress bar for perturbation fitting and log per-perturbation completion at DEBUG level. Requires tqdm to be installed for progress bar.
profiling (bool) – If True, enable timing and memory profiling. When enabled, stores profiling data in adata.uns["profiling"] with fields fit_seconds, fit_peak_memory_mb, and profiling_enabled. When False (default), adata.uns["profiling"] is set to "NA" to avoid profiling overhead in production.
resume (bool) – If True, attempt to resume from a previous interrupted run. Reads the checkpoint file to determine which perturbations have already been completed and skips them. If the checkpoint file is missing or corrupted, falls back to scanning the output h5ad to detect completed perturbations.
checkpoint_interval (int | None) – Number of perturbations to process between checkpoint saves. If None, auto-determined based on dataset size (1 for <100 perturbations, 10 for <1000, 50 for larger). The checkpoint file <output>_progress.json is written atomically to prevent corruption.
memory_limit_gb (float | None) – Optional memory limit in GB. If provided, this is used together with available system memory to determine when to switch to streaming mode for global dispersion estimation. The effective limit is min(available_memory, memory_limit_gb). Default is None (use system memory only).
max_dense_fraction (float) – Maximum fraction of available memory to use for dense matrix operations. If the estimated memory for densifying the full cell×gene matrix exceeds max_dense_fraction × min(available_memory, memory_limit_gb), the function switches to streaming mode. Default is 0.3 (30% of available memory).
n_jobs (int | None) – Number of parallel workers for fitting GLMs across perturbations. If None, uses all available cores. If 1, runs sequentially. If -1, uses all available cores.
use_control_cache (bool) – If True (default), precompute control cell statistics (intercept, weights, XᵀWX contributions) once and reuse them across all perturbation comparisons. This can significantly reduce memory and computation time when there are many perturbations and the control group is large. Only applies when no covariates are specified and size_factor_scope=”global”.
freeze_control (bool | None) –
Whether to use frozen control sufficient statistics instead of raw control matrix for parallel fitting. This dramatically reduces per-worker memory from ~5GB to ~1MB for large datasets, enabling more parallel workers.
- None (default): Auto-detect based on dataset size. Frozen control is enabled when control_matrix serialization would limit workers to <4, AND the required settings (dispersion_scope=’global’, shrink_dispersion=True) are met. For most large datasets (>500K cells), this auto-enables.
- True: Force frozen control mode. Raises ValueError if requirements not met.
- False: Disable frozen control (use raw control matrix).
Memory efficiency: Per-worker pickle size is reduced from (control_n × n_genes × 8) bytes to just ~1MB of sufficient statistics (W_sum, Wz_sum arrays).

Example: For Replogle-GW-k562 (75K control cells × 8K genes): - Without freeze_control: ~4.7 GB per worker → 2 workers max @ 128GB - With freeze_control: ~1 MB per worker → 32 workers @ 128GB - Time reduction: ~300 hours → ~10 hours

Requirements (enforced when True, auto-checked when None): - dispersion_scope=”global” (per-comparison dispersion needs raw matrix) - shrink_dispersion=True (ensures global dispersion is computed) - use_control_cache=True (required for caching)

Technical note: The intercept (β₀) is frozen to the control-only estimate. This is valid because control cells have perturbation indicator = 0, so μ_control = exp(β₀ + offset) is independent of the perturbation effect.
force (bool) – If True, rerun the analysis even when the output h5ad file already exists. If False (default), load and return the existing result instead of rerunning.

Returns:

Differential expression results. Access results via dict-like interface: result[label].effect_size, result[label].pvalue, etc. The h5ad file path is available at result.result_path.

Return type:

crispyx.de.wilcoxon_test(data, *, perturbation_column=None, groupby=None, control_label=None, reference=None, gene_name_column=None, perturbations=None, batch_column=None, min_cells_expressed=0, min_pct_ctrl=0.01, min_pct_pert=0.002, min_pct_both=None, min_mean_ctrl=0.05, min_mean_pert=0.005, chunk_size=None, tie_correct=True, corr_method='benjamini-hochberg', data_name=None, output_path=None, output_dir=None, n_jobs=None, verbose=False, resume=False, checkpoint_interval=None, scanpy_format=False, memory_limit_gb=None, force=False)[source]

Perform a Wilcoxon rank-sum (Mann-Whitney U) test for each gene.

Input data must already be library-size normalised and log-transformed. The function operates directly on the provided matrix without additional preprocessing. As a safeguard, the first sparse chunk is inspected and a warning is emitted if the data appear to be raw counts (integer or count-like floats), encouraging explicit preprocessing upstream.

Parameters:

data (str | Path | AnnData | AnnData) – Path to an h5ad file, or a crispyx/anndata AnnData object containing normalised, log-transformed data.
perturbation_column (str | None) – Column in adata.obs indicating perturbation labels.
groupby (str | None) – Alias for perturbation_column (Scanpy-compatible). Mutually exclusive with perturbation_column.
control_label (str | None) – Label for the control/reference group. If None, infers from common patterns.
reference (str | None) – Alias for control_label (Scanpy-compatible). Mutually exclusive with control_label.
gene_name_column (str | None) – Column in adata.var with gene symbols. If None, uses adata.var_names.
perturbations (Iterable[str] | None) – Specific perturbations to test. If None, tests all non-control groups.
batch_column (str | None) – Column in adata.obs identifying the batch / stratum of each cell (e.g. 10x lane / gem-group). When provided, a batch-stratified (van Elteren) Wilcoxon test is performed: cells are ranked within each batch separately and the per-stratum rank statistics are combined with unit weights (equivalent to summing the Mann-Whitney U statistics across strata). This removes rank inflation caused by batch effects. The low-expression filter, log-fold changes and pts remain pooled across all cells; only the rank test is stratified. When None (default) the standard pooled Wilcoxon test is used.
min_cells_expressed (int) – Minimum total cells (control + perturbation) expressing a gene for testing. Genes below this threshold are assigned p-value=1 and effect_size=0.
min_pct_ctrl (float) – Minimum fraction of expressing cells for the control side. A gene is excluded only when both sides are jointly low. Default 0.01.
min_pct_pert (float) – Minimum fraction of expressing cells for the perturbed side. Default 0.002 (lower than ctrl; induction from near-zero baseline is biologically valid). Combined with min_mean_pert this forms a dual condition more robust than pct alone.
min_pct_both (float | None) – If not None, overrides both min_pct_ctrl and min_pct_pert with the same value.
min_mean_ctrl (float) – Minimum mean log1p expression for the control side. Default 0.05. Excluded genes are written as NaN in score / pvalue / logfoldchanges / effect_size; pts / pts_rest remain populated. Set to 0.0 together with pct thresholds to disable.
min_mean_pert (float) – Minimum mean expression for the perturbed side. Default 0.005.
chunk_size (int | None) – Number of genes to process per chunk (memory vs. speed tradeoff). Smaller values stream more, reducing peak memory at the cost of additional I/O.
tie_correct (bool) – Whether to apply tie correction to the U statistic. Default True for more accurate p-values when ties are present in the data.
corr_method (Literal['benjamini-hochberg', 'bonferroni']) – Method for p-value correction: “benjamini-hochberg” or “bonferroni”.
data_name (str | None) – Custom name for output file. If None, uses “wilcoxon” suffix.
output_path (str | Path | None) – Exact path for the output h5ad file. When provided, output_dir and data_name are ignored.
output_dir (str | Path | None) – Directory for output h5ad file. Defaults to input file’s directory. Deprecated; use output_path instead. Will be removed in the next major version.
n_jobs (int | None) – Number of parallel workers for computing statistics across perturbations. If None, uses all available cores. If 1, runs sequentially.
verbose (int | bool) – If True, show a progress bar for gene chunk processing. Requires tqdm.
resume (bool) – If True and a completed output already exists, load that output as usual. Resuming an interrupted Wilcoxon run from a progress checkpoint is currently disabled because partial chunk arrays are stored in temporary files during execution.
checkpoint_interval (int | None) – Number of gene chunks between checkpoint saves. If None, auto-determined based on dataset size. The checkpoint file <output>.progress.json is written atomically to prevent corruption.
scanpy_format (bool) – If True, write Scanpy-compatible uns['rank_genes_groups'] structure in addition to the layer-based storage. Adds ~2-6 seconds of I/O overhead for large datasets. Default False for performance.
memory_limit_gb (float | None) – Maximum memory budget in GB. Controls whether the streaming path is used for very large datasets. When None (default), detects available system memory via psutil. For HPC environments with fixed allocations, set this to your SLURM --mem value (e.g., memory_limit_gb=128).
force (bool) – If True, rerun the analysis even when the output h5ad file already exists. If False (default), load and return the existing result instead of rerunning.

Returns:

Differential expression results. Access results via dict-like interface: result[label].effect_size, result[label].pvalue, etc. The h5ad file path is available at result.result_path.

Return type:

crispyx.de.shrink_lfc(data, *, output_dir=None, data_name=None, method='stats', prior_scale_mode='global', min_mu=0.0, n_jobs=-1, batch_size=128, profiling=False, memory_limit_gb=None)[source]

Apply apeGLM log-fold change shrinkage to existing NB-GLM results.

This function applies apeGLM shrinkage using a Cauchy prior on the LFC coefficient. Two methods are available:

stats (default, recommended): Uses pre-computed MLE LFC and SE from the h5ad file with vectorized Newton-Raphson optimization. ~35× faster than full and maintains consistency with stored MLE coefficients.
full: Re-loads original count data and runs per-gene L-BFGS-B optimization. May produce different results from stored MLE for lowly expressed genes due to min_mu clamping differences in the likelihood.

Note

The “stats” method is recommended because CRISPYx NB-GLM fitting uses min_mu=0.5 clamping for numerical stability, which affects the stored MLE coefficients. The “full” method re-evaluates the likelihood without this constraint, potentially finding different optima for lowly expressed genes. The “stats” method preserves shrinkage direction (always toward zero) by working directly with the stored statistics.

This enables separating the base NB-GLM fitting from shrinkage for: - Benchmarking: measure base fitting and shrinkage times separately - Flexibility: apply shrinkage to existing results - Speed: use stats method for production

Parameters:

data (str | Path | AnnData | AnnData) – Path to an h5ad file, or a crispyx/anndata AnnData object containing NB-GLM results from nb_glm_test. Must have required layers and metadata in uns.
output_dir (str | Path | None) – Directory for output h5ad file. Defaults to input file’s directory.
data_name (str | None) – Custom name for output file. If None, appends “_shrunk” to input name.
method (Literal['stats', 'full']) –
Shrinkage method to use:
- "stats" (default): Fast vectorized shrinkage using pre-computed MLE statistics. Uses Newton-Raphson optimization across all genes simultaneously. ~35× faster than “full” and maintains consistency with stored MLE coefficients.
- "full": Full model re-fitting with L-BFGS-B per gene. Re-loads original count data. Note: May produce different results for lowly expressed genes due to min_mu clamping differences. Use “stats” for consistent shrinkage behavior.
prior_scale_mode (Literal['global', 'per_comparison']) –
How to estimate the Cauchy prior scale parameter:
- "global" (default): Estimate prior scale once from all perturbations’ MLE LFCs. Faster and often more stable.
- "per_comparison": Estimate prior scale separately for each perturbation vs control comparison. Matches PyDESeq2’s behavior exactly. Use for benchmarking to demonstrate parity.
min_mu (float) – Minimum mean threshold for shrinkage likelihood evaluation. Default: 0.0 (no clamping), matching PyDESeq2’s lfc_shrink which omits min_mu entirely. PyDESeq2 uses min_mu=0.5 for NB-GLM fitting but does NOT pass min_mu to the shrinkage optimizer. This is intentional: the MLE coefficients represent the best fit, and shrinkage should evaluate the same likelihood surface.
n_jobs (int) – Number of parallel jobs for per-gene optimization (only used when method=”full”). Default -1 uses all available cores.
batch_size (int) – Number of genes per joblib batch (only used when method=”full”). Default: 128, matching PyDESeq2.
profiling (bool) – If True, enable timing and memory profiling. When enabled, stores profiling data in adata.uns[“profiling”] with fields: - shrinkage_seconds: Time for lfcShrink operation - shrinkage_peak_memory_mb: Peak memory during shrinkage - profiling_enabled: True When False (default), adata.uns[“profiling”] is set to “NA”.
memory_limit_gb (float | None) – Optional memory budget in gigabytes. When method="full", this limits the number of parallel n_jobs so that joblib workers stay within the budget. When None (default), detects available system memory via psutil.

Returns:

Updated differential expression results with shrunken LFCs. The result h5ad has: - logfoldchange: shrunken LFC values - logfoldchange_raw: original MLE LFC values (preserved) - standard_error: posterior SE reflecting shrinkage uncertainty - X: updated to shrunken LFC (effect_size)

Return type:

Examples

>>> # Fast default (recommended for production)
>>> shrunk = crispyx.de.shrink_lfc("nb_glm_result.h5ad")

>>> # Benchmark-accurate (matches PyDESeq2 exactly)
>>> shrunk = crispyx.de.shrink_lfc(
...     "nb_glm_result.h5ad",
...     method="full",
...     prior_scale_mode="per_comparison",
... )

>>> # First run NB-GLM without shrinkage
>>> result = crispyx.de.nb_glm_test(
...     "data.h5ad",
...     perturbation_column="perturbation",
...     lfc_shrinkage_type="none",  # No shrinkage during fitting
... )
>>> # Then apply shrinkage as a separate step
>>> shrunk_result = crispyx.de.shrink_lfc(result.result_path)

Negative binomial GLM

Generalized linear models utilities for differential expression.

class crispyx.glm.NBGLMResult(coef, se, dispersion, converged, n_iter, deviance, max_cooks=None)[source]

Bases: object

Result of fitting a negative binomial GLM for a single gene.

Parameters:

coef (ndarray)
se (ndarray)
dispersion (float)
converged (bool)
n_iter (int)
deviance (float)
max_cooks (float | None)

coef: ndarray

se: ndarray

dispersion: float

converged: bool

n_iter: int

deviance: float

max_cooks: float | None = None

class crispyx.glm.NBGLMBatchResult(coef, se, dispersion, converged, n_iter, deviance)[source]

Bases: object

Result of fitting NB GLM for multiple genes in a batch.

Parameters:

coef (ndarray)
se (ndarray)
dispersion (ndarray)
converged (ndarray)
n_iter (ndarray)
deviance (ndarray)

coef: ndarray

se: ndarray

dispersion: ndarray

converged: ndarray

n_iter: ndarray

deviance: ndarray

class crispyx.glm.ControlStatisticsCache(control_matrix, control_n, control_offset, beta_intercept, control_dispersion, control_xtwx_intercept, control_xtwz_intercept, control_mean_expr, control_expr_counts, pts_rest, global_size_factors=None, global_dispersion=None, global_dispersion_trend=None, global_disp_prior_var=None, dispersion_scope=None, frozen_control_W_sum=None, frozen_control_Wz_sum=None, frozen_control_mu_sum=None, frozen_control_resid_sq_sum=None, frozen_control_Y_sum=None, use_frozen_control=False)[source]

Bases: object

Cached statistics for control cells to avoid redundant computation.

When fitting independent NB-GLM models for multiple perturbations, each comparison includes the same control cells. This cache precomputes and stores control cell contributions to the IRLS normal equations, allowing them to be reused across all perturbation comparisons instead of being redundantly computed.

The cache stores: - Control cell intercept (β₀): baseline log-expression per gene - Control dispersion: estimated from control cells only - XᵀWX contribution from control cells (for intercept column) - XᵀWz contribution from control cells - Global size factors (optional): precomputed on all cells for consistency - Global dispersion (optional): precomputed MAP dispersion using all cells - Global dispersion prior variance: for MAP shrinkage

Memory optimization: We store control_matrix as dense (not mu/weights) since mu and weights change during IRLS and must be recomputed anyway. Storing the dense matrix avoids repeated .toarray() calls in workers.

This reduces IRLS complexity from O(n_perturbations × n_control × n_genes × n_iter) to O(n_control × n_genes × n_iter) for control-related computations.

Parameters:

control_matrix (ndarray)
control_n (int)
control_offset (ndarray)
beta_intercept (ndarray)
control_dispersion (ndarray)
control_xtwx_intercept (ndarray)
control_xtwz_intercept (ndarray)
control_mean_expr (ndarray)
control_expr_counts (ndarray)
pts_rest (ndarray)
global_size_factors (ndarray | None)
global_dispersion (ndarray | None)
global_dispersion_trend (ndarray | None)
global_disp_prior_var (float | None)
dispersion_scope (str | None)
frozen_control_W_sum (ndarray | None)
frozen_control_Wz_sum (ndarray | None)
frozen_control_mu_sum (ndarray | None)
frozen_control_resid_sq_sum (ndarray | None)
frozen_control_Y_sum (ndarray | None)
use_frozen_control (bool)

control_matrix: ndarray

control_n: int

control_offset: ndarray

beta_intercept: ndarray

control_dispersion: ndarray

control_xtwx_intercept: ndarray

control_xtwz_intercept: ndarray

control_mean_expr: ndarray

control_expr_counts: ndarray

pts_rest: ndarray

global_size_factors: ndarray | None = None

global_dispersion: ndarray | None = None

global_dispersion_trend: ndarray | None = None

global_disp_prior_var: float | None = None

dispersion_scope: str | None = None

frozen_control_W_sum: ndarray | None = None

frozen_control_Wz_sum: ndarray | None = None

frozen_control_mu_sum: ndarray | None = None

frozen_control_resid_sq_sum: ndarray | None = None

frozen_control_Y_sum: ndarray | None = None

use_frozen_control: bool = False

crispyx.glm.precompute_control_statistics(control_matrix, control_offset, *, max_iter=10, tol=1e-06, min_mu=0.5, dispersion_method='moments', global_size_factors=None, freeze_control=False)[source]

Precompute control cell statistics for reuse across perturbation comparisons.

This function fits a simple intercept-only model to control cells to estimate the baseline expression level (β₀) per gene. The resulting intercept, mean expression, weights, and XᵀWX/XᵀWz contributions are cached for reuse.

Parameters:

control_matrix (ndarray | csr_matrix) – Expression matrix for control cells, shape (n_control, n_genes).
control_offset (ndarray) – Log size factors for control cells, shape (n_control,).
max_iter (int) – Maximum IRLS iterations for intercept estimation.
tol (float) – Convergence tolerance.
min_mu (float) – Minimum fitted mean value.
dispersion_method (Literal['moments', 'cox-reid']) – Method for dispersion estimation.
global_size_factors (ndarray | None) – Optional precomputed size factors for all cells (n_cells_total,). When provided, stored in cache for use across all comparisons.
freeze_control (bool) – If True, compute frozen sufficient statistics (W_sum, Wz_sum, etc.) and set control_matrix to None to save memory. This reduces per-worker pickle size from ~5GB to ~1MB for large datasets. Workers can use the frozen stats with fit_batch_with_frozen_control() instead of the raw matrix. Default False.

Returns:

Cached statistics for control cells.

Return type:

crispyx.glm.precompute_control_statistics_streaming(path, control_mask, control_offset, *, max_iter=10, tol=1e-06, min_mu=0.5, global_size_factors=None, freeze_control=True, chunk_size=4096)[source]

Streaming version of precompute_control_statistics for very large control groups.

Instead of densifying the full control matrix (which can exceed 100+ GiB for large datasets), this function reads control cells from disk in chunks and accumulates sufficient statistics for the intercept-only IRLS model.

Peak memory is O(chunk_size × n_genes) instead of O(n_control × n_genes).

Parameters:

path (str | Path) – Path to the h5ad file containing the count matrix.
control_mask (np.ndarray) – Boolean mask over all cells indicating control cells, shape (n_cells,).
control_offset (np.ndarray) – Log size factors for control cells, shape (n_control,).
max_iter (int) – Maximum IRLS iterations for intercept estimation.
tol (float) – Convergence tolerance.
min_mu (float) – Minimum fitted mean value.
global_size_factors (np.ndarray | None) – Optional precomputed size factors for all cells (n_cells_total,).
freeze_control (bool) – Must be True for streaming mode (raw control_matrix is never materialised).
chunk_size (int) – Number of control cells to process per chunk. Default 4096.

Returns:

Cached statistics with frozen control sufficient statistics.

Return type:

crispyx.glm.precompute_global_dispersion(control_cache, all_cell_matrix, all_cell_offset, *, n_grid=25, min_disp=1e-08, max_disp=10.0, fit_type='parametric', fast_mode=True, max_dense_fraction=0.3, memory_limit_gb=None)[source]

Precompute global dispersion trend using all cells and update cache.

This function computes a global dispersion trend using all cells in the dataset (control + all perturbations), similar to how DESeq2/PyDESeq2 estimates dispersion from all samples. The trend is then used for MAP shrinkage in all per-perturbation comparisons.

Using global dispersion has several advantages: 1. More stable estimates (larger sample size for trend fitting) 2. ~10× faster per-perturbation fitting (skips MAP estimation) 3. More consistent results across perturbations

Memory-adaptive behavior: If the estimated memory for densifying the matrix exceeds max_dense_fraction × min(available_memory, memory_limit_gb), the function switches to chunk-wise streaming processing.

Parameters:

control_cache (ControlStatisticsCache) – Precomputed control cell statistics cache.
all_cell_matrix (ndarray | csr_matrix) – Count matrix for all cells, shape (n_cells, n_genes).
all_cell_offset (ndarray) – Log size factors for all cells, shape (n_cells,).
n_grid (int) – Number of grid points for dispersion MAP estimation.
min_disp (float) – Minimum dispersion value.
max_disp (float) – Maximum dispersion value.
fit_type (Literal['parametric', 'local', 'mean']) – Type of trend fitting (“parametric”, “local”, or “mean”).
fast_mode (bool) – If True, use simple trend shrinkage (MoM → trend) instead of expensive MAP estimation. This is ~50× faster and suitable for most use cases. If False, use full MAP estimation with grid search + refinement.
max_dense_fraction (float) – Maximum fraction of available memory to use for dense matrix. If estimated memory exceeds this, switch to streaming mode. Default is 0.3 (30% of available memory).
memory_limit_gb (float | None) – Optional explicit memory limit in GB. If provided, the effective memory budget is min(available_memory, memory_limit_gb).

Returns:

Updated cache with global_dispersion, global_dispersion_trend, and global_disp_prior_var fields populated.

Return type:

crispyx.glm.precompute_global_dispersion_from_path(path, control_cache, all_cell_offset, *, min_disp=1e-08, max_disp=10.0, fit_type='parametric', chunk_size=4096)[source]

Path-based streaming global dispersion for very large datasets.

This function estimates dispersion by streaming directly from an h5ad file, avoiding the need to load the entire matrix into memory. Uses method-of-moments estimation accumulated across chunks read from disk.

This is the preferred method for datasets that exceed available memory (e.g., Replogle-GW-k562 with ~2M cells × 8K genes = ~131 GB).

Parameters:

path (str | Path) – Path to the h5ad file containing the count matrix.
control_cache (ControlStatisticsCache) – Precomputed control cell statistics cache.
all_cell_offset (np.ndarray) – Log size factors for all cells, shape (n_cells,).
min_disp (float) – Minimum dispersion value.
max_disp (float) – Maximum dispersion value.
fit_type (Literal['parametric', 'local', 'mean']) – Type of trend fitting (“parametric”, “local”, or “mean”).
chunk_size (int) – Number of cells to process per chunk. Larger chunks are faster but use more memory. Default 4096 balances speed and memory.

Returns:

Updated cache with global_dispersion, global_dispersion_trend, and global_disp_prior_var fields populated.

Return type:

class crispyx.glm.NBGLMFitter(design, *, offset=None, dispersion=None, max_iter=50, tol=1e-06, poisson_init_iter=20, ridge_penalty=1e-06, min_mu=0.5, min_total_count=1.0, compute_cooks=False, dispersion_method='cox-reid')[source]

Bases: object

L-BFGS-B solver for negative binomial GLMs.

Parameters:

design (ArrayLike) – The design matrix with shape (n_samples, n_features).
offset (ArrayLike | None) – Optional log-scale offset (e.g. library size) per sample.
dispersion (float | None) – Optional dispersion (alpha) for the negative binomial. If None the dispersion is re-estimated at each iteration using a method-of-moments update similar to the one used in statsmodels.
max_iter (int) – Maximum number of outer iterations for alternating optimization.
tol (float) – Absolute tolerance on the coefficient updates for convergence.
poisson_init_iter (int) – Maximum number of iterations for the Poisson initialisation stage. If set to 0 the Poisson warm start is skipped and coefficients are initialised at zero.
ridge_penalty (float) – Small diagonal ridge penalty added to the weighted normal equations to improve numerical stability. This does not change the estimator when the system is well conditioned but prevents failures when the Hessian is nearly singular.
min_mu (float) – Lower bound on the fitted mean to avoid issues with extremely small predicted counts for lowly expressed genes.
min_total_count (float) – Genes with a total count below this threshold are not fitted (the resulting NBGLMResult will report converged=False).
dispersion_method (Literal['moments', 'cox-reid']) –
Method for estimating dispersion when dispersion is None.
- "moments": Method-of-moments (fast but less accurate).
- "cox-reid": Cox-Reid adjusted profile likelihood (slower but more accurate, similar to DESeq2).
compute_cooks (bool)

fit_gene(counts)[source]

Fit NB GLM for a single gene using L-BFGS-B optimization.

Parameters:: counts (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes])
Return type:: NBGLMResult

fit_matrix(matrix, *, batch_size=None)[source]

Fit NB GLM for every gene (column) in a count matrix.

Parameters:

matrix (array-like of shape (n_samples, n_genes)) – Raw count matrix. Sparse (CSC) and dense formats are accepted.
batch_size (int or None, optional) – Number of genes to densify at once when matrix is sparse. None processes all genes in one batch.

Returns:

One result per gene, in column order.

Return type:

list of NBGLMResult

estimate_dispersion_cox_reid(y, mu, weights, *, initial_alpha=0.1, bounds=(1e-08, 1000.0))[source]

Estimate dispersion using Cox-Reid adjusted profile likelihood.

This method maximizes the adjusted profile log-likelihood for the dispersion parameter, which includes a bias correction term based on the Cox-Reid adjustment. This approach is similar to DESeq2’s dispersion estimation.

Parameters:

y (ndarray) – Count vector of shape (n_samples,).
mu (ndarray) – Current fitted mean values of shape (n_samples,).
weights (ndarray) – IRLS weights from the current fit.
initial_alpha (float) – Starting value for the optimization.
bounds (tuple[float, float]) – Lower and upper bounds for the dispersion parameter.

Returns:

Estimated dispersion parameter.

Return type:

float

crispyx.glm.build_design_matrix(obs_frame, *, covariate_columns, perturbation_indicator, intercept=True)[source]

Construct a design matrix from covariates and a perturbation indicator.

Parameters:

obs_frame – Pandas DataFrame (preferred) or structured numpy array containing covariate columns.
covariate_columns (Sequence[str]) – Columns that should be included in the design matrix. Categorical columns are expanded using one-hot encoding (dropping the first level).
perturbation_indicator (ndarray) – Binary array of shape (n_samples,) marking perturbed cells.
intercept (bool) – Whether to prepend an intercept column to the design matrix.

Returns:

design – The numeric design matrix as a numpy.ndarray of float64.
column_names – The column names corresponding to the design matrix.

Return type:

tuple[ndarray, list[str]]

crispyx.glm.fit_dispersion_trend(means, dispersions, *, min_mean=0.5, fit_type='parametric', n_iter=10)[source]

Fit a smooth mean-dispersion trend using DESeq2/PyDESeq2-style Gamma GLM.

The parametric trend models dispersion as dispersion = asymptDisp + extraPois / mean.

This matches PyDESeq2’s fitDispersionTrend which uses iteratively reweighted least squares (IRLS) with a Gamma family and log link, fitting E[disp] = a0 + a1 / mean.

Outliers are iteratively removed based on prediction ratio bounds.

Parameters:

means (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Mean expression values per gene (normalized counts).
dispersions (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Raw genewise dispersion estimates.
min_mean (float) – Minimum mean value for fitting (genes below this are excluded).
fit_type (Literal['parametric', 'local', 'mean']) – Type of trend fitting: - “parametric”: DESeq2/PyDESeq2-style Gamma GLM (recommended) - “local”: Weighted local regression (LOWESS-like) - “mean”: Simple median (fallback)
n_iter (int) – Number of iterations for IRLS fitting.

Returns:

Fitted trend values for each gene.

Return type:

np.ndarray

crispyx.glm.shrink_dispersions(raw, trend, *, prior_df=None, min_prior_df=1.0, max_prior_df=100.0, outlier_sigma=2.0, n_iter=5)[source]

Shrink dispersions toward fitted trend using empirical Bayes.

This implements a DESeq2/PyDESeq2-style empirical Bayes shrinkage where the prior variance is estimated from the distribution of log-dispersion residuals around the trend using an iterative trimmed variance estimator.

Genes with dispersions more than outlier_sigma standard deviations above the trend keep their MLE (not shrunken), matching PyDESeq2’s outlier handling.

The shrinkage formula is:: log(shrunk) = (prior_df * log(trend) + log(raw)) / (prior_df + 1)

This is equivalent to a posterior mean estimate under a log-normal prior.

Parameters:

raw (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Raw MLE dispersion estimates.
trend (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Fitted mean-dispersion trend values.
prior_df (float | None) – Prior degrees of freedom controlling shrinkage strength. If None, estimated empirically from the data using iterative trimming.
min_prior_df (float) – Minimum allowed prior degrees of freedom.
max_prior_df (float) – Maximum allowed prior degrees of freedom.
outlier_sigma (float) – Number of standard deviations above trend beyond which genes keep their MLE dispersion (not shrunken). Set to inf to disable.
n_iter (int)

Returns:

Shrunken dispersion estimates.

Return type:

np.ndarray

crispyx.glm.estimate_dispersion_map(Y, mu, trend, *, prior_var=None, min_disp=1e-08, max_disp=10.0, n_grid=25, refine=True, n_jobs=-1)[source]

Estimate MAP dispersion using vectorized grid search + optional refinement.

This implements PyDESeq2-style MAP estimation where the dispersion is estimated by maximizing log L(Y | mu, alpha) + log prior(alpha | trend, prior_var).

The prior is log-normal: log(alpha) ~ N(log(trend), prior_var).

Optimization (v5): Uses fused Numba kernel that combines grid search with Brent’s method refinement in a single parallel pass over genes. This eliminates joblib process spawning overhead, achieving ~2-3× speedup over the previous joblib-based refinement while maintaining identical accuracy to scipy’s minimize_scalar.

Default is n_grid=25, refine=True which provides optimal balance of speed and accuracy. The Brent refinement makes grid size largely irrelevant for accuracy (all grid sizes 15-50 achieve perfect correlation).

Parameters:

Y (ndarray) – Count matrix of shape (n_cells, n_genes).
mu (ndarray) – Fitted mean matrix of shape (n_cells, n_genes).
trend (ndarray) – Dispersion trend values of shape (n_genes,).
prior_var (float | None) – Variance of the log-normal prior. If None, estimated from data using the variance of log-dispersion residuals around trend.
min_disp (float) – Minimum allowed dispersion value.
max_disp (float) – Maximum allowed dispersion value.
n_grid (int) – Number of grid points for initial search. More points = better initial estimate but slower grid search. Default is 50 for good accuracy (96% Top-100 overlap with PyDESeq2 without refinement).
refine (bool) – If True, refine the grid search result using Brent’s method. Default is True for best accuracy (99% Top-100 overlap with PyDESeq2). Set to False for ~2× speedup if slight accuracy loss is acceptable.
n_jobs (int) – Number of parallel jobs for refinement. -1 uses all cores.

Returns:

MAP dispersion estimates of shape (n_genes,).

Return type:

np.ndarray

crispyx.glm.shrink_lfc_apeglm(counts, design_matrix, size_factors, dispersion, mle_coef, mle_se, *, shrink_index=1, prior_scale=None, prior_no_shrink_scale=15.0, max_iter=100, tol=1e-06, n_jobs=-1, batch_size=128, min_mu=0.0)[source]

Apply apeGLM LFC shrinkage using Cauchy prior (PyDESeq2-compatible).

This implements the apeGLM (approximate posterior estimation for GLM) shrinkage method used by DESeq2/PyDESeq2. The method re-fits the NB-GLM model with a Cauchy prior penalty on the LFC coefficient.

Key features matching PyDESeq2: - L-BFGS-B optimization with analytical gradients - Grid search fallback over [-5, 5] with 50 points when optimization fails - Parallel per-gene optimization using joblib with batch_size=128 - Proper NB likelihood formulation with numerical stability - min_mu clamping for consistency with NB-GLM fitting (if min_mu > 0)

Parameters:

counts (ndarray) – Raw count matrix (n_cells, n_genes).
design_matrix (ndarray) – Design matrix (n_cells, n_params).
size_factors (ndarray) – Size factors for each cell (n_cells,).
dispersion (ndarray) – Gene-wise dispersion estimates (n_genes,).
mle_coef (ndarray) – MLE coefficient matrix (n_params, n_genes) for warm-starting.
mle_se (ndarray) – Standard errors of MLE LFC estimates (n_genes,).
shrink_index (int) – Index of the coefficient to shrink (default: 1, the LFC coefficient).
prior_scale (float | None) – Scale parameter for Cauchy prior. If None, estimated globally from the MLE LFC distribution (matching PyDESeq2’s approach).
prior_no_shrink_scale (float) – Scale for normal prior on non-shrunk coefficients (default: 15.0).
max_iter (int) – Maximum iterations for L-BFGS-B optimization.
tol (float) – Convergence tolerance.
n_jobs (int) – Number of parallel jobs. Default -1 uses all available cores.
batch_size (int) – Number of genes per joblib batch (default: 128, matching PyDESeq2).
min_mu (float) – Minimum mean value for mu clamping in NB log-likelihood. If > 0, mu is clamped to be at least min_mu. This should match the min_mu used during NB-GLM fitting to ensure the stored coefficients are consistent with the likelihood surface (default: 0.0 = no clamping).

Returns:

shrunk_coef – Shrunken coefficient matrix (n_params, n_genes).
shrunk_se – Approximate standard errors from inverse Hessian (n_genes,).
converged – Boolean array indicating convergence for each gene (n_genes,).

Return type:

tuple[ndarray, ndarray, ndarray]

crispyx.glm.shrink_lfc_apeglm_from_stats(mle_lfc, mle_se, xtwx_diag=None, *, base_mean=None, prior_scale=None, max_iter=100, tol=1e-08, use_gene_specific_prior=True, hybrid_fallback=True, hybrid_mle_se_threshold=3.0, hybrid_base_mean_threshold=10.0)[source]

Apply apeGLM-style shrinkage using pre-computed MLE statistics.

This is a memory-efficient, fully vectorized version of apeGLM shrinkage that uses pre-computed MLE coefficients and standard errors without requiring access to the full count matrix. It applies a Cauchy prior and finds the MAP estimate using a damped Newton-Raphson optimization across all genes simultaneously.

Accuracy improvements (v2):

Gene-specific prior scales based on expression level (sqrt(base_mean)).
Moment-corrected SE using observed Fisher information approximation.
Hybrid fallback: marks genes with abs(MLE)/SE > threshold or low expression for full NB-GLM re-fitting to achieve Eff rho >= 0.98.

The posterior mode is found by solving beta_MAP = argmin { (beta - mle_lfc)^2 / (2 * se^2) + log(1 + (beta/s)^2) } which is the negative log posterior with a Cauchy prior.

This implementation uses a robust damped Newton method with Hessian clamping to ensure convergence even when the standard Hessian becomes negative (which can happen when abs(beta) > prior_scale). This matches PyDESeq2’s behavior which uses L-BFGS-B for robustness.

Parameters:

mle_lfc (ndarray) – MLE log-fold change estimates (n_genes,).
mle_se (ndarray) – Standard errors of MLE estimates (n_genes,).
xtwx_diag (ndarray | None) – Diagonal elements of X’WX for Fisher information (n_genes,). If provided, used for posterior SE computation. If None, uses approximation based on MLE SE.
base_mean (ndarray | None) – Mean normalized expression per gene (n_genes,). Used for gene-specific prior scaling. If None, uniform prior scale is used.
prior_scale (float | None) – Base scale parameter for Cauchy prior. If None, estimated adaptively. When use_gene_specific_prior=True, this is scaled per-gene.
max_iter (int) – Maximum iterations for optimization.
tol (float) – Convergence tolerance.
use_gene_specific_prior (bool) – If True, scale prior_scale by 1/sqrt(base_mean) per gene. This accounts for the fact that lowly-expressed genes have higher variance and should be shrunk more aggressively. Default True.
hybrid_fallback (bool) – If True, return a mask indicating genes that need full NB-GLM re-fitting due to problematic statistics (large abs(MLE)/SE or low expression).
hybrid_mle_se_threshold (float) – Genes with abs(MLE)/SE > this threshold are marked for full re-fitting.
hybrid_base_mean_threshold (float) – Genes with base_mean < this threshold are marked for full re-fitting.

Returns:

shrunk_lfc – Shrunken log-fold change estimates (n_genes,).
shrunk_se – Posterior standard errors (n_genes,).
converged – Boolean array indicating convergence for each gene (n_genes,).
needs_full_refit – Boolean array indicating genes that need full NB-GLM re-fitting (n_genes,). Only populated when hybrid_fallback=True, otherwise all False.

Return type:

tuple[ndarray, ndarray, ndarray, ndarray]

crispyx.glm.compute_cooks_distance_batch(Y, mu, dispersion, n_params=2)[source]

Compute Cook’s distance for each observation in a batch of genes.

Cook’s distance measures the influence of each observation on the fitted model. Large values indicate potential outliers that disproportionately affect the estimates.

For GLMs, Cook’s distance is computed as:: D_i = (r_i^2 / p) * (h_ii / (1 - h_ii)^2)

where r_i is the Pearson residual, p is the number of parameters, and h_ii is the leverage (diagonal of the hat matrix).

Parameters:

Y (ndarray) – Count matrix of shape (n_cells, n_genes).
mu (ndarray) – Fitted mean matrix of shape (n_cells, n_genes).
dispersion (ndarray) – Dispersion estimates of shape (n_genes,).
n_params (int) – Number of model parameters (default 2 for intercept + treatment).

Returns:

Cook’s distance matrix of shape (n_cells, n_genes).

Return type:

np.ndarray

crispyx.glm.filter_outliers_cooks(Y, mu, dispersion, *, n_params=2, threshold_quantile=0.99)[source]

Identify and replace outlier counts based on Cook’s distance.

Following DESeq2’s approach: 1. Compute Cook’s distance for each observation 2. Identify outliers where Cook’s D > F(threshold_quantile, p, n-p) 3. Replace outlier counts with trimmed mean from non-outlier samples

Parameters:

Y (ndarray) – Count matrix of shape (n_cells, n_genes).
mu (ndarray) – Fitted mean matrix of shape (n_cells, n_genes).
dispersion (ndarray) – Dispersion estimates of shape (n_genes,).
n_params (int) – Number of model parameters.
threshold_quantile (float) – Quantile of F distribution for outlier threshold.

Returns:

Y_filtered: Count matrix with outliers replaced
outlier_mask: Boolean matrix indicating outliers (n_cells, n_genes)

Return type:

Tuple[np.ndarray, np.ndarray]

crispyx.glm.estimate_covariate_effects_streaming(backed_adata, *, obs_df, perturbation_labels, control_label, covariate_columns, size_factors, chunk_size=2048, poisson_iter=10, tol=1e-06, return_intercept=False)[source]

Estimate global intercept and covariate effects using control cells only.

This function fits a Poisson regression using only control cells to estimate: - An intercept (baseline expression for control cells) - Covariate effects (if any covariates are specified)

By using only control cells, the intercept represents the true control baseline that can then be used as an offset in per-perturbation fitting. This ensures that perturbation effects are properly estimated as deviations from control.

For the intercept-only case (no covariates), the closed-form MLE is used:: intercept = log(sum(Y) / sum(size_factors))

When covariates are present, IRLS is used with proper per-gene weighting.

Parameters:

backed_adata – Backed AnnData object opened in read mode.
obs_df (pd.DataFrame) – Full obs DataFrame with all cells (already loaded).
perturbation_labels (np.ndarray) – Array of perturbation labels for all cells.
control_label (str) – The label identifying control cells.
covariate_columns (Sequence[str]) – List of covariate column names to include.
size_factors (np.ndarray) – Per-cell size factors (length n_cells).
chunk_size (int) – Number of cells to process per chunk.
poisson_iter (int) – Number of Poisson IRLS iterations.
tol (float) – Convergence tolerance.
return_intercept (bool) – If True, also return the global intercept coefficients. This is useful for joint fitting where the intercept should be shared across all perturbation comparisons.

Returns:

If return_intercept is False:: Covariate effects of shape (n_covariates, n_genes). These are the log-scale regression coefficients for each covariate.
If return_intercept is True:: Tuple of (covariate_effects, intercept) where: - covariate_effects has shape (n_covariates, n_genes) - intercept has shape (n_genes,) representing the control baseline

Return type:

np.ndarray or tuple

crispyx.glm.estimate_global_dispersion_streaming(backed_adata, *, obs_df, perturbation_labels, control_label, covariate_columns, size_factors, beta_intercept, beta_cov=None, beta_perturbation=None, chunk_size=2048, dispersion_method='cox-reid', poisson_iter=10, tol=1e-06)[source]

Estimate global per-gene dispersion using all cells via streaming.

This function streams through all cells to estimate dispersion for each gene using a full design matrix. The dispersion is estimated using all conditions together, which provides more stable estimates than per-perturbation estimation, similar to how PyDESeq2 estimates dispersion from all samples.

The function uses the pre-estimated intercept, perturbation effects, and covariate effects to compute fitted values (mu), then estimates dispersion from the residuals using method-of-moments.

Parameters:

backed_adata – Backed AnnData object opened in read mode.
obs_df (pd.DataFrame) – Full obs DataFrame with all cells (already loaded).
perturbation_labels (np.ndarray) – Array of perturbation labels for all cells.
control_label (str) – The label identifying control cells (used as reference level).
covariate_columns (Sequence[str]) – List of covariate column names to include.
size_factors (np.ndarray) – Per-cell size factors (length n_cells).
beta_intercept (np.ndarray) – Pre-estimated global intercept coefficients, shape (n_genes,).
beta_cov (np.ndarray | None) – Pre-estimated covariate effects, shape (n_covariates, n_genes). If None or empty, no covariate adjustment is applied.
beta_perturbation (np.ndarray | None) – Pre-estimated perturbation effects, shape (n_perturbations, n_genes). If None, perturbation effects are estimated via Poisson IRLS.
chunk_size (int) – Number of cells to process per chunk.
dispersion_method (Literal['moments', 'cox-reid']) – Method for dispersion estimation: - “moments”: Method-of-moments (fast but less accurate) - “cox-reid”: Cox-Reid adjusted profile likelihood (more accurate)
poisson_iter (int) – Number of Poisson IRLS iterations for refining mu estimates. Only used if beta_perturbation is None.
tol (float) – Convergence tolerance.

Returns:

Dispersion estimates of shape (n_genes,). These are the alpha values for the negative binomial distribution: Var(Y) = mu + alpha * mu^2.

Return type:

np.ndarray

class crispyx.glm.NBGLMBatchFitter(design, *, offset=None, max_iter=25, tol=1e-06, poisson_init_iter=5, dispersion_method='cox-reid', min_mu=0.5, min_total_count=1.0, ridge_penalty=1e-06)[source]

Bases: object

Vectorized batch fitter for NB GLM across multiple genes.

This fitter processes all genes simultaneously using vectorized operations, providing significant speedup compared to per-gene fitting. It uses IRLS (Iteratively Reweighted Least Squares) with batched matrix operations.

Parameters:

design (ArrayLike) – Design matrix with shape (n_samples, n_features).
offset (ArrayLike | None) – Log-scale offset (e.g., log size factors) per sample.
max_iter (int) – Maximum IRLS iterations.
tol (float) – Convergence tolerance on coefficient updates.
poisson_init_iter (int) – Initial Poisson iterations for warm start.
dispersion_method (Literal['moments', 'cox-reid']) – Method for dispersion estimation: “moments” or “cox-reid”.
min_mu (float) – Minimum fitted mean to avoid numerical issues.
min_total_count (float) – Minimum total count for a gene to be fitted.
ridge_penalty (float)

fit_batch(counts, gene_batch_size='auto', use_numba=True)[source]

Fit NB GLM for all genes in the count matrix.

Memory-optimized implementation with optional gene batching and Numba acceleration for the 2-parameter model (intercept + perturbation).

Parameters:

counts (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Count matrix of shape (n_samples, n_genes).
gene_batch_size (int | Literal['auto'] | None) – Number of genes to process per batch. If “auto”, calculated based on memory constraints (~100 MB per batch). If None, process all genes at once (legacy behavior).
use_numba (bool) – Whether to use Numba-accelerated IRLS for 2-feature models. Default True for better memory efficiency.

Returns:

Results for all genes with vectorized arrays.

Return type:

fit_batch_with_covariate_offset(counts, covariate_offset)[source]

Fit NB GLM with pre-computed covariate offset.

This method is used in the joint fitting approach where covariate effects are estimated globally and then held fixed during per-perturbation fitting. The covariate offset is subtracted from the working response during IRLS.

Parameters:

counts (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Count matrix of shape (n_samples, n_genes).
covariate_offset (ndarray) – Pre-computed covariate offset of shape (n_samples, n_genes), representing X_cov @ beta_cov for the covariate portion of the design.

Returns:

Results for all genes with vectorized arrays.

Return type:

fit_batch_with_joint_offsets(counts, *, intercept_offset=None, covariate_offset=None, fixed_dispersion=None)[source]

Fit NB GLM with pre-computed intercept and covariate offsets.

This method is used in the joint fitting approach where the global intercept (and optionally covariates and dispersion) are estimated using all cells and then held fixed during per-perturbation fitting.

The design matrix should NOT include an intercept column when using intercept_offset, as the global intercept is added as an offset.

Parameters:

counts (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) – Count matrix of shape (n_samples, n_genes).
intercept_offset (ndarray | None) – Pre-computed global intercept of shape (n_genes,). If provided, the design matrix should not include an intercept column.
covariate_offset (ndarray | None) – Pre-computed covariate offset of shape (n_samples, n_genes), representing X_cov @ beta_cov for the covariate portion.
fixed_dispersion (ndarray | None) – If provided, use these dispersion values instead of estimating. Shape (n_genes,).

Returns:

Results for all genes with vectorized arrays.

Return type:

fit_batch_with_control_cache(perturbation_matrix, perturbation_offset, control_cache, *, perturbation_indicator, valid_mask=None)[source]

Fit NB GLM using precomputed control cell statistics.

This method provides significant speedup by reusing control cell contributions (XᵀWX, XᵀWz) from a precomputed cache instead of redundantly computing them for each perturbation comparison.

The design matrix is [1, perturbation_indicator] where: - Control cells have indicator = 0 - Perturbation cells have indicator = 1

The control contribution to XᵀWX and XᵀWz is taken from the cache, and only perturbation cell contributions are computed fresh.

Parameters:

perturbation_matrix (ndarray | csr_matrix) – Expression matrix for perturbation cells only, shape (n_pert, n_genes).
perturbation_offset (ndarray) – Log size factors for perturbation cells, shape (n_pert,).
control_cache (ControlStatisticsCache) – Precomputed control cell statistics from precompute_control_statistics.
perturbation_indicator (ndarray) – Binary indicator for perturbation cells in the combined design. Should be shape (n_control + n_pert,) with 0 for control, 1 for perturbation.
valid_mask (ndarray | None) – Optional boolean mask for genes to fit, shape (n_genes,).

Returns:

Fitted coefficients and statistics.

Return type:

fit_batch_with_frozen_control(perturbation_matrix, perturbation_offset, control_cache, *, valid_mask=None)[source]

Fit NB GLM using frozen control sufficient statistics (memory-efficient).

This method uses precomputed sufficient statistics from control cells instead of the raw control_matrix, reducing per-worker memory from ~5GB to ~1MB for large datasets.

Key differences from fit_batch_with_control_cache: - β₀ (intercept) is FROZEN to the value estimated from control cells - Only β₁ (perturbation effect) is estimated - Control contributions (W_sum, Wz_sum) are pre-computed constants - No access to raw control_matrix (control_cache.control_matrix is None)

The design matrix is [1, perturbation_indicator] where: - Control cells have indicator = 0 (contributions are frozen) - Perturbation cells have indicator = 1

Parameters:

perturbation_matrix (ndarray | csr_matrix) – Expression matrix for perturbation cells only, shape (n_pert, n_genes).
perturbation_offset (ndarray) – Log size factors for perturbation cells, shape (n_pert,).
control_cache (ControlStatisticsCache) – Precomputed control cell statistics with use_frozen_control=True. Must have frozen_control_W_sum and frozen_control_Wz_sum set.
valid_mask (ndarray | None) – Optional boolean mask for genes to fit, shape (n_genes,).

Returns:

Fitted coefficients and statistics.

Return type: