dire_rapids.metrics module

Performance metrics for dimensionality reduction evaluation.

This module provides GPU-accelerated metrics using RAPIDS cuML for evaluating the quality of dimensionality reduction embeddings, including:

  • Distortion metrics (stress)

  • Context preservation metrics (SVM, kNN classification)

  • Topological metrics (persistent homology, Betti curves)

The module supports multiple backends for persistence computation:

  • giotto-ph (fastest CPU, multi-threaded)

  • ripser++ (GPU-accelerated)

  • ripser (CPU fallback)

dire_rapids.metrics.get_available_persistence_backends()

Get list of available persistence computation backends.

Returns:

Dictionary mapping backend names to availability status

Return type:

dict
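
A rough sketch of how such an availability map could be built, assuming a backend counts as available when its Python package can be located. The helper name probe_persistence_backends and the import names gph, ripserplusplus, and ripser are assumptions made for illustration; this page does not say how dire_rapids performs the check.

```python
from importlib.util import find_spec

# Assumed import names for each backend (not confirmed by this page).
BACKEND_MODULES = {
    "giotto-ph": "gph",
    "ripser++": "ripserplusplus",
    "ripser": "ripser",
}


def probe_persistence_backends():
    """Return {backend_name: bool}, True when the module can be located."""
    return {
        name: find_spec(module) is not None
        for name, module in BACKEND_MODULES.items()
    }


availability = probe_persistence_backends()
print(availability)
```

Using find_spec only looks the package up and does not import it, so the probe stays cheap even for GPU-backed packages.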

dire_rapids.metrics.set_persistence_backend(backend)

Set the persistence computation backend.

Parameters:

backend (str or None) – Backend to use: ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto-selection

Raises:

ValueError – If specified backend is not available

dire_rapids.metrics.get_persistence_backend()

Get the current persistence backend (with auto-selection if None).

Returns:

Name of the selected backend

Return type:

str

Raises:

RuntimeError – If no persistence backend is available

dire_rapids.metrics.welford_update_gpu(count, mean, M2, new_value, finite_threshold=1000000000000.0)

GPU-accelerated Welford’s algorithm update step.

Parameters:
  • count (cupy.ndarray) – Running count of valid values

  • mean (cupy.ndarray) – Running mean

  • M2 (cupy.ndarray) – Running sum of squared differences

  • new_value (cupy.ndarray) – New values to incorporate

  • finite_threshold (float) – Maximum magnitude for inclusion

Returns:

Updated (count, mean, M2)

Return type:

tuple

dire_rapids.metrics.welford_finalize_gpu(count, mean, M2)

Finalize Welford’s algorithm to compute mean and std.

Parameters:
  • count (cupy.ndarray) – Total count of valid values

  • mean (cupy.ndarray) – Computed mean

  • M2 (cupy.ndarray) – Sum of squared differences

Returns:

(mean, std)

Return type:

tuple

dire_rapids.metrics.welford_gpu(data)

GPU-accelerated computation of mean and std using Welford’s algorithm.

Parameters:

data (cupy.ndarray) – Input data

Returns:

(mean, std)

Return type:

tuple
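
The update and finalize steps of Welford's algorithm can be illustrated with a plain-Python, scalar sketch. It is not the CuPy implementation documented here. Two details are assumptions: values that are non-finite or whose magnitude reaches finite_threshold are skipped, and the standard deviation is the population one (M2 divided by count).

```python
import math


def welford_update(count, mean, m2, new_value, finite_threshold=1e12):
    """Fold one value into the running (count, mean, M2) triple."""
    # Assumption: non-finite or very large values are left out.
    if not math.isfinite(new_value) or abs(new_value) >= finite_threshold:
        return count, mean, m2
    count += 1
    delta = new_value - mean
    mean += delta / count
    # Uses the old and the new mean, which keeps the update numerically stable.
    m2 += delta * (new_value - mean)
    return count, mean, m2


def welford_finalize(count, mean, m2):
    """Return (mean, std); population std (divide by count) is assumed."""
    if count == 0:
        return float("nan"), float("nan")
    return mean, math.sqrt(m2 / count)


state = (0, 0.0, 0.0)
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, float("inf")]:
    state = welford_update(*state, x)
mean, std = welford_finalize(*state)
print(mean, std)
```

The infinite value is skipped, so eight values contribute; their mean is 5 and their population standard deviation is 2, up to floating-point rounding.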

dire_rapids.metrics.threshold_subsample_gpu(data, layout, labels=None, threshold=0.5, random_state=42)

GPU-accelerated Bernoulli subsampling of data.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like, optional) – Data labels

  • threshold (float) – Probability of keeping each sample (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

Returns:

Subsampled arrays

Return type:

tuple

Raises:

ValueError – If threshold is not between 0.0 and 1.0
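
Bernoulli subsampling keeps each sample independently with probability threshold and applies the same keep/drop decision to every array. The following CPU-only sketch uses the standard library. The name bernoulli_subsample and the shape of the returned tuple (two arrays without labels, three with) are assumptions, not the documented behavior.

```python
import random


def bernoulli_subsample(data, layout, labels=None, threshold=0.5, random_state=42):
    """Keep each row independently with probability `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    rng = random.Random(random_state)
    # One draw per sample; the same mask is applied to every array.
    keep = [rng.random() < threshold for _ in range(len(data))]

    def pick(rows):
        return [row for row, kept in zip(rows, keep) if kept]

    if labels is None:
        return pick(data), pick(layout)
    return pick(data), pick(layout), pick(labels)


data = [[float(i), float(i) ** 2, 1.0] for i in range(1000)]
layout = [[float(i), 0.0] for i in range(1000)]
sub_data, sub_layout = bernoulli_subsample(data, layout, threshold=0.5)
print(len(sub_data), "of", len(data), "rows kept")
```

Because the decision is random per row, the number of kept rows varies around threshold times the input size rather than matching it exactly.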

dire_rapids.metrics.make_knn_graph_gpu(data, n_neighbors, batch_size=50000)

GPU-accelerated kNN graph construction using cuML.

Parameters:
  • data (array-like) – Data points (n_samples, n_features)

  • n_neighbors (int) – Number of nearest neighbors

  • batch_size (int, optional) – Batch size for querying neighbors (queries in batches against full dataset). Default 50000 balances GPU memory and performance. Set to None to query all at once.

Returns:

(distances, indices) arrays of shape (n_samples, n_neighbors+1)

Return type:

tuple
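
To make the batching and the n_neighbors+1 output shape concrete, here is a brute-force sketch in plain Python. It stands in for the cuML-based search and is only practical for tiny inputs. The name knn_graph_bruteforce is hypothetical, and the reading that the extra column holds the query point itself (at distance zero) is an assumption.

```python
import math


def knn_graph_bruteforce(data, n_neighbors, batch_size=None):
    """Return (distances, indices), each n_samples x (n_neighbors + 1)."""
    n = len(data)
    step = batch_size or max(n, 1)  # None means one batch over everything
    distances, indices = [], []
    # Queries are processed in batches; each batch searches the full dataset.
    for start in range(0, n, step):
        for query in data[start:start + step]:
            ranked = sorted(
                (math.dist(query, point), j) for j, point in enumerate(data)
            )[: n_neighbors + 1]
            distances.append([d for d, _ in ranked])
            indices.append([j for _, j in ranked])
    return distances, indices


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
dists, idx = knn_graph_bruteforce(points, n_neighbors=2, batch_size=3)
print(idx)
```

Batching changes only how many queries are handled at a time; the result is the same as a single pass over all points.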

dire_rapids.metrics.make_knn_graph_cpu(data, n_neighbors, batch_size=10000)

CPU fallback for kNN graph construction.

Parameters:
  • data (array-like) – Data points

  • n_neighbors (int) – Number of nearest neighbors

  • batch_size (int, optional) – Batch size for querying neighbors (queries in batches against full dataset). Default 10000 provides good balance between memory and performance. Set to None to query all at once.

Returns:

(distances, indices) arrays

Return type:

tuple

dire_rapids.metrics.compute_stress(data, layout, n_neighbors, eps=1e-06, use_gpu=True)

Compute normalized stress (distortion) of an embedding.

This metric measures how well distances are preserved between the high-dimensional data and low-dimensional layout.

Parameters:
  • data (array-like) – High-dimensional data (n_samples, n_features)

  • layout (array-like) – Low-dimensional embedding (n_samples, n_components)

  • n_neighbors (int) – Number of nearest neighbors to consider

  • eps (float) – Small constant to prevent division by zero

  • use_gpu (bool) – Whether to use GPU acceleration

Returns:

Normalized stress value

Return type:

float
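
This page does not give the exact formula behind compute_stress. As an illustration of the general idea, the sketch below evaluates one common definition, a Kruskal-style stress restricted to each point's nearest neighbors in the high-dimensional data. The function name and the formula are assumptions and may differ from what the library computes.

```python
import math


def knn_normalized_stress(data, layout, n_neighbors, eps=1e-6):
    """Kruskal-style stress over each point's nearest neighbors in `data`."""
    num = den = 0.0
    for i, point in enumerate(data):
        # Nearest neighbors of point i in the high-dimensional space (self excluded).
        ranked = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(data) if j != i
        )
        for d_hd, j in ranked[:n_neighbors]:
            d_ld = math.dist(layout[i], layout[j])
            num += (d_hd - d_ld) ** 2
            den += d_hd ** 2
    # eps guards against division by zero when all distances vanish.
    return math.sqrt(num / (den + eps))


data = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
        [1.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
layout_exact = [p[:2] for p in data]                        # distances preserved
layout_scaled = [[2 * x, 2 * y] for x, y in layout_exact]   # distances doubled
stress_exact = knn_normalized_stress(data, layout_exact, n_neighbors=2)
stress_scaled = knn_normalized_stress(data, layout_scaled, n_neighbors=2)
print(stress_exact, stress_scaled)
```

Under this definition a layout that preserves neighbor distances scores close to 0, while doubling every distance gives a value close to 1.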

dire_rapids.metrics.compute_neighbor_score(data, layout, n_neighbors, use_gpu=True)

Compute neighborhood preservation score.

Measures how well k-nearest neighbor relationships are preserved from high-dimensional to low-dimensional space.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • n_neighbors (int) – Number of neighbors to consider

  • use_gpu (bool) – Whether to use GPU acceleration

Returns:

[mean_score, std_score]

Return type:

list
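
One way to read a neighborhood preservation score is as the fraction of each point's k nearest neighbors that is shared between the two spaces, summarized by its mean and standard deviation. The plain-Python sketch below follows that reading. The helper names and the exact scoring rule are assumptions, not the library's implementation.

```python
import math


def knn_index_sets(points, k):
    """Index sets of the k nearest neighbors of each point (self excluded)."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append({j for _, j in ranked[:k]})
    return result


def neighbor_preservation(data, layout, n_neighbors):
    """Return [mean, std] of the per-point fraction of shared neighbors."""
    hd = knn_index_sets(data, n_neighbors)
    ld = knn_index_sets(layout, n_neighbors)
    scores = [len(a & b) / n_neighbors for a, b in zip(hd, ld)]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [mean, std]


data = [[float(i), 0.0, 0.0] for i in range(6)]
faithful = [[float(i)] for i in range(6)]                 # same ordering on a line
scrambled = [[0.0], [3.0], [1.0], [4.0], [2.0], [5.0]]    # neighbors mixed up
score_faithful = neighbor_preservation(data, faithful, n_neighbors=2)
score_scrambled = neighbor_preservation(data, scrambled, n_neighbors=2)
print(score_faithful, score_scrambled)
```

The faithful layout keeps every neighbor set intact, while the scrambled one loses some neighbors and so scores lower.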

dire_rapids.metrics.compute_local_metrics(data, layout, n_neighbors, subsample_threshold=1.0, random_state=42, use_gpu=True)

Compute local quality metrics (stress and neighborhood preservation).

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • n_neighbors (int) – Number of neighbors for kNN graph

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0, default 1.0 = no subsampling)

  • random_state (int) – Random seed for subsampling

  • use_gpu (bool) – Whether to use GPU acceleration

Returns:

Dictionary containing ‘stress’ and ‘neighbor’ metrics

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_svm_accuracy(X, y, test_size=0.3, reg_param=1.0, max_iter=1000, random_state=42, use_gpu=True)

Compute SVM classification accuracy.

Parameters:
  • X (array-like) – Features

  • y (array-like) – Labels

  • test_size (float) – Test set proportion

  • reg_param (float) – Regularization parameter

  • max_iter (int) – Maximum iterations

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use cuML GPU acceleration

Returns:

Classification accuracy

Return type:

float

dire_rapids.metrics.compute_knn_accuracy(X, y, n_neighbors=16, test_size=0.3, random_state=42, use_gpu=True)

Compute kNN classification accuracy.

Parameters:
  • X (array-like) – Features

  • y (array-like) – Labels

  • n_neighbors (int) – Number of neighbors

  • test_size (float) – Test set proportion

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use cuML GPU acceleration

Returns:

Classification accuracy

Return type:

float

dire_rapids.metrics.compute_svm_score(data, layout, labels, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)

Compute SVM context preservation score.

Compares SVM classification accuracy on high-dimensional data vs low-dimensional embedding.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like) – Class labels

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • **kwargs (dict) – Additional parameters for SVM

Returns:

[acc_hd, acc_ld, log_ratio]

Return type:

ndarray

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_knn_score(data, layout, labels, n_neighbors=16, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)

Compute kNN context preservation score.

Compares kNN classification accuracy on high-dimensional data vs low-dimensional embedding.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like) – Class labels

  • n_neighbors (int) – Number of neighbors for kNN

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • **kwargs (dict) – Additional parameters

Returns:

[acc_hd, acc_ld, log_ratio]

Return type:

ndarray

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_context_measures(data, layout, labels, subsample_threshold=0.5, n_neighbors=16, random_state=42, use_gpu=True, **kwargs)

Compute context preservation measures (SVM and kNN).

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like) – Class labels

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • n_neighbors (int) – Number of neighbors for kNN

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • **kwargs (dict) – Additional parameters

Returns:

Dictionary with ‘svm’ and ‘knn’ scores

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_h0_h1_knn(data, k_neighbors=20, density_threshold=0.8, overlap_factor=1.5, use_gpu=True, return_distances=False)

Compute H0/H1 using a local kNN atlas approach.

Builds dense local triangulations around each point, then merges them consistently. This avoids the “holes” problem of global sparse kNN graphs.

Automatically selects between the GPU and CPU implementations based on availability and the use_gpu parameter.

Parameters:
  • data (array-like) – Point cloud data (n_samples, n_features)

  • k_neighbors (int) – Size of local neighborhood (default 20, recommended 15-20 for noisy data)

  • density_threshold (float) – Percentile threshold for edge inclusion (0-1). Lower = denser triangulation. Default 0.8 means edges up to 80th percentile of local distances are included.

  • overlap_factor (float) – Factor for expanding local neighborhoods to ensure overlap (default 1.5). Higher values create more dense, overlapping patches.

  • use_gpu (bool) – Whether to use GPU acceleration (if available)

  • return_distances (bool) – If True, also return edge-to-distance mapping for persistence diagrams

Returns:

(h0_diagram, h1_diagram), or (h0_diagram, h1_diagram, edge_distances) if return_distances is True. The persistence diagrams hold [birth, death] pairs.

Return type:

tuple

dire_rapids.metrics.compute_persistence_diagrams_fast(data, layout, k_neighbors=30, use_gpu=True)

Fast computation of H0/H1 persistence diagrams using kNN-based sparse Rips.

Much faster than a full Vietoris-Rips computation because it uses only the kNN graph (O(nk) vs O(n²) edges). Builds a Rips complex from the kNN edges and computes persistence via Ripser.

Note: this function does not subsample internally; it expects already-subsampled data. This avoids double subsampling when called from compute_global_metrics.

Parameters:
  • data (array-like) – High-dimensional data (already subsampled if needed)

  • layout (array-like) – Low-dimensional embedding (already subsampled if needed)

  • k_neighbors (int) – Number of neighbors for kNN graph (default 30, recommended >= 20)

  • use_gpu (bool) – Whether to use GPU for kNN computation (default True)

Returns:

{‘data’: [h0_diag, h1_diag], ‘layout’: [h0_diag, h1_diag], ‘backend’: ‘fast’}

Return type:

dict

dire_rapids.metrics.compute_persistence_diagrams(data, layout, max_dim=1, subsample_threshold=0.5, random_state=42, backend=None, backend_kwargs=None)

Compute persistence diagrams for data and layout.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • max_dim (int) – Maximum homology dimension

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • backend (str, optional) – Persistence backend: ‘fast’, ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto

  • backend_kwargs (dict, optional) – Backend-specific parameters passed to the backend, using defaults unless specified:

      - ‘fast’: k_neighbors=30, use_gpu=True
      - ‘giotto-ph’: n_threads=-1, collapse_edges=True, return_generators=False
      - ‘ripser++’: Any valid parameters for ripserplusplus.run()
      - ‘ripser’: Any valid parameters for ripser.ripser()

Returns:

{‘data’: diagrams_hd, ‘layout’: diagrams_ld, ‘backend’: backend_used}

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.betti_curve(diagram, n_steps=100)

Compute Betti curve from a persistence diagram.

A Betti curve shows the number of topological features that persist at different filtration values.

Parameters:
  • diagram (array-like) – Persistence diagram as a list of (birth, death) tuples

  • n_steps (int) – Number of points in the curve

Returns:

(filtration_values, betti_numbers)

Return type:

tuple
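
A Betti curve can be sketched by counting, at each filtration value on an evenly spaced grid, the features that have been born but have not yet died. The version below is a plain-Python illustration. The grid bounds (smallest birth to largest finite value) and the half-open convention birth <= t < death are assumptions that may not match betti_curve exactly.

```python
import math


def betti_curve_sketch(diagram, n_steps=100):
    """Return (filtration_values, betti_numbers) for one persistence diagram."""
    if not diagram:
        return [], []
    lo = min(b for b, _ in diagram)
    # Assumption: the grid ends at the largest finite birth or death value.
    hi = max(v for pair in diagram for v in pair if math.isfinite(v))
    step = (hi - lo) / (n_steps - 1) if n_steps > 1 else 0.0
    ts = [lo + i * step for i in range(n_steps)]
    # A feature is alive at t when birth <= t < death.
    betti = [sum(1 for b, d in diagram if b <= t < d) for t in ts]
    return ts, betti


h0 = [(0.0, 1.0), (0.0, 2.0), (0.0, math.inf)]
ts, betti = betti_curve_sketch(h0, n_steps=5)
print(list(zip(ts, betti)))
```

In this example three components exist at the start, one merges at 1.0 and another at 2.0, so the curve steps down from 3 to 1 while the feature with infinite death stays alive throughout.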

dire_rapids.metrics.compute_dtw(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0)

Compute Dynamic Time Warping distance between Betti curves.

Parameters:
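  • axis_x_hd (array-like) – Filtration values (x-axis) of the high-dimensional Betti curve

  • axis_y_hd (array-like) – Betti numbers (y-axis) of the high-dimensional Betti curve

  • axis_x_ld (array-like) – Filtration values (x-axis) of the low-dimensional Betti curve

  • axis_y_ld (array-like) – Betti numbers (y-axis) of the low-dimensional Betti curve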

  • norm_factor (float) – Normalization factor

Returns:

DTW distance

Return type:

float
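
For readers unfamiliar with Dynamic Time Warping, the textbook dynamic program is sketched below in plain Python. It compares two value sequences only (for example the Betti numbers of two curves) with an absolute-difference cost and divides by norm_factor. How compute_dtw combines the x and y axes is not stated on this page, so this is a generic illustration and not the library's routine.

```python
def dtw_distance(seq_a, seq_b, norm_factor=1.0):
    """Classic O(len(a) * len(b)) dynamic time warping with |a - b| cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the cheapest alignment of the first i and first j items.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / norm_factor


betti_hd = [3, 3, 2, 2, 1]
betti_ld = [3, 2, 2, 1, 1]
print(dtw_distance(betti_hd, betti_ld))
print(dtw_distance([0, 0], [1, 1]))
```

The two Betti sequences take the same values in the same order and differ only in timing, so warping aligns them at zero cost. The second pair differs by 1 at every step and cannot be aligned for free.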

dire_rapids.metrics.compute_twed(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0)

Compute Time Warp Edit Distance between Betti curves.

Parameters:
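  • axis_x_hd (array-like) – Filtration values (x-axis) of the high-dimensional Betti curve

  • axis_y_hd (array-like) – Betti numbers (y-axis) of the high-dimensional Betti curve

  • axis_x_ld (array-like) – Filtration values (x-axis) of the low-dimensional Betti curve

  • axis_y_ld (array-like) – Betti numbers (y-axis) of the low-dimensional Betti curve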

  • norm_factor (float) – Normalization factor

Returns:

TWED distance

Return type:

float

dire_rapids.metrics.compute_emd(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, adjust_mass=False, norm_factor=1.0)

Compute Earth Mover’s Distance between Betti curves.

Parameters:
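  • axis_x_hd (array-like) – Filtration values (x-axis) of the high-dimensional Betti curve

  • axis_y_hd (array-like) – Betti numbers (y-axis) of the high-dimensional Betti curve

  • axis_x_ld (array-like) – Filtration values (x-axis) of the low-dimensional Betti curve

  • axis_y_ld (array-like) – Betti numbers (y-axis) of the low-dimensional Betti curve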

  • adjust_mass (bool) – Whether to adjust for different total masses

  • norm_factor (float) – Normalization factor

Returns:

EMD distance

Return type:

float

dire_rapids.metrics.compute_wasserstein(diag_hd, diag_ld, norm_factor=1.0)

Compute Wasserstein distance between persistence diagrams.

Simple implementation matching dire-jax (no special handling for infinite features).

Parameters:
  • diag_hd (array-like) – Persistence diagram of the high-dimensional data, as (birth, death) pairs

  • diag_ld (array-like) – Persistence diagram of the low-dimensional embedding, as (birth, death) pairs

  • norm_factor (float) – Normalization factor

Returns:

Wasserstein distance

Return type:

float

dire_rapids.metrics.compute_bottleneck(diag_hd, diag_ld, norm_factor=1.0)

Compute bottleneck distance between persistence diagrams.

Handles infinite death times by:

  1. Computing bottleneck on finite features

  2. Taking max with birth time difference for infinite features

Parameters:
  • diag_hd (array-like) – Persistence diagram of the high-dimensional data, as (birth, death) pairs

  • diag_ld (array-like) – Persistence diagram of the low-dimensional embedding, as (birth, death) pairs

  • norm_factor (float) – Normalization factor

Returns:

Bottleneck distance

Return type:

float

dire_rapids.metrics.compute_global_metrics(data, layout, dimension=1, subsample_threshold=0.5, random_state=42, n_steps=100, metrics_only=True, backend=None, backend_kwargs=None)

Compute global topological metrics based on persistent homology.

Computes distances between persistence diagrams and Betti curves:

  • DTW, TWED, EMD for Betti curves

  • Wasserstein, Bottleneck for persistence diagrams

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • dimension (int) – Maximum homology dimension

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • n_steps (int) – Number of points for Betti curves

  • metrics_only (bool) – If True, return only metrics; otherwise include diagrams and curves

  • backend (str, optional) – Persistence backend: ‘fast’, ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto

  • backend_kwargs (dict, optional) – Backend-specific parameters. See compute_persistence_diagrams() for details.

Returns:

Dictionary containing metrics (and optionally diagrams and Betti curves)

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.evaluate_embedding(data, layout, labels=None, n_neighbors=16, subsample_threshold=0.5, max_homology_dim=1, random_state=42, use_gpu=True, persistence_backend=None, n_threads=-1, compute_distortion=True, compute_context=True, compute_topology=True, **kwargs)

Comprehensive evaluation of a dimensionality reduction embedding.

Computes distortion, context preservation, and topological metrics.

Parameters:
  • data (array-like) – High-dimensional data (n_samples, n_features)

  • layout (array-like) – Low-dimensional embedding (n_samples, n_components)

  • labels (array-like, optional) – Class labels for context metrics

  • n_neighbors (int) – Number of neighbors for kNN metrics

  • subsample_threshold (float) – Subsampling probability for all metrics (must be between 0.0 and 1.0, default 0.5)

  • max_homology_dim (int) – Maximum homology dimension for persistence

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • persistence_backend (str, optional) – Persistence backend: ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto

  • n_threads (int) – Number of threads for giotto-ph (-1 for all cores)

  • compute_distortion (bool) – Whether to compute distortion metrics (default True)

  • compute_context (bool) – Whether to compute context metrics (default True)

  • compute_topology (bool) – Whether to compute topological metrics (default True)

  • **kwargs (dict) – Additional parameters for specific metrics

Returns:

Dictionary with all computed metrics

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0
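
A possible way to call evaluate_embedding, using only the parameters listed above. The synthetic data, the helper name run_evaluation_example, and the choice to pass NumPy arrays are assumptions for illustration. The structure of the returned dictionary is not documented on this page, so the sketch simply prints it.

```python
def run_evaluation_example(n_samples=500, seed=0):
    """Call evaluate_embedding on synthetic data; returns None if deps are missing."""
    try:
        import numpy as np
        from dire_rapids.metrics import evaluate_embedding
    except ImportError:
        print("numpy and dire_rapids are required for this example")
        return None

    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_samples, 20)).astype(np.float32)
    layout = data[:, :2].copy()  # stand-in for a real 2-D embedding
    labels = (data[:, 0] > 0).astype(np.int32)

    # Topology is switched off here so no persistence backend is needed.
    return evaluate_embedding(
        data, layout, labels=labels,
        n_neighbors=16, subsample_threshold=0.5,
        use_gpu=False, compute_topology=False,
    )


if __name__ == "__main__":
    print(run_evaluation_example())
```

Setting compute_topology=True would additionally require one of the persistence backends listed at the top of this page.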