dire_rapids.metrics module

Performance metrics for dimensionality reduction evaluation.

This module provides GPU-accelerated metrics using RAPIDS cuML for evaluating the quality of dimensionality reduction embeddings, including:

Distortion metrics (stress)
Context preservation metrics (SVM, kNN classification)
Topological metrics (persistent homology, Betti curves)

The module supports multiple backends for persistence computation:

- giotto-ph (fastest CPU, multi-threaded)
- ripser++ (GPU-accelerated)
- ripser (CPU fallback)

dire_rapids.metrics.get_available_persistence_backends()

Get list of available persistence computation backends.

Returns: Dictionary mapping backend names to availability status
Return type: dict

dire_rapids.metrics.set_persistence_backend(backend)

Set the persistence computation backend.

Parameters: backend (str or None) – Backend to use: ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto-selection
Raises: ValueError – If the specified backend is not available

dire_rapids.metrics.get_persistence_backend()

Get the current persistence backend (with auto-selection if None).

Returns: Name of the selected backend
Return type: str
Raises: RuntimeError – If no persistence backend is available
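
The availability check and auto-selection described by the three functions above can be sketched as follows. This is only an illustration, not the module's implementation: the import names (gph, ripserplusplus, ripser), the priority order, and the helper names available_backends and select_backend are all assumptions.

```python
from importlib.util import find_spec

# Assumed import name for each backend; verify against your environment.
# Dict order doubles as the assumed auto-selection priority.
_BACKEND_MODULES = {
    "giotto-ph": "gph",
    "ripser++": "ripserplusplus",
    "ripser": "ripser",
}


def available_backends():
    """Map each backend name to True if its module can be located."""
    return {name: find_spec(mod) is not None
            for name, mod in _BACKEND_MODULES.items()}


def select_backend(requested=None, availability=None):
    """Return the requested backend, or the first available one."""
    if availability is None:
        availability = available_backends()
    if requested is not None:
        if not availability.get(requested, False):
            raise ValueError(f"Backend {requested!r} is not available")
        return requested
    for name in _BACKEND_MODULES:
        if availability[name]:
            return name
    raise RuntimeError("No persistence backend is available")
```

Passing an explicit availability mapping keeps the selection logic testable on machines where none of the backends are installed.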

dire_rapids.metrics.welford_update_gpu(count, mean, M2, new_value, finite_threshold=1000000000000.0)

GPU-accelerated Welford’s algorithm update step.

Parameters:

count (cupy.ndarray) – Running count of valid values
mean (cupy.ndarray) – Running mean
M2 (cupy.ndarray) – Running sum of squared differences
new_value (cupy.ndarray) – New values to incorporate
finite_threshold (float) – Maximum magnitude for inclusion

Returns:

Updated (count, mean, M2)

Return type:

tuple

dire_rapids.metrics.welford_finalize_gpu(count, mean, M2)

Finalize Welford’s algorithm to compute mean and std.

Parameters:

count (cupy.ndarray) – Total count of valid values
mean (cupy.ndarray) – Computed mean
M2 (cupy.ndarray) – Sum of squared differences

Returns:

(mean, std)

Return type:

tuple

dire_rapids.metrics.welford_gpu(data)

GPU-accelerated computation of mean and std using Welford’s algorithm.

Parameters: data (cupy.ndarray) – Input data
Returns: (mean, std)
Return type: tuple
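
For reference, the update and finalize steps of Welford's algorithm can be sketched on plain Python floats. This is a CPU illustration, not the CuPy implementation. Two choices here are assumptions: values are skipped when they are non-finite or when their magnitude reaches finite_threshold, and the standard deviation is the population form (M2 divided by count).

```python
import math


def welford_update(count, mean, m2, new_value, finite_threshold=1e12):
    """Fold one value into the running statistics, skipping invalid values."""
    if not math.isfinite(new_value) or abs(new_value) >= finite_threshold:
        return count, mean, m2
    count += 1
    delta = new_value - mean          # distance from the old mean
    mean += delta / count
    m2 += delta * (new_value - mean)  # uses both the old and the new mean
    return count, mean, m2


def welford_finalize(count, mean, m2):
    """Return (mean, std); the population std is an assumption."""
    if count == 0:
        return float("nan"), float("nan")
    return mean, math.sqrt(m2 / count)


def welford(values):
    """Single-pass mean and std over an iterable of floats."""
    count, mean, m2 = 0, 0.0, 0.0
    for value in values:
        count, mean, m2 = welford_update(count, mean, m2, value)
    return welford_finalize(count, mean, m2)
```

The single pass avoids storing the data and is numerically more stable than accumulating the sum and the sum of squares separately.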

dire_rapids.metrics.threshold_subsample_gpu(data, layout, labels=None, threshold=0.5, random_state=42)

GPU-accelerated Bernoulli subsampling of data.

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like, optional) – Data labels
threshold (float) – Probability of keeping each sample (must be between 0.0 and 1.0)
random_state (int) – Random seed

Returns:

Subsampled arrays

Return type:

tuple

Raises:

ValueError – If threshold is not between 0.0 and 1.0
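
Bernoulli subsampling keeps each sample independently with probability threshold and applies the same keep/drop decision to every array, so that rows stay aligned. A minimal CPU sketch using Python lists follows. The name bernoulli_subsample and the shape of the returned tuple when labels is None are assumptions.

```python
import random


def bernoulli_subsample(data, layout, labels=None, threshold=0.5, random_state=42):
    """Keep each row with probability `threshold`, using one shared mask."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    rng = random.Random(random_state)
    # One independent draw per sample; random() lies in [0, 1).
    keep = [rng.random() < threshold for _ in range(len(data))]

    def pick(rows):
        return [row for row, kept in zip(rows, keep) if kept]

    if labels is None:
        return pick(data), pick(layout)
    return pick(data), pick(layout), pick(labels)
```

Unlike sampling a fixed number of rows, the size of the result is random: on average it is threshold times the number of samples.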

dire_rapids.metrics.make_knn_graph_gpu(data, n_neighbors, batch_size=50000)

GPU-accelerated kNN graph construction using cuML.

Parameters:

data (array-like) – Data points (n_samples, n_features)
n_neighbors (int) – Number of nearest neighbors
batch_size (int, optional) – Batch size for querying neighbors (queries in batches against full dataset). Default 50000 balances GPU memory and performance. Set to None to query all at once.

Returns:

(distances, indices) arrays of shape (n_samples, n_neighbors+1)

Return type:

tuple
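
The batching scheme (queries are split into batches, each batch searched against the full dataset) can be illustrated with a brute-force search in pure Python. This is not the cuML code path, and the name knn_graph_batched is hypothetical. The extra column in the output is assumed to come from each point appearing as its own nearest neighbor at distance 0.

```python
import math


def knn_graph_batched(data, n_neighbors, batch_size=None):
    """Brute-force kNN; queries go in batches, each against all points."""
    n = len(data)
    step = max(1, n if batch_size is None else batch_size)
    distances, indices = [], []
    for start in range(0, n, step):
        for query in data[start:start + step]:
            # Rank every point (the query itself included) by distance.
            ranked = sorted((math.dist(query, point), j)
                            for j, point in enumerate(data))
            nearest = ranked[:n_neighbors + 1]
            distances.append([d for d, _ in nearest])
            indices.append([j for _, j in nearest])
    return distances, indices
```

In this sketch batching does not change the result or the memory use; on a GPU it bounds the size of the distance block held in memory at any one time.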

dire_rapids.metrics.make_knn_graph_cpu(data, n_neighbors, batch_size=10000)

CPU fallback for kNN graph construction.

Parameters:

data (array-like) – Data points
n_neighbors (int) – Number of nearest neighbors
batch_size (int, optional) – Batch size for querying neighbors (queries in batches against full dataset). Default 10000 provides good balance between memory and performance. Set to None to query all at once.

Returns:

(distances, indices) arrays

Return type:

tuple

dire_rapids.metrics.compute_stress(data, layout, n_neighbors, eps=1e-06, use_gpu=True)

Compute normalized stress (distortion) of an embedding.

This metric measures how well distances are preserved between the high-dimensional data and low-dimensional layout.

Parameters:

data (array-like) – High-dimensional data (n_samples, n_features)
layout (array-like) – Low-dimensional embedding (n_samples, n_components)
n_neighbors (int) – Number of nearest neighbors to consider
eps (float) – Small constant to prevent division by zero
use_gpu (bool) – Whether to use GPU acceleration

Returns:

Normalized stress value

Return type:

float
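
The exact normalization used by compute_stress is not stated here. As an illustration of the idea only, the sketch below computes a Kruskal-style stress restricted to each point's high-dimensional nearest neighbors. The formula and the name knn_stress are assumptions, and the library's value may be defined differently.

```python
import math


def knn_stress(data, layout, n_neighbors, eps=1e-6):
    """Kruskal-style stress over each point's high-dimensional kNN pairs."""
    num = 0.0
    den = 0.0
    for i, point in enumerate(data):
        # The n_neighbors closest points in the high-dimensional space.
        order = sorted((j for j in range(len(data)) if j != i),
                       key=lambda j: math.dist(point, data[j]))
        for j in order[:n_neighbors]:
            d_hd = math.dist(point, data[j])
            d_ld = math.dist(layout[i], layout[j])
            num += (d_hd - d_ld) ** 2
            den += d_hd ** 2
    # eps keeps the ratio defined when all neighbor distances are zero.
    return math.sqrt(num / (den + eps))
```

Under this definition a layout that preserves all neighbor distances scores 0, and a layout that doubles every distance scores approximately 1.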

dire_rapids.metrics.compute_neighbor_score(data, layout, n_neighbors, use_gpu=True)

Compute neighborhood preservation score.

Measures how well k-nearest neighbor relationships are preserved from high-dimensional to low-dimensional space.

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
n_neighbors (int) – Number of neighbors to consider
use_gpu (bool) – Whether to use GPU acceleration

Returns:

[mean_score, std_score]

Return type:

list
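
One common way to score neighborhood preservation is the fraction of each point's k nearest neighbors shared between the two spaces, summarized by its mean and standard deviation. The sketch below takes that approach with a brute-force search. Whether compute_neighbor_score uses exactly this overlap measure is an assumption, and neighbor_score is a hypothetical name.

```python
import math


def _knn_sets(points, k):
    """Set of the k nearest neighbor indices for every point (self excluded)."""
    sets = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        sets.append(set(order[:k]))
    return sets


def neighbor_score(data, layout, n_neighbors):
    """Return [mean, std] of the per-point kNN overlap fraction."""
    hd = _knn_sets(data, n_neighbors)
    ld = _knn_sets(layout, n_neighbors)
    overlaps = [len(a & b) / n_neighbors for a, b in zip(hd, ld)]
    mean = sum(overlaps) / len(overlaps)
    std = math.sqrt(sum((o - mean) ** 2 for o in overlaps) / len(overlaps))
    return [mean, std]
```

A mean of 1.0 means every point keeps all of its neighbors; a mean of 0.0 means none survive the embedding.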

dire_rapids.metrics.compute_local_metrics(data, layout, n_neighbors, subsample_threshold=1.0, random_state=42, use_gpu=True)

Compute local quality metrics (stress and neighborhood preservation).

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
n_neighbors (int) – Number of neighbors for kNN graph
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0, default 1.0 = no subsampling)
random_state (int) – Random seed for subsampling
use_gpu (bool) – Whether to use GPU acceleration

Returns:

Dictionary containing ‘stress’ and ‘neighbor’ metrics

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_svm_accuracy(X, y, test_size=0.3, reg_param=1.0, max_iter=1000, random_state=42, use_gpu=True)

Compute SVM classification accuracy.

Parameters:

X (array-like) – Features
y (array-like) – Labels
test_size (float) – Test set proportion
reg_param (float) – Regularization parameter
max_iter (int) – Maximum iterations
random_state (int) – Random seed
use_gpu (bool) – Whether to use cuML GPU acceleration

Returns:

Classification accuracy

Return type:

float

dire_rapids.metrics.compute_knn_accuracy(X, y, n_neighbors=16, test_size=0.3, random_state=42, use_gpu=True)

Compute kNN classification accuracy.

Parameters:

X (array-like) – Features
y (array-like) – Labels
n_neighbors (int) – Number of neighbors
test_size (float) – Test set proportion
random_state (int) – Random seed
use_gpu (bool) – Whether to use cuML GPU acceleration

Returns:

Classification accuracy

Return type:

float

dire_rapids.metrics.compute_svm_score(data, layout, labels, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)

Compute SVM context preservation score.

Compares SVM classification accuracy on high-dimensional data vs low-dimensional embedding.

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like) – Class labels
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
**kwargs (dict) – Additional parameters for SVM

Returns:

[acc_hd, acc_ld, log_ratio]

Return type:

ndarray

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_knn_score(data, layout, labels, n_neighbors=16, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)

Compute kNN context preservation score.

Compares kNN classification accuracy on high-dimensional data vs low-dimensional embedding.

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like) – Class labels
n_neighbors (int) – Number of neighbors for kNN
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
**kwargs (dict) – Additional parameters

Returns:

[acc_hd, acc_ld, log_ratio]

Return type:

ndarray

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0
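
The context score pairs a classifier's accuracy on the high-dimensional data with its accuracy on the embedding. The sketch below shows a majority-vote kNN accuracy on a held-out split and one plausible way to form the triple. The direction of the ratio (log of acc_ld over acc_hd), the eps guard, and both function names are assumptions rather than the documented behavior.

```python
import math
from collections import Counter


def knn_accuracy(train_x, train_y, test_x, test_y, n_neighbors):
    """Majority-vote kNN accuracy on a held-out split (brute force)."""
    correct = 0
    for x, y in zip(test_x, test_y):
        nearest = sorted(range(len(train_x)),
                         key=lambda j: math.dist(x, train_x[j]))
        votes = Counter(train_y[j] for j in nearest[:n_neighbors])
        correct += votes.most_common(1)[0][0] == y
    return correct / len(test_x)


def context_score(acc_hd, acc_ld, eps=1e-12):
    """Return [acc_hd, acc_ld, log_ratio]; the ratio direction is assumed."""
    return [acc_hd, acc_ld, math.log((acc_ld + eps) / (acc_hd + eps))]
```

With this convention a log ratio near 0 means the embedding preserves class structure about as well as the original data, and a negative value means accuracy dropped.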

dire_rapids.metrics.compute_context_measures(data, layout, labels, subsample_threshold=0.5, n_neighbors=16, random_state=42, use_gpu=True, **kwargs)

Compute context preservation measures (SVM and kNN).

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like) – Class labels
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
n_neighbors (int) – Number of neighbors for kNN
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
**kwargs (dict) – Additional parameters

Returns:

Dictionary with ‘svm’ and ‘knn’ scores

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_h0_h1_knn(data, k_neighbors=20, density_threshold=0.8, overlap_factor=1.5, use_gpu=True, return_distances=False)

Compute H0/H1 using local kNN atlas approach.

Builds dense local triangulations around each point, then merges them consistently. This avoids the “holes” problem of global sparse kNN graphs.

Automatically selects between the GPU and CPU implementations based on availability and the use_gpu parameter.

Parameters:

data (array-like) – Point cloud data (n_samples, n_features)
k_neighbors (int) – Size of local neighborhood (default 20, recommended 15-20 for noisy data)
density_threshold (float) – Percentile threshold for edge inclusion (0-1). Lower = denser triangulation. Default 0.8 means edges up to 80th percentile of local distances are included.
overlap_factor (float) – Factor for expanding local neighborhoods to ensure overlap (default 1.5). Higher values create more dense, overlapping patches.
use_gpu (bool) – Whether to use GPU acceleration (if available)
return_distances (bool) – If True, also return edge-to-distance mapping for persistence diagrams

Returns:

(h0_diagram, h1_diagram) or (h0_diagram, h1_diagram, edge_distances). Each persistence diagram holds [birth, death] pairs.

Return type:

tuple

dire_rapids.metrics.compute_persistence_diagrams_fast(data, layout, k_neighbors=30, use_gpu=True)

Fast computation of H0/H1 persistence diagrams using kNN-based sparse Rips.

Much faster than a full Vietoris-Rips computation because it uses only the kNN graph (O(nk) edges instead of O(n²)). Builds the Rips complex from the kNN edges and computes persistence via Ripser.

Note: This function does not subsample internally; it expects already-subsampled data. This avoids double subsampling when it is called from compute_global_metrics.

Parameters:

data (array-like) – High-dimensional data (already subsampled if needed)
layout (array-like) – Low-dimensional embedding (already subsampled if needed)
k_neighbors (int) – Number of neighbors for kNN graph (default 30, recommended >= 20)
use_gpu (bool) – Whether to use GPU for kNN computation (default True)

Returns:

{‘data’: [h0_diag, h1_diag], ‘layout’: [h0_diag, h1_diag], ‘backend’: ‘fast’}

Return type:

dict

dire_rapids.metrics.compute_persistence_diagrams(data, layout, max_dim=1, subsample_threshold=0.5, random_state=42, backend=None, backend_kwargs=None)

Compute persistence diagrams for data and layout.

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
max_dim (int) – Maximum homology dimension
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
backend (str, optional) – Persistence backend: ‘fast’, ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto
backend_kwargs (dict, optional) – Backend-specific parameters passed to the backend, using defaults unless specified:
- ‘fast’: k_neighbors=30, use_gpu=True
- ‘giotto-ph’: n_threads=-1, collapse_edges=True, return_generators=False
- ‘ripser++’: Any valid parameters for ripserplusplus.run()
- ‘ripser’: Any valid parameters for ripser.ripser()

Returns:

{‘data’: diagrams_hd, ‘layout’: diagrams_ld, ‘backend’: backend_used}

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.betti_curve(diagram, n_steps=100)

Compute Betti curve from a persistence diagram.

A Betti curve shows the number of topological features that persist at different filtration values.

Parameters:

diagram (array-like) – Persistence diagram as list of (birth, death) tuples
n_steps (int) – Number of points in the curve

Returns:

(filtration_values, betti_numbers)

Return type:

tuple
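
A Betti curve can be sketched by counting, at each filtration value t, the features with birth <= t < death. The filtration range used below (smallest birth to largest finite death) and the treatment of infinite deaths as always alive are assumptions, and betti_curve_sketch is a hypothetical name.

```python
import math


def betti_curve_sketch(diagram, n_steps=100):
    """Count features alive at n_steps evenly spaced filtration values."""
    if not diagram:
        return [], []
    lo = min(birth for birth, _ in diagram)
    finite_deaths = [death for _, death in diagram if math.isfinite(death)]
    hi = max(finite_deaths) if finite_deaths else lo
    step = (hi - lo) / max(n_steps - 1, 1)
    filtration_values = [lo + i * step for i in range(n_steps)]
    # A feature is alive on the half-open interval [birth, death).
    betti_numbers = [sum(1 for birth, death in diagram if birth <= t < death)
                     for t in filtration_values]
    return filtration_values, betti_numbers
```

For the diagram [(0.0, 1.0), (0.0, 2.0)] with n_steps=5, the filtration values are 0.0, 0.5, 1.0, 1.5, 2.0 and the Betti numbers are 2, 2, 1, 1, 0.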

dire_rapids.metrics.compute_dtw(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0)

Compute Dynamic Time Warping distance between Betti curves.

Parameters:

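axis_x_hd (array-like) – Filtration values (x-axis) of the high-dimensional Betti curve
axis_y_hd (array-like) – Betti numbers (y-axis) of the high-dimensional Betti curve
axis_x_ld (array-like) – Filtration values (x-axis) of the low-dimensional Betti curve
axis_y_ld (array-like) – Betti numbers (y-axis) of the low-dimensional Betti curve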
norm_factor (float) – Normalization factor

Returns:

DTW distance

Return type:

float
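
For orientation, the classic dynamic-programming form of DTW is sketched below on curves treated as sequences of (x, y) points with Euclidean point costs. Whether compute_dtw uses this exact recurrence, this point representation, or division by norm_factor is an assumption, and the function names are hypothetical.

```python
import math


def dtw_distance(curve_a, curve_b, norm_factor=1.0):
    """O(n*m) dynamic time warping between two sequences of points."""
    n, m = len(curve_a), len(curve_b)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning the first i and first j points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(curve_a[i - 1], curve_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat point of b
                                    cost[i][j - 1],      # repeat point of a
                                    cost[i - 1][j - 1])  # advance both
    # Dividing by norm_factor is an assumed normalization convention.
    return cost[n][m] / norm_factor


def betti_dtw(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0):
    """Pair the axes into point sequences and compare them with DTW."""
    hd = list(zip(axis_x_hd, axis_y_hd))
    ld = list(zip(axis_x_ld, axis_y_ld))
    return dtw_distance(hd, ld, norm_factor)
```

Because warping may repeat points, two curves that differ only by a duplicated sample have distance 0.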

dire_rapids.metrics.compute_twed(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0)

Compute Time Warp Edit Distance between Betti curves.

Parameters:

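axis_x_hd (array-like) – Filtration values (x-axis) of the high-dimensional Betti curve
axis_y_hd (array-like) – Betti numbers (y-axis) of the high-dimensional Betti curve
axis_x_ld (array-like) – Filtration values (x-axis) of the low-dimensional Betti curve
axis_y_ld (array-like) – Betti numbers (y-axis) of the low-dimensional Betti curve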
norm_factor (float) – Normalization factor

Returns:

TWED distance

Return type:

float

dire_rapids.metrics.compute_emd(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, adjust_mass=False, norm_factor=1.0)

Compute Earth Mover’s Distance between Betti curves.

Parameters:

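axis_x_hd (array-like) – Filtration values (x-axis) of the high-dimensional Betti curve
axis_y_hd (array-like) – Betti numbers (y-axis) of the high-dimensional Betti curve
axis_x_ld (array-like) – Filtration values (x-axis) of the low-dimensional Betti curve
axis_y_ld (array-like) – Betti numbers (y-axis) of the low-dimensional Betti curve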
adjust_mass (bool) – Whether to adjust for different total masses
norm_factor (float) – Normalization factor

Returns:

EMD distance

Return type:

float

dire_rapids.metrics.compute_wasserstein(diag_hd, diag_ld, norm_factor=1.0)

Compute Wasserstein distance between persistence diagrams.

Simple implementation matching dire-jax (no special handling for infinite features).

Parameters:

diag_hd (array-like) – Persistence diagram of the high-dimensional data, as (birth, death) pairs
diag_ld (array-like) – Persistence diagram of the low-dimensional embedding, as (birth, death) pairs
norm_factor (float) – Normalization factor

Returns:

Wasserstein distance

Return type:

float

dire_rapids.metrics.compute_bottleneck(diag_hd, diag_ld, norm_factor=1.0)

Compute bottleneck distance between persistence diagrams.

Handles infinite death times by:
1. Computing bottleneck on finite features
2. Taking max with birth time difference for infinite features

Parameters:

diag_hd (array-like) – Persistence diagram of the high-dimensional data, as (birth, death) pairs
diag_ld (array-like) – Persistence diagram of the low-dimensional embedding, as (birth, death) pairs
norm_factor (float) – Normalization factor

Returns:

Bottleneck distance

Return type:

float
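
The second step of the infinite-feature handling can be sketched as below. The bottleneck value on finite features is taken as an input computed elsewhere. Pairing infinite features by sorted birth time, ignoring unmatched extras, and the name combine_with_infinite are all assumptions made for illustration.

```python
import math


def combine_with_infinite(diag_hd, diag_ld, finite_bottleneck):
    """Max of the finite-feature bottleneck and the infinite-feature birth gap."""
    inf_hd = sorted(birth for birth, death in diag_hd if math.isinf(death))
    inf_ld = sorted(birth for birth, death in diag_ld if math.isinf(death))
    # Pair infinite features in order of birth time (assumed pairing rule);
    # zip silently ignores extras when the counts differ.
    birth_gap = max((abs(a - b) for a, b in zip(inf_hd, inf_ld)), default=0.0)
    return max(finite_bottleneck, birth_gap)
```

Features that never die cannot be matched to the diagonal at finite cost, which is why they are compared through their birth times only.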

dire_rapids.metrics.compute_global_metrics(data, layout, dimension=1, subsample_threshold=0.5, random_state=42, n_steps=100, metrics_only=True, backend=None, backend_kwargs=None)

Compute global topological metrics based on persistent homology.

Computes distances between persistence diagrams and Betti curves:
- DTW, TWED, EMD for Betti curves
- Wasserstein, Bottleneck for persistence diagrams

Parameters:

data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
dimension (int) – Maximum homology dimension
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
n_steps (int) – Number of points for Betti curves
metrics_only (bool) – If True, return only metrics; otherwise include diagrams and curves
backend (str, optional) – Persistence backend: ‘fast’, ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto
backend_kwargs (dict, optional) – Backend-specific parameters. See compute_persistence_diagrams() for details.

Returns:

Dictionary containing metrics (and optionally diagrams and Betti curves)

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.evaluate_embedding(data, layout, labels=None, n_neighbors=16, subsample_threshold=0.5, max_homology_dim=1, random_state=42, use_gpu=True, persistence_backend=None, n_threads=-1, compute_distortion=True, compute_context=True, compute_topology=True, **kwargs)

Comprehensive evaluation of a dimensionality reduction embedding.

Computes distortion, context preservation, and topological metrics.

Parameters:

data (array-like) – High-dimensional data (n_samples, n_features)
layout (array-like) – Low-dimensional embedding (n_samples, n_components)
labels (array-like, optional) – Class labels for context metrics
n_neighbors (int) – Number of neighbors for kNN metrics
subsample_threshold (float) – Subsampling probability for all metrics (must be between 0.0 and 1.0, default 0.5)
max_homology_dim (int) – Maximum homology dimension for persistence
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
persistence_backend (str, optional) – Persistence backend: ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto
n_threads (int) – Number of threads for giotto-ph (-1 for all cores)
compute_distortion (bool) – Whether to compute distortion metrics (default True)
compute_context (bool) – Whether to compute context metrics (default True)
compute_topology (bool) – Whether to compute topological metrics (default True)
**kwargs (dict) – Additional parameters for specific metrics

Returns:

Dictionary with all computed metrics

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0