dire_rapids.metrics module
Performance metrics for dimensionality reduction evaluation.
This module provides GPU-accelerated metrics using RAPIDS cuML for evaluating the quality of dimensionality reduction embeddings, including:
- Distortion metrics (stress)
- Context preservation metrics (SVM, kNN classification)
- Topological metrics (persistent homology, Betti curves)
The module supports multiple backends for persistence computation:
- giotto-ph (fastest CPU, multi-threaded)
- ripser++ (GPU-accelerated)
- ripser (CPU fallback)
- dire_rapids.metrics.get_available_persistence_backends()
Get list of available persistence computation backends.
- Returns:
Dictionary mapping backend names to availability status
- Return type:
dict
- dire_rapids.metrics.set_persistence_backend(backend)
Set the persistence computation backend.
- Parameters:
backend (str or None) – Backend to use: ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto-selection
- Raises:
ValueError – If specified backend is not available
- dire_rapids.metrics.get_persistence_backend()
Get the current persistence backend (with auto-selection if None).
- Returns:
Name of the selected backend
- Return type:
str
- Raises:
RuntimeError – If no persistence backend is available
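The availability check and auto-selection described above can be sketched with the standard library alone. This is an illustration, not the module's implementation: the function names `available_backends` and `select_backend` are hypothetical, the import name `gph` for giotto-ph is an assumption, and the priority order simply follows the order in which the backends are listed in the module description.

```python
from importlib.util import find_spec

# Assumed import names for each backend (hypothetical mapping). The
# priority order mirrors the order used in the module description.
BACKEND_MODULES = {
    "giotto-ph": "gph",
    "ripser++": "ripserplusplus",
    "ripser": "ripser",
}


def available_backends(modules=BACKEND_MODULES):
    """Map each backend name to True if its module appears importable."""
    status = {}
    for name, module in modules.items():
        try:
            status[name] = find_spec(module) is not None
        except (ImportError, ValueError):
            status[name] = False
    return status


def select_backend(requested=None, status=None):
    """Return the requested backend, or the first available one if None."""
    status = available_backends() if status is None else status
    if requested is not None:
        if not status.get(requested, False):
            raise ValueError(f"Backend {requested!r} is not available")
        return requested
    for name, ok in status.items():
        if ok:
            return name
    raise RuntimeError("No persistence backend is available")
```

Passing `status` explicitly makes the selection logic easy to exercise without installing any backend. The two error cases mirror the `ValueError` and `RuntimeError` documented for `set_persistence_backend()` and `get_persistence_backend()`.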
- dire_rapids.metrics.welford_update_gpu(count, mean, M2, new_value, finite_threshold=1000000000000.0)
GPU-accelerated Welford’s algorithm update step.
- Parameters:
count (cupy.ndarray) – Running count of valid values
mean (cupy.ndarray) – Running mean
M2 (cupy.ndarray) – Running sum of squared differences
new_value (cupy.ndarray) – New values to incorporate
finite_threshold (float) – Maximum magnitude for inclusion
- Returns:
Updated (count, mean, M2)
- Return type:
tuple
- dire_rapids.metrics.welford_finalize_gpu(count, mean, M2)
Finalize Welford’s algorithm to compute mean and std.
- Parameters:
count (cupy.ndarray) – Total count of valid values
mean (cupy.ndarray) – Computed mean
M2 (cupy.ndarray) – Sum of squared differences
- Returns:
(mean, std)
- Return type:
tuple
- dire_rapids.metrics.welford_gpu(data)
GPU-accelerated computation of mean and std using Welford’s algorithm.
- Parameters:
data (cupy.ndarray) – Input data
- Returns:
(mean, std)
- Return type:
tuple
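For reference, the update and finalize steps of Welford's algorithm can be written for scalars in plain Python. This is a minimal CPU sketch, not the GPU code: the names `welford_update` and `welford_finalize` are hypothetical, the exact comparison against `finite_threshold` is an assumption, and so is the choice of the sample standard deviation (division by `count - 1`).

```python
import math


def welford_update(count, mean, m2, new_value, finite_threshold=1e12):
    """Fold one value into the running (count, mean, M2) statistics."""
    # Assumption: non-finite or very large values are skipped.
    if not math.isfinite(new_value) or abs(new_value) >= finite_threshold:
        return count, mean, m2
    count += 1
    delta = new_value - mean
    mean += delta / count
    # Uses the updated mean, which keeps the recurrence numerically stable.
    m2 += delta * (new_value - mean)
    return count, mean, m2


def welford_finalize(count, mean, m2):
    """Return (mean, std); std is the sample standard deviation (assumed)."""
    if count < 2:
        return mean, 0.0
    return mean, math.sqrt(m2 / (count - 1))
```

A typical loop starts from `(0, 0.0, 0.0)`, calls `welford_update` once per value, and calls `welford_finalize` at the end, so the data never has to be held in memory all at once.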
- dire_rapids.metrics.threshold_subsample_gpu(data, layout, labels=None, threshold=0.5, random_state=42)
GPU-accelerated Bernoulli subsampling of data.
- Parameters:
- Returns:
Subsampled arrays
- Return type:
tuple
- Raises:
ValueError – If threshold is not between 0.0 and 1.0
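Bernoulli subsampling keeps each sample independently with probability `threshold` and applies the same keep/drop decision to every array, so rows stay aligned. The sketch below shows the idea in plain Python. The name `bernoulli_subsample` is hypothetical, and returning two arrays when `labels` is `None` and three otherwise is an assumption, since the return shape is not spelled out here.

```python
import random


def bernoulli_subsample(data, layout, labels=None, threshold=0.5, random_state=42):
    """Keep each row with probability `threshold`, using one shared mask."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if len(data) != len(layout):
        raise ValueError("data and layout must have the same number of rows")

    rng = random.Random(random_state)
    # One Bernoulli draw per row; the same mask is reused for every array.
    keep = [rng.random() < threshold for _ in range(len(data))]

    sub_data = [row for row, k in zip(data, keep) if k]
    sub_layout = [row for row, k in zip(layout, keep) if k]
    if labels is None:
        return sub_data, sub_layout
    sub_labels = [lab for lab, k in zip(labels, keep) if k]
    return sub_data, sub_layout, sub_labels
```

Because each row is an independent draw, the number of retained rows varies around `threshold * n_samples` rather than being fixed.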
- dire_rapids.metrics.make_knn_graph_gpu(data, n_neighbors, batch_size=50000)
GPU-accelerated kNN graph construction using cuML.
- Parameters:
- Returns:
(distances, indices) arrays of shape (n_samples, n_neighbors+1)
- Return type:
tuple
- dire_rapids.metrics.make_knn_graph_cpu(data, n_neighbors, batch_size=10000)
CPU fallback for kNN graph construction.
- Parameters:
- Returns:
(distances, indices) arrays
- Return type:
tuple
- dire_rapids.metrics.compute_stress(data, layout, n_neighbors, eps=1e-06, use_gpu=True)
Compute normalized stress (distortion) of an embedding.
This metric measures how well distances are preserved between the high-dimensional data and low-dimensional layout.
- Parameters:
data (array-like) – High-dimensional data (n_samples, n_features)
layout (array-like) – Low-dimensional embedding (n_samples, n_components)
n_neighbors (int) – Number of nearest neighbors to consider
eps (float) – Small constant to prevent division by zero
use_gpu (bool) – Whether to use GPU acceleration
- Returns:
Normalized stress value
- Return type:
float
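The exact stress formula is not stated here, so the following is only an illustration of the general idea. It uses one common definition, a Kruskal-type normalized stress restricted to each point's k nearest neighbors in the high-dimensional data. The function names, the brute-force neighbor search, and the placement of `eps` in the denominator are all assumptions, and the library's value may differ.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbors (excluding the point itself)."""
    n = len(points)
    return [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))[:k]
        for i in range(n)
    ]


def normalized_stress(data, layout, n_neighbors, eps=1e-6):
    """sqrt(sum (d_hd - d_ld)^2 / (sum d_hd^2 + eps)) over kNN pairs."""
    numerator = 0.0
    denominator = 0.0
    for i, neighbors in enumerate(knn_indices(data, n_neighbors)):
        for j in neighbors:
            d_hd = math.dist(data[i], data[j])      # distance in the data
            d_ld = math.dist(layout[i], layout[j])  # distance in the layout
            numerator += (d_hd - d_ld) ** 2
            denominator += d_hd ** 2
    return math.sqrt(numerator / (denominator + eps))
```

Under this definition a layout that reproduces all neighbor distances exactly scores 0, and larger values indicate more distortion.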
- dire_rapids.metrics.compute_neighbor_score(data, layout, n_neighbors, use_gpu=True)
Compute neighborhood preservation score.
Measures how well k-nearest neighbor relationships are preserved from high-dimensional to low-dimensional space.
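One common way to quantify neighborhood preservation is the average overlap between each point's k-nearest-neighbor set in the data and in the layout. The sketch below implements that definition with a brute-force search. It is an illustration under that assumption, the name `neighbor_overlap_score` is hypothetical, and the library may aggregate the per-point scores differently.

```python
import math


def knn_sets(points, k):
    """Brute-force k-nearest-neighbor index sets (self excluded)."""
    n = len(points)
    return [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))[:k])
        for i in range(n)
    ]


def neighbor_overlap_score(data, layout, n_neighbors):
    """Mean fraction of high-dimensional neighbors kept in the layout."""
    hd_sets = knn_sets(data, n_neighbors)
    ld_sets = knn_sets(layout, n_neighbors)
    overlaps = [len(hd & ld) / n_neighbors for hd, ld in zip(hd_sets, ld_sets)]
    return sum(overlaps) / len(overlaps)
```

With this definition the score lies between 0 and 1, and 1 means every point keeps exactly the same neighbor set in the embedding.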
- dire_rapids.metrics.compute_local_metrics(data, layout, n_neighbors, subsample_threshold=1.0, random_state=42, use_gpu=True)
Compute local quality metrics (stress and neighborhood preservation).
- Parameters:
data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
n_neighbors (int) – Number of neighbors for kNN graph
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0, default 1.0 = no subsampling)
random_state (int) – Random seed for subsampling
use_gpu (bool) – Whether to use GPU acceleration
- Returns:
Dictionary containing ‘stress’ and ‘neighbor’ metrics
- Return type:
dict
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0
- dire_rapids.metrics.compute_svm_accuracy(X, y, test_size=0.3, reg_param=1.0, max_iter=1000, random_state=42, use_gpu=True)
Compute SVM classification accuracy.
- Parameters:
- Returns:
Classification accuracy
- Return type:
float
- dire_rapids.metrics.compute_knn_accuracy(X, y, n_neighbors=16, test_size=0.3, random_state=42, use_gpu=True)
Compute kNN classification accuracy.
- dire_rapids.metrics.compute_svm_score(data, layout, labels, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)
Compute SVM context preservation score.
Compares SVM classification accuracy on high-dimensional data vs low-dimensional embedding.
- Parameters:
data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like) – Class labels
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
**kwargs (dict) – Additional parameters for SVM
- Returns:
[acc_hd, acc_ld, log_ratio]
- Return type:
ndarray
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0
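The three returned values can be read as the accuracy on the high-dimensional data, the accuracy on the embedding, and a log ratio of the two. The sketch below shows one plausible way to assemble them. The direction of the ratio (`acc_ld / acc_hd`), the `eps` guard, and the name `context_score` are assumptions, since the precise definition of `log_ratio` is not given here.

```python
import math


def context_score(acc_hd, acc_ld, eps=1e-12):
    """Return [acc_hd, acc_ld, log(acc_ld / acc_hd)] (assumed definition)."""
    # eps guards against log(0) and division by zero for degenerate accuracies.
    log_ratio = math.log((acc_ld + eps) / (acc_hd + eps))
    return [acc_hd, acc_ld, log_ratio]
```

Under this convention a log ratio of 0 means the classifier does equally well on both representations, and a negative value means accuracy dropped in the embedding.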
- dire_rapids.metrics.compute_knn_score(data, layout, labels, n_neighbors=16, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)
Compute kNN context preservation score.
Compares kNN classification accuracy on high-dimensional data vs low-dimensional embedding.
- Parameters:
data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like) – Class labels
n_neighbors (int) – Number of neighbors for kNN
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
**kwargs (dict) – Additional parameters
- Returns:
[acc_hd, acc_ld, log_ratio]
- Return type:
ndarray
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0
- dire_rapids.metrics.compute_context_measures(data, layout, labels, subsample_threshold=0.5, n_neighbors=16, random_state=42, use_gpu=True, **kwargs)
Compute context preservation measures (SVM and kNN).
- Parameters:
data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
labels (array-like) – Class labels
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
n_neighbors (int) – Number of neighbors for kNN
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
**kwargs (dict) – Additional parameters
- Returns:
Dictionary with ‘svm’ and ‘knn’ scores
- Return type:
dict
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0
- dire_rapids.metrics.compute_h0_h1_knn(data, k_neighbors=20, density_threshold=0.8, overlap_factor=1.5, use_gpu=True, return_distances=False)
Compute H0/H1 using local kNN atlas approach.
Build dense local triangulations around each point, then merge consistently. This avoids the “holes” problem of global sparse kNN graphs.
Automatically selects between GPU and CPU implementation based on availability and use_gpu parameter.
- Parameters:
data (array-like) – Point cloud data (n_samples, n_features)
k_neighbors (int) – Size of local neighborhood (default 20, recommended 15-20 for noisy data)
density_threshold (float) – Percentile threshold for edge inclusion (0-1). Lower = denser triangulation. Default 0.8 means edges up to 80th percentile of local distances are included.
overlap_factor (float) – Factor for expanding local neighborhoods to ensure overlap (default 1.5). Higher values create more dense, overlapping patches.
use_gpu (bool) – Whether to use GPU acceleration (if available)
return_distances (bool) – If True, also return edge-to-distance mapping for persistence diagrams
- Returns:
(h0_diagram, h1_diagram) or (h0_diagram, h1_diagram, edge_distances) – Persistence diagrams with [birth, death] pairs
- Return type:
tuple
- dire_rapids.metrics.compute_persistence_diagrams_fast(data, layout, k_neighbors=30, use_gpu=True)
Fast computation of H0/H1 persistence diagrams using kNN-based sparse Rips.
Much faster than a full Vietoris-Rips computation because it uses only the kNN graph (O(nk) edges instead of O(n²)). Builds a Rips complex from the kNN edges and computes persistence via Ripser.
Note: This function does not subsample internally; it expects already-subsampled data. This avoids double subsampling when it is called from compute_global_metrics.
- Parameters:
data (array-like) – High-dimensional data (already subsampled if needed)
layout (array-like) – Low-dimensional embedding (already subsampled if needed)
k_neighbors (int) – Number of neighbors for kNN graph (default 30, recommended >= 20)
use_gpu (bool) – Whether to use GPU for kNN computation (default True)
- Returns:
{‘data’: [h0_diag, h1_diag], ‘layout’: [h0_diag, h1_diag], ‘backend’: ‘fast’}
- Return type:
dict
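To make the sparse-Rips idea concrete, the sketch below computes only the H0 diagram from kNN edges, using a union-find pass over edges sorted by length (the same structure as Kruskal's algorithm). It is an illustration rather than the library's implementation, which delegates persistence to Ripser and also computes H1. The names `knn_edges` and `h0_persistence` are hypothetical.

```python
import math


def knn_edges(points, k):
    """Undirected kNN edges as sorted (length, i, j) triples, brute force."""
    n = len(points)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return sorted((math.dist(points[a], points[b]), a, b) for a, b in edges)


def h0_persistence(points, k):
    """H0 (birth, death) pairs of the filtration built on kNN edges only."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    diagram = []
    for length, a, b in knn_edges(points, k):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            # Two components merge: one of them dies at this edge length.
            diagram.append((0.0, length))
    # Components never merged by a kNN edge persist forever.
    survivors = len({find(i) for i in range(len(points))})
    diagram.extend((0.0, math.inf) for _ in range(survivors))
    return diagram
```

With a small `k`, well-separated clusters may never be joined by a kNN edge and then show up as several infinite H0 bars. That is one reason a reasonably large neighbor count matters for this kind of approximation.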
- dire_rapids.metrics.compute_persistence_diagrams(data, layout, max_dim=1, subsample_threshold=0.5, random_state=42, backend=None, backend_kwargs=None)
Compute persistence diagrams for data and layout.
- Parameters:
data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
max_dim (int) – Maximum homology dimension
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
backend (str, optional) – Persistence backend: ‘fast’, ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto
backend_kwargs (dict, optional) – Backend-specific parameters passed to the backend, using defaults unless specified:
- ‘fast’: k_neighbors=30, use_gpu=True
- ‘giotto-ph’: n_threads=-1, collapse_edges=True, return_generators=False
- ‘ripser++’: Any valid parameters for ripserplusplus.run()
- ‘ripser’: Any valid parameters for ripser.ripser()
- Returns:
{‘data’: diagrams_hd, ‘layout’: diagrams_ld, ‘backend’: backend_used}
- Return type:
dict
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0
- dire_rapids.metrics.betti_curve(diagram, n_steps=100)
Compute Betti curve from a persistence diagram.
A Betti curve shows the number of topological features that persist at different filtration values.
- Parameters:
diagram (array-like) – Persistence diagram as list of (birth, death) tuples
n_steps (int) – Number of points in the curve
- Returns:
(filtration_values, betti_numbers)
- Return type:
tuple
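A Betti curve can be computed directly from its definition: at each filtration value t, count the features with birth <= t < death. The sketch below does this on an evenly spaced grid. The grid range (smallest birth to largest finite death) and the name `betti_curve_sketch` are assumptions, and the library may treat the endpoints or infinite deaths differently.

```python
import math


def betti_curve_sketch(diagram, n_steps=100):
    """Return (filtration_values, betti_numbers) for (birth, death) pairs."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    if not diagram:
        return [], []

    births = [birth for birth, _ in diagram]
    finite_deaths = [death for _, death in diagram if math.isfinite(death)]
    lo = min(births)
    # Assumption: infinite deaths do not stretch the grid.
    hi = max(finite_deaths) if finite_deaths else max(births)

    step = (hi - lo) / (n_steps - 1)
    filtration = [lo + i * step for i in range(n_steps)]
    # A feature is alive at t when birth <= t < death.
    betti = [sum(1 for birth, death in diagram if birth <= t < death)
             for t in filtration]
    return filtration, betti
```

For example, the diagram `[(0.0, 1.0), (0.5, 2.0)]` sampled at five steps gives the grid `[0.0, 0.5, 1.0, 1.5, 2.0]` and the counts `[1, 2, 1, 1, 0]`.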
- dire_rapids.metrics.compute_dtw(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0)
Compute Dynamic Time Warping distance between Betti curves.
- Parameters:
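axis_x_hd (array-like) – x-axis values (filtration values) of the high-dimensional Betti curve
axis_y_hd (array-like) – y-axis values (Betti numbers) of the high-dimensional Betti curve
axis_x_ld (array-like) – x-axis values (filtration values) of the low-dimensional Betti curve
axis_y_ld (array-like) – y-axis values (Betti numbers) of the low-dimensional Betti curve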
norm_factor (float) – Normalization factor
- Returns:
DTW distance
- Return type:
float
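For readers unfamiliar with Dynamic Time Warping, the classic dynamic-programming formulation is sketched below for two one-dimensional sequences. This is an illustration only: it compares the y-values of two curves with an absolute-difference cost and ignores the x-axis arrays that `compute_dtw` accepts. The name `dtw_distance` and the way `norm_factor` divides the result are assumptions.

```python
def dtw_distance(seq_a, seq_b, norm_factor=1.0):
    """Classic O(n*m) DTW with |a - b| as the local cost."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat seq_b[j-1]
                                     cost[i][j - 1],      # repeat seq_a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m] / norm_factor
```

Because DTW allows one sequence to stretch against the other, `[0, 1, 2]` and `[0, 1, 1, 2]` are at distance 0 under this sketch, even though their lengths differ.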
- dire_rapids.metrics.compute_twed(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, norm_factor=1.0)
Compute Time Warp Edit Distance between Betti curves.
- Parameters:
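axis_x_hd (array-like) – x-axis values (filtration values) of the high-dimensional Betti curve
axis_y_hd (array-like) – y-axis values (Betti numbers) of the high-dimensional Betti curve
axis_x_ld (array-like) – x-axis values (filtration values) of the low-dimensional Betti curve
axis_y_ld (array-like) – y-axis values (Betti numbers) of the low-dimensional Betti curve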
norm_factor (float) – Normalization factor
- Returns:
TWED distance
- Return type:
float
- dire_rapids.metrics.compute_emd(axis_x_hd, axis_y_hd, axis_x_ld, axis_y_ld, adjust_mass=False, norm_factor=1.0)
Compute Earth Mover’s Distance between Betti curves.
- Parameters:
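axis_x_hd (array-like) – x-axis values (filtration values) of the high-dimensional Betti curve
axis_y_hd (array-like) – y-axis values (Betti numbers) of the high-dimensional Betti curve
axis_x_ld (array-like) – x-axis values (filtration values) of the low-dimensional Betti curve
axis_y_ld (array-like) – y-axis values (Betti numbers) of the low-dimensional Betti curve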
adjust_mass (bool) – Whether to adjust for different total masses
norm_factor (float) – Normalization factor
- Returns:
EMD distance
- Return type:
float
- dire_rapids.metrics.compute_wasserstein(diag_hd, diag_ld, norm_factor=1.0)
Compute Wasserstein distance between persistence diagrams.
Simple implementation matching dire-jax (no special handling for infinite features).
- Parameters:
diag_hd (array-like) – Persistence diagram of the high-dimensional data, as (birth, death) pairs
diag_ld (array-like) – Persistence diagram of the low-dimensional layout, as (birth, death) pairs
norm_factor (float) – Normalization factor
- Returns:
Wasserstein distance
- Return type:
float
- dire_rapids.metrics.compute_bottleneck(diag_hd, diag_ld, norm_factor=1.0)
Compute bottleneck distance between persistence diagrams.
Handles infinite death times by:
1. Computing bottleneck on finite features
2. Taking max with birth time difference for infinite features
- Parameters:
diag_hd (array-like) – Persistence diagram of the high-dimensional data, as (birth, death) pairs
diag_ld (array-like) – Persistence diagram of the low-dimensional layout, as (birth, death) pairs
norm_factor (float) – Normalization factor
- Returns:
Bottleneck distance
- Return type:
float
- dire_rapids.metrics.compute_global_metrics(data, layout, dimension=1, subsample_threshold=0.5, random_state=42, n_steps=100, metrics_only=True, backend=None, backend_kwargs=None)
Compute global topological metrics based on persistent homology.
Computes distances between persistence diagrams and Betti curves:
- DTW, TWED, EMD for Betti curves
- Wasserstein, Bottleneck for persistence diagrams
- Parameters:
data (array-like) – High-dimensional data
layout (array-like) – Low-dimensional embedding
dimension (int) – Maximum homology dimension
subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)
random_state (int) – Random seed
n_steps (int) – Number of points for Betti curves
metrics_only (bool) – If True, return only metrics; otherwise include diagrams and curves
backend (str, optional) – Persistence backend: ‘fast’, ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto
backend_kwargs (dict, optional) – Backend-specific parameters. See compute_persistence_diagrams() for details.
- Returns:
Dictionary containing metrics (and optionally diagrams and betti curves)
- Return type:
dict
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0
- dire_rapids.metrics.evaluate_embedding(data, layout, labels=None, n_neighbors=16, subsample_threshold=0.5, max_homology_dim=1, random_state=42, use_gpu=True, persistence_backend=None, n_threads=-1, compute_distortion=True, compute_context=True, compute_topology=True, **kwargs)
Comprehensive evaluation of a dimensionality reduction embedding.
Computes distortion, context preservation, and topological metrics.
- Parameters:
data (array-like) – High-dimensional data (n_samples, n_features)
layout (array-like) – Low-dimensional embedding (n_samples, n_components)
labels (array-like, optional) – Class labels for context metrics
n_neighbors (int) – Number of neighbors for kNN metrics
subsample_threshold (float) – Subsampling probability for all metrics (must be between 0.0 and 1.0, default 0.5)
max_homology_dim (int) – Maximum homology dimension for persistence
random_state (int) – Random seed
use_gpu (bool) – Whether to use GPU acceleration
persistence_backend (str, optional) – Persistence backend: ‘giotto-ph’, ‘ripser++’, ‘ripser’, or None for auto
n_threads (int) – Number of threads for giotto-ph (-1 for all cores)
compute_distortion (bool) – Whether to compute distortion metrics (default True)
compute_context (bool) – Whether to compute context metrics (default True)
compute_topology (bool) – Whether to compute topological metrics (default True)
**kwargs (dict) – Additional parameters for specific metrics
- Returns:
Dictionary with all computed metrics
- Return type:
dict
- Raises:
ValueError – If subsample_threshold is not between 0.0 and 1.0