dire_rapids.metrics module

Performance metrics for dimensionality reduction evaluation.

This module provides GPU-accelerated metrics using RAPIDS cuML for evaluating the quality of dimensionality reduction embeddings, including:

  • Distortion metrics (stress)

  • Context preservation metrics (SVM, kNN classification)

  • Topological metrics (persistent homology, Betti curves)

The module supports multiple backends for persistence computation:

  • giotto-ph (fastest CPU, multi-threaded)

  • ripser++ (GPU-accelerated)

  • ripser (CPU fallback)

dire_rapids.metrics.welford_update_gpu(count, mean, M2, new_value, finite_threshold=1000000000000.0)

GPU-accelerated Welford’s algorithm update step.

Parameters:
  • count (cupy.ndarray) – Running count of valid values

  • mean (cupy.ndarray) – Running mean

  • M2 (cupy.ndarray) – Running sum of squared differences

  • new_value (cupy.ndarray) – New values to incorporate

  • finite_threshold (float) – Maximum magnitude for inclusion

Returns:

Updated (count, mean, M2)

Return type:

tuple
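
The update step can be illustrated in plain Python. This is a minimal scalar sketch of Welford's recurrence, not the module's CuPy implementation. The helper name `welford_update` is hypothetical, and the rule that non-finite values and values whose magnitude exceeds `finite_threshold` are skipped is an assumption based on the parameter descriptions above.

```python
import math

def welford_update(count, mean, M2, new_value, finite_threshold=1e12):
    """One step of Welford's online algorithm for a scalar stream.

    Hypothetical helper: values that are non-finite or whose magnitude
    exceeds finite_threshold are skipped (an assumption).
    """
    if not math.isfinite(new_value) or abs(new_value) > finite_threshold:
        return count, mean, M2          # state unchanged for rejected values
    count += 1
    delta = new_value - mean            # distance from the old mean
    mean += delta / count               # updated running mean
    delta2 = new_value - mean           # distance from the new mean
    M2 += delta * delta2                # running sum of squared differences
    return count, mean, M2

# Feed a short stream; the infinite value is ignored.
state = (0, 0.0, 0.0)
for x in [1.0, 2.0, float("inf"), 3.0, 4.0]:
    state = welford_update(*state, x)
```

Because the state is only three numbers per tracked quantity, the same recurrence can be applied batch by batch without holding all values in memory.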

dire_rapids.metrics.welford_finalize_gpu(count, mean, M2)

Finalize Welford’s algorithm to compute mean and std.

Parameters:
  • count (cupy.ndarray) – Total count of valid values

  • mean (cupy.ndarray) – Computed mean

  • M2 (cupy.ndarray) – Sum of squared differences

Returns:

(mean, std)

Return type:

tuple
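
As a rough illustration of the finalize step, the standard deviation follows from the accumulated `M2` and `count`. The helper below is a hypothetical plain-Python sketch. The `ddof` argument is an assumption added for illustration, since this page does not say whether the module returns the population or the sample standard deviation.

```python
import math

def welford_finalize(count, mean, M2, ddof=0):
    """Turn Welford running statistics into (mean, std).

    Hypothetical helper. ddof=0 gives the population standard deviation,
    ddof=1 the sample standard deviation.
    """
    if count - ddof <= 0:
        return mean, float("nan")       # not enough samples for a variance
    return mean, math.sqrt(M2 / (count - ddof))

# Running statistics for the values 1, 2, 3, 4: count=4, mean=2.5, M2=5.0
mean, std = welford_finalize(4, 2.5, 5.0)
```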

dire_rapids.metrics.welford_gpu(data)

GPU-accelerated computation of mean and std.

Parameters:

data (cupy.ndarray or numpy.ndarray) – Input data

Returns:

(mean, std)

Return type:

tuple

dire_rapids.metrics.threshold_subsample_gpu(data, layout, labels=None, threshold=0.5, random_state=42)

GPU-accelerated Bernoulli subsampling of data.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like, optional) – Data labels

  • threshold (float) – Probability of keeping each sample (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

Returns:

Subsampled arrays

Return type:

tuple

Raises:

ValueError – If threshold is not between 0.0 and 1.0
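
Bernoulli subsampling keeps each row independently with probability `threshold`. The following is a minimal plain-Python sketch of that idea under stated assumptions: the helper name `threshold_subsample` is hypothetical, one shared mask is applied to all inputs so rows stay aligned, and `None` is returned in place of missing labels. The module's actual return layout may differ.

```python
import random

def threshold_subsample(data, layout, labels=None, threshold=0.5, random_state=42):
    """Keep each row independently with probability `threshold`.

    Hypothetical stand-in for illustration; not the GPU implementation.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    rng = random.Random(random_state)
    # One Bernoulli draw per row, shared by data, layout, and labels.
    mask = [rng.random() < threshold for _ in range(len(data))]

    def keep(rows):
        return [row for row, selected in zip(rows, mask) if selected]

    sub_labels = keep(labels) if labels is not None else None
    return keep(data), keep(layout), sub_labels

data = [[float(i), float(i) + 1.0] for i in range(10)]
layout = [[float(i)] for i in range(10)]
labels = list(range(10))
sub_data, sub_layout, sub_labels = threshold_subsample(data, layout, labels)
```

With `threshold=1.0` every row is kept and with `threshold=0.0` none is. The number of rows kept at intermediate values varies with the seed.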

dire_rapids.metrics.make_knn_graph_gpu(data, n_neighbors, batch_size=50000)

GPU-accelerated kNN graph construction using cuML.

Parameters:
  • data (array-like) – Data points (n_samples, n_features)

  • n_neighbors (int) – Number of nearest neighbors

  • batch_size (int, optional) – Batch size for querying neighbors (queries in batches against full dataset). Default 50000 balances GPU memory and performance. Set to None to query all at once.

Returns:

(distances, indices) arrays of shape (n_samples, n_neighbors+1)

Return type:

tuple
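
The extra column in the output shape presumably arises because each point is returned as its own nearest neighbor at distance 0. The brute-force sketch below illustrates that layout in plain Python. The helper name `knn_graph_bruteforce` is hypothetical, it does no batching, and it assumes distinct points so that the point itself always sorts first.

```python
import math

def knn_graph_bruteforce(points, n_neighbors):
    """Return (distances, indices), each with n_neighbors + 1 entries per row.

    Column 0 is the query point itself (distance 0), assuming distinct points.
    Hypothetical helper for illustration only.
    """
    distances, indices = [], []
    for p in points:
        # Rank all points by distance to p; ties are broken by index.
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points))
        nearest = ranked[: n_neighbors + 1]
        distances.append([d for d, _ in nearest])
        indices.append([j for _, j in nearest])
    return distances, indices

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
distances, indices = knn_graph_bruteforce(points, n_neighbors=2)
```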

dire_rapids.metrics.make_knn_graph_cpu(data, n_neighbors, batch_size=10000)

CPU fallback for kNN graph construction.

Parameters:
  • data (array-like) – Data points

  • n_neighbors (int) – Number of nearest neighbors

  • batch_size (int, optional) – Batch size for querying neighbors (queries in batches against full dataset). Default 10000 provides good balance between memory and performance. Set to None to query all at once.

Returns:

(distances, indices) arrays

Return type:

tuple

dire_rapids.metrics.compute_stress(data, layout, n_neighbors, eps=1e-06, use_gpu=True)

Compute normalized stress (distortion) of an embedding.

This metric measures how well distances are preserved between the high-dimensional data and low-dimensional layout.

Parameters:
  • data (array-like) – High-dimensional data (n_samples, n_features)

  • layout (array-like) – Low-dimensional embedding (n_samples, n_components)

  • n_neighbors (int) – Number of nearest neighbors to consider

  • eps (float) – Small constant to prevent division by zero

  • use_gpu (bool) – Whether to use GPU acceleration

Returns:

Normalized stress value

Return type:

float
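
The exact normalization used by `compute_stress` is not spelled out on this page. As a hedged illustration only, one common way to define a stress restricted to nearest-neighbor pairs is sketched below in plain Python. The helper name `knn_normalized_stress` and the formula in its docstring are assumptions, not the module's implementation.

```python
import math

def knn_normalized_stress(data, layout, n_neighbors, eps=1e-6):
    """Normalized stress over each point's k nearest neighbors.

    Assumed formula (not taken from the module):
        sqrt( sum (d_hd - d_ld)^2 / (sum d_hd^2 + eps) )
    where the sums run over kNN pairs found in the high-dimensional data.
    """
    num = den = 0.0
    for i, p in enumerate(data):
        # Brute-force neighbors of point i in the high-dimensional space.
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(data) if j != i)
        for d_hd, j in ranked[:n_neighbors]:
            d_ld = math.dist(layout[i], layout[j])
            num += (d_hd - d_ld) ** 2
            den += d_hd ** 2
    return math.sqrt(num / (den + eps))

data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
perfect = [(x, y) for x, y, _ in data]          # drops a constant axis
squashed = [(x, 0.0) for x, _, _ in data]       # collapses the y axis
stress_perfect = knn_normalized_stress(data, perfect, n_neighbors=2)
stress_squashed = knn_normalized_stress(data, squashed, n_neighbors=2)
```

Under this definition an embedding that preserves all neighbor distances scores close to zero, and collapsing an informative axis raises the score.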

dire_rapids.metrics.compute_neighbor_score(data, layout, n_neighbors, use_gpu=True)

Compute neighborhood preservation score.

Measures how well k-nearest neighbor relationships are preserved from high-dimensional to low-dimensional space.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • n_neighbors (int) – Number of neighbors to consider

  • use_gpu (bool) – Whether to use GPU acceleration

Returns:

[mean_score, std_score]

Return type:

list
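
One plausible reading of this score is the fraction of each point's k nearest neighbors that is shared between the two spaces, summarized by its mean and standard deviation. The plain-Python sketch below follows that reading. The helper names are hypothetical, and both the overlap definition and the use of the population standard deviation are assumptions.

```python
import math
import statistics

def knn_index_sets(points, n_neighbors):
    """Brute-force neighbor index sets, excluding the point itself."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append({j for _, j in ranked[:n_neighbors]})
    return result

def neighbor_score(data, layout, n_neighbors):
    """Return [mean, std] of the per-point fraction of shared neighbors."""
    hd = knn_index_sets(data, n_neighbors)
    ld = knn_index_sets(layout, n_neighbors)
    overlaps = [len(a & b) / n_neighbors for a, b in zip(hd, ld)]
    return [statistics.fmean(overlaps), statistics.pstdev(overlaps)]

data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
same = [(x,) for x, _ in data]      # 1-D layout with identical ordering
score_same = neighbor_score(data, same, n_neighbors=2)
```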

dire_rapids.metrics.compute_local_metrics(data, layout, n_neighbors, subsample_threshold=1.0, random_state=42, use_gpu=True)

Compute local quality metrics (stress and neighborhood preservation).

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • n_neighbors (int) – Number of neighbors for kNN graph

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0, default 1.0 = no subsampling)

  • random_state (int) – Random seed for subsampling

  • use_gpu (bool) – Whether to use GPU acceleration

Returns:

Dictionary containing ‘stress’ and ‘neighbor’ metrics

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_svm_accuracy(X, y, test_size=0.3, reg_param=1.0, max_iter=1000, random_state=42, use_gpu=True)

Compute SVM classification accuracy.

Parameters:
  • X (array-like) – Features

  • y (array-like) – Labels

  • test_size (float) – Test set proportion

  • reg_param (float) – Regularization parameter

  • max_iter (int) – Maximum iterations

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use cuML GPU acceleration

Returns:

Classification accuracy

Return type:

float

dire_rapids.metrics.compute_knn_accuracy(X, y, n_neighbors=16, test_size=0.3, random_state=42, use_gpu=True)

Compute kNN classification accuracy.

Parameters:
  • X (array-like) – Features

  • y (array-like) – Labels

  • n_neighbors (int) – Number of neighbors

  • test_size (float) – Test set proportion

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use cuML GPU acceleration

Returns:

Classification accuracy

Return type:

float

dire_rapids.metrics.compute_svm_score(data, layout, labels, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)

Compute SVM context preservation score.

Compares SVM classification accuracy on high-dimensional data vs low-dimensional embedding.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like) – Class labels

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • **kwargs (dict) – Additional parameters for SVM

Returns:

[acc_hd, acc_ld, log_ratio]

Return type:

ndarray

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0
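
The third entry of the result compares the two accuracies on a log scale. The sketch below shows one way such a triple could be assembled. It is an assumption, not the module's code: the helper name `context_score`, the direction of the ratio (low-dimensional over high-dimensional), and the `eps` guard are all illustrative choices.

```python
import math

def context_score(acc_hd, acc_ld, eps=1e-12):
    """Return [acc_hd, acc_ld, log_ratio] with log_ratio = log(acc_ld / acc_hd).

    Assumed definition for illustration; eps guards against a zero accuracy.
    """
    log_ratio = math.log((acc_ld + eps) / (acc_hd + eps))
    return [acc_hd, acc_ld, log_ratio]

# The embedding loses some class separability, so the log ratio is negative.
scores = context_score(acc_hd=0.90, acc_ld=0.81)
```

Under this convention a value near zero means the embedding supports classification about as well as the original data, and a negative value means accuracy dropped.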

dire_rapids.metrics.compute_knn_score(data, layout, labels, n_neighbors=16, subsample_threshold=0.5, random_state=42, use_gpu=True, **kwargs)

Compute kNN context preservation score.

Compares kNN classification accuracy on high-dimensional data vs low-dimensional embedding.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like) – Class labels

  • n_neighbors (int) – Number of neighbors for kNN

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • **kwargs (dict) – Additional parameters

Returns:

[acc_hd, acc_ld, log_ratio]

Return type:

ndarray

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_context_measures(data, layout, labels, subsample_threshold=0.5, n_neighbors=16, random_state=42, use_gpu=True, **kwargs)

Compute context preservation measures (SVM and kNN).

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • labels (array-like) – Class labels

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • n_neighbors (int) – Number of neighbors for kNN

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • **kwargs (dict) – Additional parameters

Returns:

Dictionary with ‘svm’ and ‘knn’ scores

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0

dire_rapids.metrics.compute_h0_h1_knn(data, k_neighbors=20, density_threshold=0.8, overlap_factor=1.5, use_gpu=True)

Compute H0/H1 Betti numbers using local kNN atlas approach.

Build dense local triangulations around each point, then merge consistently. This avoids the “holes” problem of global sparse kNN graphs.

Automatically selects between GPU and CPU implementation based on availability and use_gpu parameter.

Parameters:
  • data (array-like) – Point cloud data (n_samples, n_features)

  • k_neighbors (int) – Size of local neighborhood (default 20, recommended 15-20 for noisy data)

  • density_threshold (float) – Percentile threshold for edge inclusion (0-1). Lower = denser triangulation. Default 0.8 means edges up to 80th percentile of local distances are included.

  • overlap_factor (float) – Factor for expanding local neighborhoods to ensure overlap (default 1.5). Higher values create more dense, overlapping patches.

  • use_gpu (bool) – Whether to use GPU acceleration (if available)

Returns:

(beta_0, beta_1) – Betti numbers: β₀ (connected components), β₁ (loops)

Return type:

tuple
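
To make the two Betti numbers concrete, the sketch below computes them for a plain graph, that is, a complex with vertices and edges only. This is a simplification of the atlas approach described above: filled triangles, which would reduce β₁, are ignored. The helper name `graph_betti_numbers` is hypothetical.

```python
def graph_betti_numbers(n_vertices, edges):
    """Betti numbers of a graph viewed as a 1-dimensional complex.

    beta_0 = number of connected components (via union-find)
    beta_1 = E - V + beta_0 (number of independent cycles)
    """
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    # Drop self-loops and duplicate edges before counting.
    unique_edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    for a, b in unique_edges:
        parent[find(a)] = find(b)
    beta_0 = len({find(v) for v in range(n_vertices)})
    beta_1 = len(unique_edges) - n_vertices + beta_0
    return beta_0, beta_1

# A square (one loop) plus a separate segment: 2 components, 1 loop.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5)]
beta_0, beta_1 = graph_betti_numbers(6, edges)
```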

dire_rapids.metrics.compute_global_metrics(data, layout, subsample_threshold=0.5, random_state=42, n_steps=100, k_neighbors=20, density_threshold=0.8, overlap_factor=1.5, use_gpu=False, metrics_only=True)

Compute global topological metrics based on Betti curve comparison.

Computes Betti curves for high-dimensional data and low-dimensional embedding using the atlas approach, then compares them using fastDTW distance.

Parameters:
  • data (array-like) – High-dimensional data

  • layout (array-like) – Low-dimensional embedding

  • subsample_threshold (float) – Subsampling probability (must be between 0.0 and 1.0)

  • random_state (int) – Random seed

  • n_steps (int) – Number of points for Betti curves

  • k_neighbors (int) – Size of local neighborhood for atlas approach (default 20)

  • density_threshold (float) – Percentile threshold for edge inclusion (0-1, default 0.8)

  • overlap_factor (float) – Factor for expanding local neighborhoods (default 1.5)

  • use_gpu (bool) – Whether to use GPU acceleration

  • metrics_only (bool) – If True, return only metrics; otherwise include betti curves

Returns:

Dictionary containing DTW distances for β₀ and β₁ curves

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0, or if fastdtw is not available
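
The curve comparison can be illustrated with a small dynamic time warping routine. Note the substitution: the sketch below is the exact quadratic-time DTW recurrence in plain Python, not the fastDTW approximation the function relies on. The helper name `dtw_distance` and the sample Betti curves are made up for illustration.

```python
def dtw_distance(a, b):
    """Exact dynamic time warping distance with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

# Hypothetical beta_0 curves sampled at the same five filtration steps.
betti0_hd = [5, 3, 2, 1, 1]
betti0_ld = [5, 4, 2, 1, 1]
d_same = dtw_distance(betti0_hd, betti0_hd)
d_diff = dtw_distance(betti0_hd, betti0_ld)
```

A distance of zero means the two curves can be aligned perfectly, and larger values indicate that the embedding's topology evolves differently across the filtration.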

dire_rapids.metrics.evaluate_embedding(data, layout, labels=None, n_neighbors=16, subsample_threshold=0.5, random_state=42, use_gpu=True, compute_distortion=True, compute_context=True, compute_topology=True, **kwargs)

Comprehensive evaluation of a dimensionality reduction embedding.

Computes distortion, context preservation, and topological metrics.

Parameters:
  • data (array-like) – High-dimensional data (n_samples, n_features)

  • layout (array-like) – Low-dimensional embedding (n_samples, n_components)

  • labels (array-like, optional) – Class labels for context metrics

  • n_neighbors (int) – Number of neighbors for kNN metrics

  • subsample_threshold (float) – Subsampling probability for all metrics (must be between 0.0 and 1.0, default 0.5)

  • random_state (int) – Random seed

  • use_gpu (bool) – Whether to use GPU acceleration

  • compute_distortion (bool) – Whether to compute distortion metrics (default True)

  • compute_context (bool) – Whether to compute context metrics (default True)

  • compute_topology (bool) – Whether to compute topological metrics (default True)

  • **kwargs (dict) – Additional parameters for specific metrics

Returns:

Dictionary with all computed metrics

Return type:

dict

Raises:

ValueError – If subsample_threshold is not between 0.0 and 1.0