This project, "Approximate Vector Set Search: A Bio-Inspired Approach for High-Dimensional Spaces," presents a novel method inspired by biological processes to address the challenges of searching in high-dimensional vector spaces. The approach leverages advanced computational techniques to improve the efficiency and accuracy of approximate vector set searches, a critical task in various domains such as bioinformatics, machine learning, and data mining.
CS vector sample dataset (Link, Access Code: Wzsb). Additional data available from public sources.
The repository is organized into several key directories, each containing specific components of the project:
main.py
andutils.py
: Core scripts for running the main algorithm and utility functions.ComparativeExperiment
: Contains scripts and modules for comparative experiments to evaluate the proposed method against existing approaches._NaiveBioVSS
: Includes a naive implementation of the BioVSS algorithm and related scripts.C++code
: Houses C++ implementations and CUDA code for performance-critical components.data
: Contains benchmark datasets used for testing and evaluation.logs
: Logging utilities for tracking experiments and results.src
: Core source code including data loading, filtering, and refinement modules.
main.py
: Entry point for executing the approximate vector set search algorithm.utils.py
: Utility functions supporting the main algorithm.
ComparativeExperiment/CompareBase.py
: Base class for comparative experiments.ComparativeExperiment/main_c.py
: Script to execute comparative experiments.ComparativeExperiment/CompareMethod
: Directory containing various comparison methods such asBruceExactHausdorff
,IndexHNSWMean
,IndexIVFFlatMean
, etc.ComparativeExperiment/DataLoader
: Data loading utilities for comparative experiments.ComparativeExperiment/logs/Logger.py
: Logging utilities for experiments.
_NaiveBioVSS/hausdorff_distance_naive_lsh.cpp
: C++ implementation of naive Hausdorff distance calculation using LSH._NaiveBioVSS/main.py
: Main script for running the naive implementation._NaiveBioVSS/setup.py
: Setup script for the naive implementation._NaiveBioVSS/test.py
: Testing script for the naive implementation.
C++code/cudaDenseHausdorff
: Directory containing CUDA implementations for dense Hausdorff distance calculations.C++code/cudaDenseHausdorff/haudorff
: Subdirectory for basic Hausdorff distance calculations.C++code/cudaDenseHausdorff/haudorff_lsh
: Subdirectory for LSH-based Hausdorff distance calculations.C++code/cudaDenseHausdorff/src
: Source code for additional implementations.
data/benchmark/CS/test
: Test datasets for benchmarking.logs/Logger.py
: Logging utilities for the main algorithm.
src/Pipeline.py
: Pipeline for executing the vector set search.src/DataLoader
: Data loading modules for the main algorithm.src/Filter
: Filtering modules includingBloomGraph
,IndexCount
, andOverlapVector
.src/Refinement
: Refinement modules includingExactHausdorff
andParallelExactHausdorff
.
- Python 3.10
- Required Python libraries (specified in
requirements.txt
)
First download all the data and compile the C++ code. second, complete all data preparation using BioHash.
- Clone the repository:
git clone https://github.com/CS-BruceChen/biovss.git
- Navigate to the project directory:
cd BioVSS
To execute the main algorithm, run:
python main.py
Detailed instructions for running comparative experiments can be found in the ComparativeExperiment
directory.
Contributions are welcome. Please submit a pull request or contact the authors for major changes.
This project is licensed under the MIT License. See the LICENSE
file for details.
If you use our code for research work, please cite our paper as below:
@inproceedings{li2025approximate,
title={Approximate Vector Set Search: A Bio-Inspired Approach for High-Dimensional Spaces},
author={Li, Yiqi and Wang, Sheng and Chen, Zhiyu and Chen, Shangfeng and Peng, Zhiyong},
booktitle={Proceedings of the 41th International Conference on Data Engineering},
year={2025},
publisher={IEEE},
organization={IEEE}
}