A lens formed by the embeddings of a model, illuminated by data points and housed in an interactive web interface.
This repository is currently meant to run locally, as it has several pieces that use the file system to coordinate functionality.
The data directory is where you will put your datasets, and where the scripts and app will store the output of their processes along with the associated metadata. The web app will look at the contents of this folder using a specific directory structure.
The client directory holds a React app that provides the interface for operating the scope and running the various scripts.
cd client
npm install
npm run dev
The following directories depend on a Python virtual environment:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Python scripts that can be run via CLI or via the web interface (through the server). The scripts assume a certain directory structure in the data folder.
See below for more detailed instructions on using the scripts.
The python_server directory holds a Python server that provides access to the data, as well as on-demand nearest neighbor search and simple queries into larger datasets.
cd python_server
python server.py
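To make "nearest neighbor search" concrete, here is a minimal brute-force sketch over a small list of embedding vectors. It is only an illustration: the function names (`cosine_similarity`, `nearest_neighbors`) and the choice of cosine similarity are assumptions, and the server may well use a different metric or an indexed search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=3):
    """Return the indices of the k embeddings most similar to query, best first."""
    scored = [(cosine_similarity(query, emb), i) for i, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings" (made up for illustration; real ones have hundreds of dimensions).
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
print(nearest_neighbors([1.0, 0.05], embeddings, k=2))  # [0, 1]
```

A linear scan like this is fine for small datasets; larger ones usually call for a vectorized or approximate index.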
Each dataset in the data directory has its own subdirectory, laid out as follows:
├── data/
|   ├── dataset1/
|   |   ├── input.parquet                       # you provide this file
|   |   ├── umaps/
|   |   |   ├── umap-001.parquet                # from umap.py, x,y coordinates
|   |   |   ├── umap-001.json                   # from umap.py, params used
|   |   |   ├── umap-001.png                    # from umap.py, thumbnail of plot
|   |   |   ├── umap-002....                    # subsequent runs increment
|   |   ├── clusters/
|   |   |   ├── clusters-umap-001-001.parquet   # from clusters.py, cluster labels
|   |   |   ├── clusters-umap-001-001.json      # from clusters.py, params used
|   |   |   ├── clusters-umap-001-001.png       # from clusters.py, thumbnail of plot
|   |   |   ├── clusters-umap-001-...           # from clusters.py, thumbnail of plot
|   |   ├── tags/
|   |   |   ├── ❤️.indices                       # tagged by UI, powered by server.py
|   |   |   ├── ...                             # can have arbitrary named tags
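The "subsequent runs increment" convention can be sketched as a small helper that scans a directory for existing run files and returns the next zero-padded name. This is a hypothetical illustration, not the repository's code: the helper name `next_run_name` and the choice to treat the `.json` params file as the marker of a run are assumptions.

```python
import re
import tempfile
from pathlib import Path


def next_run_name(directory, prefix):
    """Return the next zero-padded run name, e.g. 'umap-003' after 'umap-002'."""
    # Assumption: each run leaves a '<prefix>-NNN.json' params file behind.
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d{{3}})\.json$")
    numbers = []
    for path in Path(directory).glob(f"{prefix}-*.json"):
        match = pattern.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    return f"{prefix}-{next_number:03d}"


# Demonstrate against a throwaway directory that mimics data/dataset1/umaps/.
with tempfile.TemporaryDirectory() as tmp:
    umaps = Path(tmp) / "data" / "dataset1" / "umaps"
    umaps.mkdir(parents=True)
    for name in ("umap-001.json", "umap-002.json"):
        (umaps / name).write_text("{}")
    print(next_run_name(umaps, "umap"))  # umap-003
```

The same helper would cover cluster runs by passing a prefix such as `clusters-umap-001`.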
The scripts should be run in order once you have an input.parquet file in your folder.
A simple utility to convert a csv file into a parquet file. It will write the output parquet file into the proper folder given by the dataset name.
# python csv2parquet.py <csv_file> <dataset_name>
python csv2parquet.py dadjokes.csv dadabase-curated
Take the text from the given column of the input and embed it. By default, this uses BAAI/bge-small-en-v1.5 locally via HuggingFace transformers.
# python embed.py <dataset_name> <text_column>
python embed.py dadabase-curated joke
Map the embeddings from high-dimensional space to 2D with UMAP. The script also generates a thumbnail of the scatterplot.
# python umapper.py <dataset_name> <neighbors> <min_dist>
python umapper.py dadabase-curated 50 0.075
Cluster the UMAP points using HDBSCAN. This assigns each point a cluster label.
# python cluster.py <dataset_name> <umap_name> <samples>
python cluster.py dadabase-curated umap-005 5
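Because the steps above must run in order, it can be convenient to drive them from one place. The sketch below builds the command lines from the usage patterns shown above and, by default, only prints them. The helper names (`pipeline_commands`, `run_pipeline`) are hypothetical, and the caller is assumed to know which UMAP run name (for example `umap-001`) the umapper step will produce.

```python
import subprocess
import sys


def pipeline_commands(dataset, text_column, neighbors, min_dist,
                      umap_name, samples, csv_file=None):
    """Build the argv lists for each step, in the order they should run."""
    steps = []
    if csv_file is not None:
        # Optional first step: only needed when starting from a CSV.
        steps.append(["csv2parquet.py", csv_file, dataset])
    steps.append(["embed.py", dataset, text_column])
    steps.append(["umapper.py", dataset, str(neighbors), str(min_dist)])
    steps.append(["cluster.py", dataset, umap_name, str(samples)])
    # sys.executable is the interpreter of the active virtual env.
    return [[sys.executable, *step] for step in steps]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


commands = pipeline_commands(
    "dadabase-curated", "joke", 50, 0.075, "umap-001", 5,
    csv_file="dadjokes.csv",
)
run_pipeline(commands)  # dry run: prints the four commands without executing them
```

Passing `dry_run=False` from inside the scripts directory would execute each step and stop at the first failure.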
There are also umap-1d.py and cluster-1d.py, which will create 1-dimensional umaps and clustering. This can be useful for ordering the data in a list.
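As a rough illustration of why a 1-dimensional projection helps with ordering: sorting rows by their single coordinate places items that are close in the projection next to each other in the list. The coordinates below are made up, and `order_by_1d` is a hypothetical helper rather than part of the scripts.

```python
def order_by_1d(coords):
    """Return row indices sorted by their 1-D coordinate (an argsort)."""
    return sorted(range(len(coords)), key=lambda i: coords[i])


jokes = ["joke A", "joke B", "joke C", "joke D"]
coords_1d = [0.82, -1.40, 0.79, 3.10]  # made-up 1-D UMAP values, one per row

ordered = [jokes[i] for i in order_by_1d(coords_1d)]
print(ordered)  # ['joke B', 'joke C', 'joke A', 'joke D']
```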