A scientific instrument for investigating latent spaces

enjalot/latent-scope

latent-scope

A lens formed by the embeddings of a model, illuminated by data points and housed by an interactive web interface

Repository overview

This repository is currently meant to run locally, as it has several pieces that use the file system to coordinate functionality.

data

The data directory is where you will put your datasets, and where the scripts and app will store the output of their processes along with the associated metadata. The web app will look at the contents of this folder using a specific directory structure.

client

A React app that provides the interface for operating the scope and running the various scripts.

cd client
npm install
npm run dev

Python setup

The following directories depend on a Python virtual environment:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

notebooks

scripts

Python scripts that can be run via the CLI or via the web interface (through the server). The scripts assume the directory structure in the data folder that is described under "Directory structure".
See the "Scripts" section below for more detailed instructions on using them.

python_server

A Python server that provides access to the data, as well as on-demand nearest neighbor search and simple queries into larger datasets.

cd python_server
python server.py
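
For readers unfamiliar with nearest neighbor search, the idea can be sketched in a few lines. This README does not say how server.py implements it, so the brute-force cosine similarity below is purely an illustrative assumption, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, embeddings, k=3):
    """Return the indices of the k embeddings most similar to the query.

    Brute force: compares the query against every embedding. This is an
    assumption for illustration, not necessarily what server.py does.
    """
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]  # made-up vectors
    print(nearest_neighbors([1.0, 0.0], toy_embeddings, k=2))
```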

Directory structure

Each dataset in data has its own directory:

├── data/
|   ├── dataset1/
|   |   ├── input.parquet                   # you provide this file
|   |   ├── umaps/
|   |   |   ├── umap-001.parquet                # from umapper.py, x,y coordinates
|   |   |   ├── umap-001.json                   # from umapper.py, params used
|   |   |   ├── umap-001.png                    # from umapper.py, thumbnail of plot
|   |   |   ├── umap-002....                    # subsequent runs increment
|   |   ├── clusters/
|   |   |   ├── clusters-umap-001-001.parquet   # from clusters.py, cluster labels
|   |   |   ├── clusters-umap-001-001.json      # from clusters.py, params used
|   |   |   ├── clusters-umap-001-001.png       # from clusters.py, thumbnail of plot
|   |   |   ├── clusters-umap-001-...           # subsequent runs increment
|   |   ├── tags/
|   |   |   ├── ❤️.indices                       # tagged by UI, powered by server.py
|   |   |   ├── ...                             # can have arbitrary named tags
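
The run-numbering convention in the tree above (umap-001, umap-002, and so on) can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the helper name next_run_name is hypothetical, and the three-digit, zero-padded numbering is inferred from the example file names.

```python
import re
from pathlib import Path


def next_run_name(directory, prefix):
    """Return the next free run name, e.g. 'umap-003' (hypothetical helper).

    Assumes each run leaves a '<prefix>-NNN.json' params file behind,
    as in the directory tree above.
    """
    pattern = re.compile(re.escape(prefix) + r"-(\d{3})\.json$")
    numbers = []
    for path in Path(directory).glob(f"{prefix}-*.json"):
        match = pattern.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    # With no earlier runs, numbering starts at 001.
    return f"{prefix}-{max(numbers, default=0) + 1:03d}"


if __name__ == "__main__":
    print(next_run_name("data/dataset1/umaps", "umap"))
    # Cluster runs are numbered per UMAP run, so the prefix includes the umap name.
    print(next_run_name("data/dataset1/clusters", "clusters-umap-001"))
```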

Scripts

The numbered scripts below should be run in order once you have an input.parquet file in your dataset's folder.

csv2parquet.py

A simple utility to convert a csv file into a parquet file. It will write the output parquet file into the proper folder given by the dataset name.

# python csv2parquet.py <csv_file> <dataset_name>
python csv2parquet.py dadjokes.csv dadabase-curated

1. embed.py

Embed the text from the given column of the input. By default, this uses BAAI/bge-small-en-v1.5, run locally via HuggingFace transformers.

# python embed.py <dataset_name> <text_column>
python embed.py dadabase-curated joke

2. umapper.py

Map the embeddings from high-dimensional space to 2D with UMAP. The script also generates a thumbnail of the scatterplot.

# python umapper.py <dataset_name> <neighbors> <min_dist>
python umapper.py dadabase-curated 50 0.075 

3. clusters.py

Cluster the UMAP points using HDBSCAN. This assigns a cluster label to each point.

# python clusters.py <dataset_name> <umap_name> <samples>
python clusters.py dadabase-curated umap-005 5
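
The three numbered steps can also be chained from Python. The sketch below only assembles the command lines documented above and hands them to subprocess. The function names and the dry-run default are illustrative assumptions, the script names follow the section headings, and it is meant to be run from the scripts directory with the virtual env active. The umap name passed to the clustering step depends on which UMAP run you want to cluster.

```python
import subprocess
import sys


def pipeline_commands(dataset, text_column, neighbors, min_dist, umap_name, samples):
    """Build the command lines for the three steps, in the order they must run."""
    python = sys.executable  # interpreter of the currently active environment
    return [
        [python, "embed.py", dataset, text_column],
        [python, "umapper.py", dataset, str(neighbors), str(min_dist)],
        [python, "clusters.py", dataset, umap_name, str(samples)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default: prints the commands without executing them.
    run_pipeline(pipeline_commands("dadabase-curated", "joke", 50, 0.075, "umap-005", 5))
```

Printing the commands first makes it easy to check the arguments before committing to a long embedding run.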

Optional 1D scripts

The umap-1d.py and cluster-1d.py scripts create 1-dimensional UMAP projections and clusterings. This can be useful for ordering the data in a list.
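
To illustrate why a 1-dimensional projection helps with ordering: once every row has a single coordinate, sorting by that coordinate places rows that are close in the projection next to each other in a list. The sketch below uses made-up coordinates and a hypothetical helper, order_by_1d; it does not read the scripts' actual output files.

```python
def order_by_1d(coordinates):
    """Return row indices sorted by their 1D coordinate (hypothetical helper)."""
    return sorted(range(len(coordinates)), key=lambda i: coordinates[i])


if __name__ == "__main__":
    rows = ["row a", "row b", "row c", "row d"]
    xs = [0.82, -1.30, 0.79, 2.10]  # made-up 1D coordinates, one per row
    for i in order_by_1d(xs):
        print(f"{xs[i]:+.2f}  {rows[i]}")
```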

TODO: Higher-dimensional clustering