latent-scope

A lens formed by the embeddings of a model, illuminated by data points and housed in an interactive web interface.

Repository overview

This repository is currently meant to run locally, as it has several pieces that use the file system to coordinate functionality.

data

The data directory is where you put your datasets, and where the scripts and app store the output of their processes along with the associated metadata. The web app expects this folder to follow a specific layout, described under Directory structure below.

client

A React app that provides the interface for operating the scope and running the various scripts.

cd client
npm install
npm run dev

Python setup

The notebooks, scripts, and python_server directories below depend on a Python virtual environment:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

notebooks

scripts

Python scripts that can be run via CLI or via the web interface (through the server). The scripts assume a certain directory structure in the data folder.
See the Scripts section below for detailed instructions on using them.

python_server

A Python server that provides access to the data, as well as on-demand nearest neighbor search and simple queries into larger datasets.

cd python_server
python server.py
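
To illustrate what on-demand nearest neighbor search over embeddings involves, here is a minimal brute-force sketch using cosine similarity. It is not taken from server.py, whose actual method is not described here; the names `cosine_similarity` and `nearest_neighbors` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=3):
    """Return the row indices of the k embeddings most similar to query, best first."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    # Sort by descending similarity; break ties by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


# Toy 2D "embeddings" (assumption: real ones have hundreds of dimensions).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors([1.0, 0.0], embeddings, k=2))
```

A brute-force scan like this is linear in the number of rows, which is usually fine for small datasets; larger ones typically call for an approximate index.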

Directory structure

Each dataset in data has its own directory:

├── data/
|   ├── dataset1/
|   |   ├── input.parquet                   # you provide this file
|   |   ├── umaps/
|   |   |   ├── umap-001.parquet                # from umapper.py, x,y coordinates
|   |   |   ├── umap-001.json                   # from umapper.py, params used
|   |   |   ├── umap-001.png                    # from umapper.py, thumbnail of plot
|   |   |   ├── umap-002....                    # subsequent runs increment
|   |   ├── clusters/
|   |   |   ├── clusters-umap-001-001.parquet   # from clusters.py, cluster labels
|   |   |   ├── clusters-umap-001-001.json      # from clusters.py, params used
|   |   |   ├── clusters-umap-001-001.png       # from clusters.py, thumbnail of plot
|   |   |   ├── clusters-umap-001-...           # subsequent runs increment
|   |   ├── tags/
|   |   |   ├── ❤️.indices                       # tagged by UI, powered by server.py
|   |   |   ├── ...                             # can have arbitrary named tags
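
The zero-padded run numbers in the tree above (umap-001, umap-002, ...) suggest that each script picks the next free number when it writes its output. A minimal sketch of that naming scheme follows; the function `next_run_name` is hypothetical, and it assumes the `.json` params file is what marks a run as existing.

```python
import re
from pathlib import Path


def next_run_name(directory, prefix):
    """Return the next zero-padded run name, e.g. 'umap-003' if 001 and 002 exist."""
    directory = Path(directory)
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d{{3}})\.json$")
    numbers = []
    if directory.is_dir():
        for path in directory.iterdir():
            match = pattern.match(path.name)
            if match:
                numbers.append(int(match.group(1)))
    # Start at 001 when no earlier run is found.
    return f"{prefix}-{max(numbers, default=0) + 1:03d}"
```

For example, `next_run_name("data/dataset1/umaps", "umap")` would give the name for a new UMAP run, and `next_run_name("data/dataset1/clusters", "clusters-umap-001")` the name for a new clustering of umap-001.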

Scripts

The scripts should be run in order once you have an input.parquet file in your dataset's folder.

csv2parquet.py

A simple utility to convert a csv file into a parquet file. It will write the output parquet file into the proper folder given by the dataset name.

# python csv2parquet.py <csv_file> <dataset_name>
python csv2parquet.py dadjokes.csv dadabase-curated
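
As a rough picture of what such a conversion involves, here is a minimal sketch using pandas. It is not the contents of csv2parquet.py; the function names and the `data_dir` default are assumptions, and writing parquet with pandas requires a parquet engine such as pyarrow.

```python
from pathlib import Path


def output_path(dataset_name, data_dir="data"):
    """Where the converted file goes: <data_dir>/<dataset_name>/input.parquet."""
    return Path(data_dir) / dataset_name / "input.parquet"


def csv_to_parquet(csv_file, dataset_name, data_dir="data"):
    """Read a CSV file and write it as input.parquet in the dataset's folder."""
    import pandas as pd  # imported lazily; needs pandas plus a parquet engine

    destination = output_path(dataset_name, data_dir)
    destination.parent.mkdir(parents=True, exist_ok=True)
    pd.read_csv(csv_file).to_parquet(destination)
    return destination
```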

1. embed.py

Embed the text from the given column of the input. By default it uses BAAI/bge-small-en-v1.5, run locally via HuggingFace transformers.

# python embed.py <dataset_name> <text_column>
python embed.py dadabase-curated joke
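
For readers unfamiliar with local embedding, the sketch below shows one common way to embed text with this model through transformers: take the first ([CLS]) token's hidden state and L2-normalize it, as in the model card's usage example. Whether embed.py does exactly this is an assumption; `batched`, `embed_texts`, and the batch size are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_texts(texts, model_name="BAAI/bge-small-en-v1.5", batch_size=32):
    """Return one L2-normalized embedding (list of floats) per input text."""
    import torch  # imported lazily; requires torch and transformers
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    embeddings = []
    with torch.no_grad():
        for batch in batched(texts, batch_size):
            encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
            hidden = model(**encoded).last_hidden_state
            cls = hidden[:, 0]  # hidden state of the first token of each text
            cls = torch.nn.functional.normalize(cls, p=2, dim=1)
            embeddings.extend(cls.tolist())
    return embeddings
```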

2. umapper.py

Map the embeddings from high-dimensional space to 2D with UMAP. It also generates a thumbnail of the scatterplot.

# python umapper.py <dataset_name> <neighbors> <min_dist>
python umapper.py dadabase-curated 50 0.075
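
The two positional arguments correspond naturally to UMAP's `n_neighbors` and `min_dist` parameters. A minimal sketch with the umap-learn package follows; it is not the contents of umapper.py, and the helper names are hypothetical. The parameter dictionary is the kind of thing that could be saved as the run's "params used" JSON.

```python
import json


def umap_params(neighbors, min_dist):
    """Turn CLI-style string arguments into keyword arguments for umap.UMAP."""
    return {"n_neighbors": int(neighbors), "min_dist": float(min_dist), "n_components": 2}


def run_umap(embeddings, neighbors, min_dist):
    """Project embeddings (n_samples x n_features) down to 2D coordinates."""
    import umap  # imported lazily; provided by the umap-learn package

    reducer = umap.UMAP(**umap_params(neighbors, min_dist))
    return reducer.fit_transform(embeddings)


# The parameters for the example command above, serialized as JSON.
print(json.dumps(umap_params("50", "0.075")))
```

Roughly speaking, a larger `n_neighbors` favors global structure, while a smaller `min_dist` lets points pack more tightly.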

3. clusters.py

Cluster the UMAP points using HDBSCAN. This assigns a cluster label to each point.

# python clusters.py <dataset_name> <umap_name> <samples>
python clusters.py dadabase-curated umap-005 5
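
A minimal sketch of clustering 2D points with the hdbscan package is shown below. It is not the script itself: how `<samples>` maps onto HDBSCAN's parameters is an assumption (here it sets both `min_cluster_size` and `min_samples`), and the function names are hypothetical. HDBSCAN labels points it considers noise as -1, so the size summary skips that label.

```python
from collections import Counter


def cluster_points(points, samples):
    """Return one integer cluster label per point; -1 marks noise."""
    import hdbscan  # imported lazily; provided by the hdbscan package

    # Assumption: the CLI's <samples> value drives both parameters.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=samples, min_samples=samples)
    return clusterer.fit_predict(points).tolist()


def cluster_sizes(labels):
    """Count points per cluster, ignoring the noise label (-1)."""
    return dict(Counter(label for label in labels if label != -1))
```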

Optional 1D scripts

The umap-1d.py and cluster-1d.py scripts create 1-dimensional UMAP projections and clusterings. This can be useful for ordering the data in a list.
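
As a small illustration of the ordering idea: once every row has a single UMAP coordinate, sorting by that coordinate places similar items next to each other. The helper name and the toy values below are hypothetical.

```python
def order_by_coordinate(coordinates):
    """Return row indices sorted by their 1D coordinate, smallest first."""
    return sorted(range(len(coordinates)), key=lambda i: coordinates[i])


# Toy example: three rows and their (made-up) 1D UMAP coordinates.
rows = ["joke c", "joke a", "joke b"]
coordinates = [2.5, -1.0, 0.3]
ordered_rows = [rows[i] for i in order_by_coordinate(coordinates)]
print(ordered_rows)
```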
