The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents. This repository contains the code used to create and update the Embeddings.
If you're looking to download the Embeddings, you may do so on Hugging Face.
To install the Creator, run the following commands:
git clone https://github.com/umarbutler/open-australian-legal-embeddings-creator.git
cd open-australian-legal-embeddings-creator
pip install .
To create or update the Embeddings, simply call mkoale from the directory in which the Open Australian Legal Corpus is located. By default, this will output the Embeddings to a folder named data in the current working directory.
The Creator's default behaviour may be modified by passing the following optional arguments to mkoale:
-i/--input: The path to the Open Australian Legal Corpus. Defaults to a file named corpus.jsonl in the current working directory.
-o/--output: The directory in which the Embeddings should be stored. Defaults to a folder named data in the current working directory.
-m/--model: The name of the Hugging Face Sentence Transformer embedding model to use. Defaults to BAAI/bge-small-en-v1.5.
-c/--chunk_size: The maximum number of tokens a chunk may contain. Defaults to 512.
-cb/--chunking_batch_size: The maximum number of documents that may be chunked at once. Defaults to 4096.
-em/--embedding_batch_size: The maximum number of chunks that may be embedded at once. Defaults to 32.
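If you want to call mkoale from a Python script, the sketch below shows one way to assemble the command from the options listed above. It is illustrative only: build_mkoale_command and the DEFAULTS dictionary are hypothetical helpers that are not part of the Creator, and the example path and batch size are made up. The flag names and default values are taken from the list above.

```python
import shlex
from pathlib import Path

# Default values copied from the option list above.
DEFAULTS = {
    "input": "corpus.jsonl",
    "output": "data",
    "model": "BAAI/bge-small-en-v1.5",
    "chunk_size": 512,
    "chunking_batch_size": 4096,
    "embedding_batch_size": 32,
}


def build_mkoale_command(**overrides):
    """Hypothetical helper: return an argument list for mkoale.

    Only options whose value differs from the documented default are passed,
    so calling it with no arguments yields a bare `mkoale` invocation.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")

    command = ["mkoale"]
    for name, default in DEFAULTS.items():
        value = overrides.get(name, default)
        # Compare as strings so that Path objects and plain strings match.
        if str(value) != str(default):
            command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    command = build_mkoale_command(
        input=Path("corpora") / "corpus.jsonl",  # hypothetical corpus location
        embedding_batch_size=64,                 # hypothetical override
    )
    # Print the command as it would be typed in a shell.
    print(shlex.join(command))
```

With the Creator installed, the resulting list could be passed to subprocess.run(command, check=True) to start the run. Building the arguments as a list rather than a single string avoids quoting problems when a path contains spaces.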
The Creator is licensed under the MIT License.