
Commit

Python Bindings: Improved unit tests, documentation and unification of API (nomic-ai#1090)

* Makefiles, black, isort

* Black and isort

* unit tests and generation method

* chat context provider

* context does not reset

* Current state

* Fixup

* Python bindings with unit tests

* GPT4All Python Bindings: chat contexts, tests

* New python bindings and backend fixes

* Black and Isort

* Documentation error

* preserved n_predict for backwards compat with langchain

---------

Co-authored-by: Adam Treat <[email protected]>
AndriyMulyar and manyoso authored Jun 30, 2023
1 parent 40a3fae commit 46a0762
Showing 15 changed files with 442 additions and 412 deletions.
3 changes: 3 additions & 0 deletions gpt4all-backend/llmodel_c.cpp
@@ -128,6 +128,9 @@ void llmodel_prompt(llmodel_model model, const char *prompt,
std::function<bool(bool)> recalc_func =
std::bind(&recalculate_wrapper, std::placeholders::_1, reinterpret_cast<void*>(recalculate_callback));

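// Trim the cached token history when the caller has rewound n_past (for example when a new
// chat session starts), so stale context is not reused on the next prompt.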
if (size_t(ctx->n_past) < wrapper->promptContext.tokens.size())
wrapper->promptContext.tokens.resize(ctx->n_past);

// Copy the C prompt context
wrapper->promptContext.n_past = ctx->n_past;
wrapper->promptContext.n_ctx = ctx->n_ctx;
7 changes: 7 additions & 0 deletions gpt4all-bindings/python/.isort.cfg
@@ -0,0 +1,7 @@
[settings]
known_third_party=geopy,nltk,np,numpy,pandas,pysbd,fire,torch

line_length=120
include_trailing_comma=True
multi_line_output=3
use_parentheses=True
2 changes: 1 addition & 1 deletion gpt4all-bindings/python/docs/gpt4all_chat.md
@@ -5,7 +5,7 @@ The [GPT4All Chat Client](https://gpt4all.io) lets you easily interact with any
It is optimized to run 7-13B parameter LLMs on the CPUs of any computer running OSX/Windows/Linux.

## Running LLMs on CPU
The GPT4All Chat UI supports models from all newer versions of `GGML`, `llama.cpp` including the `LLaMA`, `MPT`, `replit` and `GPT-J` architectures. The `falcon` architecture will soon also be supported.
The GPT4All Chat UI supports models from all newer versions of `GGML` and `llama.cpp`, including the `LLaMA`, `MPT`, `replit`, `GPT-J` and `falcon` architectures.

GPT4All maintains an official list of recommended models located in [models.json](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models.json). You can pull request new models to it and if accepted they will show up in the official download dialog.

97 changes: 78 additions & 19 deletions gpt4all-bindings/python/docs/gpt4all_python.md
@@ -1,6 +1,6 @@
# GPT4All Python API
The `GPT4All` package provides Python bindings and an API to our C/C++ model backend libraries.
The source code, README, and local build instructions can be found [here](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python).
The `GPT4All` Python package provides bindings to our C/C++ model backend libraries.
The source code and local build instructions can be found [here](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python).


## Quickstart
@@ -9,29 +9,88 @@ The source code, README, and local build instructions can be found [here](https:
pip install gpt4all
```

In Python, run the following commands to retrieve a GPT4All model and generate a response
to a prompt.
=== "GPT4All Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin")
output = model.generate("The capital of France is ", max_tokens=3)
print(output)
```
=== "Output"
```
1. Paris
```

**Download Note:**
By default, models are stored in `~/.cache/gpt4all/` (you can change this with `model_path`). If the file already exists, model download will be skipped.
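For example, to keep models in a custom directory instead (the directory below is only a placeholder):

```python
from gpt4all import GPT4All

# Download to / load from a custom directory rather than ~/.cache/gpt4all/
model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin", model_path="/path/to/my/models")
```
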
### Chatting with GPT4All
Local LLMs can be optimized for chat conversations by reusing previous computational history.

Use the GPT4All `chat_session` context manager to hold chat conversations with the model.

=== "GPT4All Example"
``` py
model = GPT4All(model_name='orca-mini-3b.ggmlv3.q4_0.bin')
with model.chat_session():
    response = model.generate(prompt='hello', top_k=1)
    response = model.generate(prompt='write me a short poem', top_k=1)
    response = model.generate(prompt='thank you', top_k=1)
    print(model.current_chat_session)
```
=== "Output"
``` json
[
{
'role': 'user',
'content': 'hello'
},
{
'role': 'assistant',
'content': 'What is your name?'
},
{
'role': 'user',
'content': 'write me a short poem'
},
{
'role': 'assistant',
'content': "I would love to help you with that! Here's a short poem I came up with:\nBeneath the autumn leaves,\nThe wind whispers through the trees.\nA gentle breeze, so at ease,\nAs if it were born to play.\nAnd as the sun sets in the sky,\nThe world around us grows still."
},
{
'role': 'user',
'content': 'thank you'
},
{
'role': 'assistant',
'content': "You're welcome! I hope this poem was helpful or inspiring for you. Let me know if there is anything else I can assist you with."
}
]
```

When using GPT4All models in the chat_session context:

- The model is given a prompt template which makes it chatty.
- Internal K/V caches are preserved from previous conversation history, speeding up inference.

```python
import gpt4all
gptj = gpt4all.GPT4All("ggml-gpt4all-j-v1.3-groovy")
messages = [{"role": "user", "content": "Name 3 colors"}]
gptj.chat_completion(messages)
```

## Give it a try!
[Google Colab Tutorial](https://colab.research.google.com/drive/1QRFHV5lj1Kb7_tGZZGZ-E6BfX6izpeMI?usp=sharing)
### Generation Parameters

## Supported Models
Python bindings support the following ggml architectures: `gptj`, `llama`, `mpt`. See API reference for more details.
::: gpt4all.gpt4all.GPT4All.generate

## Best Practices

There are two methods to interface with the underlying language model, `chat_completion()` and `generate()`. Chat completion formats a user-provided message dictionary into a prompt template (see API documentation for more details and options). This will usually produce much better results and is the approach we recommend. You may also prompt the model with `generate()` which will just pass the raw input string to the model.
### Streaming Generations
To interact with GPT4All responses as the model generates, use the `streaming = True` flag during generation.

## API Reference
=== "GPT4All Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin")
tokens = []
for token in model.generate("The capital of France is", max_tokens=20, streaming=True):
    tokens.append(token)
print(tokens)
```
=== "Output"
```
[' Paris', ' is', ' a', ' city', ' that', ' has', ' been', ' a', ' major', ' cultural', ' and', ' economic', ' center', ' for', ' over', ' ', '2', ',', '0', '0']
```

::: gpt4all.gpt4all.GPT4All
35 changes: 24 additions & 11 deletions gpt4all-bindings/python/docs/index.md
@@ -6,6 +6,19 @@ Nomic AI oversees contributions to the open-source ecosystem ensuring quality, s

GPT4All software is optimized to run inference of 7-13 billion parameter large language models on the CPUs of laptops, desktops and servers.

=== "GPT4All Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin")
output = model.generate("The capital of France is ", max_tokens=3)
print(output)
```
=== "Output"
```
1. Paris
```
See [Python Bindings](gpt4all_python.md) to use GPT4All.

### Navigating the Documentation
In an effort to ensure cross-operating system and cross-language compatibility, the [GPT4All software ecosystem](https://github.com/nomic-ai/gpt4all)
is organized as a monorepo with the following structure:
@@ -18,31 +31,31 @@ This C API is then bound to any higher level programming language such as C++, P

Explore detailed documentation for the backend, bindings and chat client in the sidebar.
## Models
GPT4All models are artifacts produced through a process known as neural network quantization.
A multi-billion parameter transformer decoder usually takes 30+ GB of VRAM to execute a forward pass.
Most people do not have such a powerful computer or access to GPU hardware. By running trained LLMs through quantization algorithms,
GPT4All models can run on your laptop using only 4-8GB of RAM enabling their wide-spread utility.
The GPT4All software ecosystem is compatible with the following Transformer architectures:

The GPT4All software ecosystem is currently compatible with three variants of the Transformer neural network architecture:
- `Falcon`
- `LLaMA` (including `OpenLLaMA`)
- `MPT` (including `Replit`)
- `GPTJ`

- LLaMa
You can find an exhaustive list of supported models on the [website](https://gpt4all.io) or in the [models directory](https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models.json).

- GPT-J

- MPT
GPT4All models are artifacts produced through a process known as neural network quantization.
A multi-billion parameter transformer decoder usually takes 30+ GB of VRAM to execute a forward pass.
Most people do not have such a powerful computer or access to GPU hardware. By running trained LLMs through quantization algorithms,
GPT4All models can run on your laptop using only 4-8 GB of RAM, enabling their widespread use.
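
As a back-of-the-envelope illustration of why quantization matters (simple arithmetic, not a measurement):

```python
# Approximate memory footprint of a 7B-parameter decoder
n_params = 7e9
fp32_gb = n_params * 4 / 1024**3   # 4 bytes per weight in float32
q4_gb = n_params * 0.5 / 1024**3   # ~4 bits (0.5 bytes) per weight after 4-bit quantization
print(f"float32: ~{fp32_gb:.0f} GB, 4-bit quantized: ~{q4_gb:.1f} GB")
# float32: ~26 GB, 4-bit quantized: ~3.3 GB
```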

Any model trained with one of these architectures can be quantized and run locally with all GPT4All bindings and in the
chat client. You can add new variants by contributing to the gpt4all-backend.

You can find an exhaustive list of pre-quantized models on the [website](https://gpt4all.io) or in the download pane of the chat client.

## Frequently Asked Questions
Find answers to frequently asked questions by searching the [Github issues](https://github.com/nomic-ai/gpt4all/issues) or in the [documentation FAQ](gpt4all_faq.md).

## Getting the most of your local LLM

**Inference Speed**
Inference speed of a local LLM depends on two factors: model size and the number of tokens given as input.
of a local LLM depends on two factors: model size and the number of tokens given as input.
It is not advised to prompt local LLMs with large chunks of context as their inference speed will heavily degrade.
You will likely want to run GPT4All models on GPU if you would like to utilize context windows larger than 750 tokens. Native GPU support for GPT4All models is planned.
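
One crude way to keep prompts small before calling `generate` (a sketch that uses word count as a rough stand-in for tokens, not the bindings' tokenizer):

```python
def truncate_prompt(text: str, max_words: int = 500) -> str:
    """Keep only the last max_words words; word count is only a crude proxy for tokens."""
    words = text.split()
    return " ".join(words[-max_words:])

very_long_text = " ".join(["word"] * 5000)   # stand-in for a large chunk of context
short_prompt = truncate_prompt(very_long_text)
# short_prompt can then be passed to model.generate(...)
```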

4 changes: 2 additions & 2 deletions gpt4all-bindings/python/gpt4all/__init__.py
@@ -1,2 +1,2 @@
from .pyllmodel import LLModel # noqa
from .gpt4all import GPT4All # noqa
from .gpt4all import GPT4All # noqa
from .pyllmodel import LLModel # noqa
