README changes and nits
mcantillon21 committed Aug 28, 2023
1 parent 3f2c64b commit 6dfb137
Showing 3 changed files with 39 additions and 25 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
.PHONY: start
start:
-	uvicorn main:app --reload --port 9000
+	uvicorn main:app --reload --port 8080

.PHONY: format
format:
51 changes: 33 additions & 18 deletions README.md
@@ -1,4 +1,4 @@
-# 🦜️🔗 ChatLangChain
+# 🦜️🔗 Chat LangChain

This repo is an implementation of a locally hosted chatbot specifically focused on question answering over the [LangChain documentation](https://langchain.readthedocs.io/en/latest/).
Built with [LangChain](https://github.com/hwchase17/langchain/) and [FastAPI](https://fastapi.tiangolo.com/).
@@ -7,35 +7,50 @@ The app leverages LangChain's streaming support and async API to update the page

## ✅ Running locally
1. Install dependencies: `pip install -r requirements.txt`
-1. Run `ingest.sh` to ingest LangChain docs data into the vectorstore (only needs to be done once).
+1. Run `python ingest.py` to ingest LangChain docs data into the Weaviate vectorstore (only needs to be done once).
1. You can use other [Document Loaders](https://langchain.readthedocs.io/en/latest/modules/document_loaders.html) to load your own data into the vectorstore.
-1. Run the app: `make start`
-1. To enable tracing, make sure `langchain-server` is running locally and pass `tracing=True` to `get_chain` in `main.py`. You can find more documentation [here](https://langchain.readthedocs.io/en/latest/tracing.html).
-1. Open [localhost:9000](http://localhost:9000) in your browser.
+1. Set the environment variables needed to configure the application:
+```
+export OPENAI_API_KEY=
+export WEAVIATE_URL=
+export WEAVIATE_API_KEY=
+# for tracing
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
+export LANGCHAIN_API_KEY=
+export LANGCHAIN_PROJECT=
+```
+1. Run the backend with `make start` and the frontend with `npm run dev` (from the `chat-langchain` directory).
+1. Open [localhost:3000](http://localhost:3000) in your browser.

## 🚀 Important Links

-Deployed version (to be updated soon): [chat.langchain.dev](https://chat.langchain.dev)
+Deployed version: [chat.langchain.com](https://chat.langchain.com)

-Hugging Face Space (to be updated soon): [huggingface.co/spaces/hwchase17/chat-langchain](https://huggingface.co/spaces/hwchase17/chat-langchain)
-
-Blog Posts:
-* [Initial Launch](https://blog.langchain.dev/langchain-chat/)
-* [Streaming Support](https://blog.langchain.dev/streaming-support-in-langchain/)

## 📚 Technical description

There are two components: ingestion and question-answering.

Ingestion has the following steps:

-1. Pull html from documentation site
-2. Load html with LangChain's [ReadTheDocs Loader](https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/readthedocs_documentation.html)
-3. Split documents with LangChain's [TextSplitter](https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html)
-4. Create a vectorstore of embeddings, using LangChain's [vectorstore wrapper](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html) (with OpenAI's embeddings and FAISS vectorstore).
+1. Pull HTML from the documentation site as well as the GitHub codebase
+2. Load HTML with LangChain's [RecursiveUrlLoader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url_loader)
+3. Transform HTML to text with [Html2TextTransformer](https://python.langchain.com/docs/integrations/document_transformers/html2text)
+4. Split documents with LangChain's [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
+5. Create a vectorstore of embeddings using LangChain's [Weaviate vectorstore wrapper](https://python.langchain.com/docs/integrations/vectorstores/weaviate) (with OpenAI's embeddings).
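
The splitting step can be illustrated with a small, standard-library sketch of the recursive idea behind `RecursiveCharacterTextSplitter`: try the coarsest separator first, then fall back to finer ones when a piece is still too large. This is a simplified illustration of the technique, not the library's actual implementation.

```python
# Simplified sketch of recursive character splitting.
# Split on the coarsest separator; recurse with finer separators
# on any piece that still exceeds the chunk size.

def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    if len(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else separators
    parts = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # Piece is still too big: recurse with a finer separator.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

sample = "LangChain docs. " * 20
chunks = recursive_split(sample, chunk_size=80)
print(len(chunks), max(len(c) for c in chunks))
```

Every chunk stays within the size limit, and content is kept in order; the library version additionally supports chunk overlap and length functions other than `len`.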

-Question-Answering has the following steps, all handled by [ChatVectorDBChain](https://langchain.readthedocs.io/en/latest/modules/indexes/chain_examples/chat_vector_db.html):
+Question-Answering has the following steps, all handled by [OpenAIFunctionsAgent](https://python.langchain.com/docs/modules/agents/agent_types/openai_functions_agent):

-1. Given the chat history and new user input, determine what a standalone question would be (using GPT-3).
+1. Given the chat history and new user input, determine what a standalone question would be (using GPT-3.5).
2. Given that standalone question, look up relevant documents from the vectorstore.
-3. Pass the standalone question and relevant documents to GPT-3 to generate a final answer.
+3. Pass the standalone question and relevant documents to GPT-4 to generate and stream the final answer.
+4. Generate a trace URL for the current chat session, as well as the endpoint to collect feedback.
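
A minimal sketch of this question-answering flow, with the LLM calls and the Weaviate lookup replaced by simple stubs so the example runs standalone. None of these helper names exist in the repo; they are hypothetical stand-ins for the steps above.

```python
# Hypothetical sketch of the QA flow: condense, retrieve, answer.
# LLM and vectorstore are stubbed out so the example is self-contained.

def condense_question(chat_history, question):
    # Stand-in for the LLM that rewrites the input as a standalone question.
    if not chat_history:
        return question
    prev_question = chat_history[-1][0]
    return f"{question} (in the context of: {prev_question})"

def retrieve(standalone_question, docs, k=2):
    # Stand-in for a vectorstore similarity search: naive keyword overlap.
    terms = set(standalone_question.lower().replace("?", "").split())
    scored = sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))
    return scored[:k]

def answer(standalone_question, relevant_docs):
    # Stand-in for the model generating (and streaming) the final answer.
    return f"Answer to '{standalone_question}' based on {len(relevant_docs)} docs."

docs = [
    "LangChain supports streaming responses.",
    "Weaviate is a vector database.",
    "FastAPI serves the backend.",
]
history = [("What is LangChain?", "A framework for building LLM apps.")]
q = condense_question(history, "Does it support streaming?")
print(answer(q, retrieve(q, docs)))
```

The real app swaps each stub for an LLM call or a Weaviate query, but the control flow is the same.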

+## Deprecated Links
+Hugging Face Space: [huggingface.co/spaces/hwchase17/chat-langchain](https://huggingface.co/spaces/hwchase17/chat-langchain)
+
+Blog Posts:
+* [Initial Launch](https://blog.langchain.dev/langchain-chat/)
+* [Streaming Support](https://blog.langchain.dev/streaming-support-in-langchain/)
11 changes: 5 additions & 6 deletions ingest.py
@@ -31,7 +31,6 @@ def ingest_repo():
parser=LanguageParser(language=Language.PYTHON, parser_threshold=500)
)
documents_repo = loader.load()
-len(documents_repo)

python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON,
chunk_size=2000,
@@ -55,10 +54,10 @@ def ingest_docs():
]

documents = []
-for url in urls:
-loader = RecursiveUrlLoader(url=url, max_depth=2 if url == urls[0] else 8, extractor=lambda x: Soup(x, "lxml").text, prevent_outside=True)
-temp_docs = loader.load()
-temp_docs = [doc for i, doc in enumerate(temp_docs) if doc not in temp_docs[:i]]
+for j, url in enumerate(urls):
+max_depth = 2 if j == 0 else 10
+loader = RecursiveUrlLoader(url=url, max_depth=max_depth, extractor=lambda x: Soup(x, "lxml").text, prevent_outside=True)
+temp_docs = loader.load()
documents += temp_docs
print("Loaded", len(temp_docs), "documents from", url)

@@ -91,7 +90,7 @@ def ingest_docs():
batch_size = 100 # to handle batch size limit
for i in range(0, len(docs_transformed), batch_size):
batch = docs_transformed[i:i+batch_size]
-Weaviate.from_documents(batch, embeddings, client=client, by_text=False, index_name="LangChain_newest_idx")
+Weaviate.add_documents(batch, embeddings, client=client, by_text=False, index_name="LangChain_newest_idx")

print("LangChain now has this many vectors", client.query.aggregate("LangChain_newest_idx").with_meta_count().do())
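
The batching loop shown in this hunk can be exercised in isolation. This sketch checks only the slicing logic with plain lists; the real code sends each batch to Weaviate, which is omitted here.

```python
# Standalone check of the batching pattern used in ingest.py:
# slice a document list into fixed-size batches before indexing.

docs_transformed = [f"doc-{i}" for i in range(257)]  # dummy documents
batch_size = 100  # to stay under a per-request limit

batches = []
for i in range(0, len(docs_transformed), batch_size):
    batch = docs_transformed[i:i + batch_size]
    batches.append(batch)  # the real code indexes the batch here

print([len(b) for b in batches])  # → [100, 100, 57]
```

Slicing past the end of a list is safe in Python, so the final partial batch needs no special casing.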
