README changes and nits
mcantillon21 committed Aug 28, 2023
1 parent 3f2c64b commit 6dfb137
Showing 3 changed files with 39 additions and 25 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
.PHONY: start
start:
-	uvicorn main:app --reload --port 9000
+	uvicorn main:app --reload --port 8080

.PHONY: format
format:
51 changes: 33 additions & 18 deletions README.md
@@ -1,4 +1,4 @@
-# 🦜️🔗 ChatLangChain
+# 🦜️🔗 Chat LangChain

This repo is an implementation of a locally hosted chatbot specifically focused on question answering over the [LangChain documentation](https://langchain.readthedocs.io/en/latest/).
Built with [LangChain](https://github.com/hwchase17/langchain/) and [FastAPI](https://fastapi.tiangolo.com/).
@@ -7,35 +7,50 @@ The app leverages LangChain's streaming support and async API to update the page

## ✅ Running locally
1. Install dependencies: `pip install -r requirements.txt`
-1. Run `ingest.sh` to ingest LangChain docs data into the vectorstore (only needs to be done once).
+1. Run `python ingest.py` to ingest LangChain docs data into the Weaviate vectorstore (only needs to be done once).
1. You can use other [Document Loaders](https://langchain.readthedocs.io/en/latest/modules/document_loaders.html) to load your own data into the vectorstore.
-1. Run the app: `make start`
-1. To enable tracing, make sure `langchain-server` is running locally and pass `tracing=True` to `get_chain` in `main.py`. You can find more documentation [here](https://langchain.readthedocs.io/en/latest/tracing.html).
-1. Open [localhost:9000](http://localhost:9000) in your browser.
+1. Set the environment variables needed to configure the application:
+```
+export OPENAI_API_KEY=
+export WEAVIATE_URL=
+export WEAVIATE_API_KEY=
+# for tracing
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
+export LANGCHAIN_API_KEY=
+export LANGCHAIN_PROJECT=
+```
+1. Run the backend with `make start` and the frontend with `npm run dev` (from the `chat-langchain` directory).
+1. Open [localhost:3000](http://localhost:3000) in your browser.

## 🚀 Important Links

-Deployed version (to be updated soon): [chat.langchain.dev](https://chat.langchain.dev)
+Deployed version: [chat.langchain.com](https://chat.langchain.com)

-Hugging Face Space (to be updated soon): [huggingface.co/spaces/hwchase17/chat-langchain](https://huggingface.co/spaces/hwchase17/chat-langchain)
-
-Blog Posts:
-* [Initial Launch](https://blog.langchain.dev/langchain-chat/)
-* [Streaming Support](https://blog.langchain.dev/streaming-support-in-langchain/)

## 📚 Technical description

There are two components: ingestion and question-answering.

Ingestion has the following steps:

-1. Pull html from documentation site
-2. Load html with LangChain's [ReadTheDocs Loader](https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/readthedocs_documentation.html)
-3. Split documents with LangChain's [TextSplitter](https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html)
-4. Create a vectorstore of embeddings, using LangChain's [vectorstore wrapper](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html) (with OpenAI's embeddings and FAISS vectorstore).
+1. Pull HTML from the documentation site as well as the GitHub codebase
+2. Load HTML with LangChain's [RecursiveUrlLoader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url_loader)
+3. Transform HTML to text with [Html2TextTransformer](https://python.langchain.com/docs/integrations/document_transformers/html2text)
+4. Split documents with LangChain's [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
+5. Create a vectorstore of embeddings using LangChain's [Weaviate vectorstore wrapper](https://python.langchain.com/docs/integrations/vectorstores/weaviate) (with OpenAI's embeddings).
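
The splitting step can be illustrated with a small, standard-library sketch of the recursive idea behind `RecursiveCharacterTextSplitter`: try the coarsest separator first, then fall back to finer ones when a piece is still too large. This is a simplified illustration of the technique, not the library's actual implementation.

```python
# Simplified sketch of recursive character splitting.
# Split on the coarsest separator; recurse with finer separators
# on any piece that still exceeds the chunk size.

def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    if len(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else separators
    parts = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # Piece is still too big: recurse with a finer separator.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

sample = "LangChain docs. " * 20
chunks = recursive_split(sample, chunk_size=80)
print(len(chunks), max(len(c) for c in chunks))
```

Every chunk stays within the size limit, and content is kept in order; the library version additionally supports chunk overlap and length functions other than `len`.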

-Question-Answering has the following steps, all handled by [ChatVectorDBChain](https://langchain.readthedocs.io/en/latest/modules/indexes/chain_examples/chat_vector_db.html):
+Question-Answering has the following steps, all handled by [OpenAIFunctionsAgent](https://python.langchain.com/docs/modules/agents/agent_types/openai_functions_agent):

-1. Given the chat history and new user input, determine what a standalone question would be (using GPT-3).
+1. Given the chat history and new user input, determine what a standalone question would be (using GPT-3.5).
2. Given that standalone question, look up relevant documents from the vectorstore.
-3. Pass the standalone question and relevant documents to GPT-3 to generate a final answer.
+3. Pass the standalone question and relevant documents to GPT-4 to generate and stream the final answer.
+4. Generate a trace URL for the current chat session, as well as the endpoint to collect feedback.
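
A minimal sketch of this question-answering flow, with the LLM calls and the Weaviate lookup replaced by simple stubs so the example runs standalone. None of these helper names exist in the repo; they are hypothetical stand-ins for the steps above.

```python
# Hypothetical sketch of the QA flow: condense, retrieve, answer.
# LLM and vectorstore are stubbed out so the example is self-contained.

def condense_question(chat_history, question):
    # Stand-in for the LLM that rewrites the input as a standalone question.
    if not chat_history:
        return question
    prev_question = chat_history[-1][0]
    return f"{question} (in the context of: {prev_question})"

def retrieve(standalone_question, docs, k=2):
    # Stand-in for a vectorstore similarity search: naive keyword overlap.
    terms = set(standalone_question.lower().replace("?", "").split())
    scored = sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))
    return scored[:k]

def answer(standalone_question, relevant_docs):
    # Stand-in for the model generating (and streaming) the final answer.
    return f"Answer to '{standalone_question}' based on {len(relevant_docs)} docs."

docs = [
    "LangChain supports streaming responses.",
    "Weaviate is a vector database.",
    "FastAPI serves the backend.",
]
history = [("What is LangChain?", "A framework for building LLM apps.")]
q = condense_question(history, "Does it support streaming?")
print(answer(q, retrieve(q, docs)))
```

The real app swaps each stub for an LLM call or a Weaviate query, but the control flow is the same.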

+## Deprecated Links
+Hugging Face Space: [huggingface.co/spaces/hwchase17/chat-langchain](https://huggingface.co/spaces/hwchase17/chat-langchain)
+
+Blog Posts:
+* [Initial Launch](https://blog.langchain.dev/langchain-chat/)
+* [Streaming Support](https://blog.langchain.dev/streaming-support-in-langchain/)
11 changes: 5 additions & 6 deletions ingest.py
@@ -31,7 +31,6 @@ def ingest_repo():
parser=LanguageParser(language=Language.PYTHON, parser_threshold=500)
)
documents_repo = loader.load()
-len(documents_repo)

python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON,
chunk_size=2000,
@@ -55,10 +54,10 @@ def ingest_docs():
]

documents = []
-for url in urls:
-loader = RecursiveUrlLoader(url=url, max_depth=2 if url == urls[0] else 8, extractor=lambda x: Soup(x, "lxml").text, prevent_outside=True)
-temp_docs = loader.load()
-temp_docs = [doc for i, doc in enumerate(temp_docs) if doc not in temp_docs[:i]]
+for j, url in enumerate(urls):
+max_depth = 2 if j == 0 else 10
+loader = RecursiveUrlLoader(url=url, max_depth=max_depth, extractor=lambda x: Soup(x, "lxml").text, prevent_outside=True)
+temp_docs = loader.load()
documents += temp_docs
print("Loaded", len(temp_docs), "documents from", url)

@@ -91,7 +90,7 @@ def ingest_docs():
batch_size = 100 # to handle batch size limit
for i in range(0, len(docs_transformed), batch_size):
batch = docs_transformed[i:i+batch_size]
-Weaviate.from_documents(batch, embeddings, client=client, by_text=False, index_name="LangChain_newest_idx")
+Weaviate.add_documents(batch, embeddings, client=client, by_text=False, index_name="LangChain_newest_idx")

print("LangChain now has this many vectors", client.query.aggregate("LangChain_newest_idx").with_meta_count().do())
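
The batching loop shown in this hunk can be exercised in isolation. This sketch checks only the slicing logic with plain lists; the real code sends each batch to Weaviate, which is omitted here.

```python
# Standalone check of the batching pattern used in ingest.py:
# slice a document list into fixed-size batches before indexing.

docs_transformed = [f"doc-{i}" for i in range(257)]  # dummy documents
batch_size = 100  # to stay under a per-request limit

batches = []
for i in range(0, len(docs_transformed), batch_size):
    batch = docs_transformed[i:i + batch_size]
    batches.append(batch)  # the real code indexes the batch here

print([len(b) for b in batches])  # → [100, 100, 57]
```

Slicing past the end of a list is safe in Python, so the final partial batch needs no special casing.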
