Tags: haystack.deepset.ai, MongoDB, Python, Ollama
This project demonstrates how to set up a web crawl data processing pipeline using MongoDB, Haystack, and Ollama. The primary purpose is to fetch documents from a MongoDB replica set, process them, and retrieve information based on specific queries using the Haystack framework.
- Python 3.8+
- MongoDB
- Haystack
- Ollama
main.py: The main script that sets up and runs the data processing pipeline.
- Clone the Repository
    git clone https://github.com/psenger/haystack-needle
    cd haystack-needle
- Install Dependencies
    pip install -r requirements.txt
- Configure MongoDB Connection
  Make sure your MongoDB replica set is running and accessible. The URI in the script is set to:
    uri = "mongodb://mongo-1:27017,mongo-2:27117,mongo-3:27217/web-crawl"
  Modify this as per your MongoDB setup.
- Run the Script
  Execute the main.py script to fetch documents from MongoDB, process them with Haystack, and run queries:
    python main.py
The script connects to a MongoDB replica set without authentication. Ensure your MongoDB instance is configured correctly.
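The walkthrough further down passes a variable named uri_with_options to MongoClient, but this README does not show how it is built. The following is a minimal sketch of one way to append connection-string options to the base URI. The helper name with_options and the replica set name rs0 are assumptions for illustration; replicaSet and readPreference are standard MongoDB connection-string options, but the values your deployment needs may differ.

```python
from urllib.parse import urlencode


def with_options(uri: str, **options: str) -> str:
    """Append connection-string options to a MongoDB URI (hypothetical helper)."""
    if not options:
        return uri
    # Use "&" if the URI already carries a query string, otherwise start one.
    separator = "&" if "?" in uri else "?"
    return f"{uri}{separator}{urlencode(options)}"


uri = "mongodb://mongo-1:27017,mongo-2:27117,mongo-3:27217/web-crawl"

# "rs0" is an assumed replica set name; replace it with your own.
uri_with_options = with_options(uri, replicaSet="rs0", readPreference="primaryPreferred")
print(uri_with_options)
```

No credentials are added here, which matches the unauthenticated setup described above.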
Haystack is used for setting up the document store and the pipeline. Key components include:
- InMemoryDocumentStore: Stores documents in memory for retrieval.
- InMemoryBM25Retriever: Ranks and retrieves relevant documents using the BM25 algorithm.
- PromptBuilder: Builds the prompt for the query.
- OllamaGenerator: Generates responses using the Ollama model.
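To make the retriever's role more concrete, here is a small pure-Python sketch of BM25 scoring. It is not Haystack's implementation: the function name bm25_scores, the whitespace tokenizer, and the parameter values k1=1.5 and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with a basic Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency is damped and normalized by document length.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "mongodb replica set configuration",
    "ollama runs llama3 locally",
    "haystack pipeline with mongodb documents",
]
scores = bm25_scores("ollama llama3", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no terms with the query score zero, so only the matching documents would be passed on to the prompt.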
Ollama is used to generate responses based on the context and query provided. The model is specified as llama3 and runs on a local server.
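For readers who want to see what a call to the local server might look like without Haystack in between, the sketch below builds a request for Ollama's /api/generate endpoint using only the standard library. The helper names build_generate_request and generate are hypothetical, and the code assumes Ollama is listening on its default port with the llama3 model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate with a prompt only works while the Ollama server is running; build_generate_request can be inspected offline to check the payload.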
- MongoDB Connection
  Connects to MongoDB and fetches documents:
    client = MongoClient(uri_with_options)
    db = client['web-crawl']
    collection = db['pages']
- Document Processing
  Processes and converts MongoDB documents into Haystack Document objects:
    documents = [
        Document(
            id=doc['_id'],
            content=doc.get('content', ''),
            meta={
                'title': doc.get('title', ''),
                'url': doc.get('url', ''),
                'ldJsonScripts': [safe_json_loads(script) for script in doc.get('ldJsonScripts', [])],
                'imageUrls': doc.get('imageUrls', []),
                'pageHrefs': doc.get('pageHrefs', []),
                'linkTags': doc.get('linkTags', []),
                'metaTags': doc.get('metaTags', [])
            }
        )
        for doc in collection.find({})
    ]
- Pipeline Configuration
  Sets up the Haystack pipeline with retriever, prompt builder, and Ollama generator:
    pipe = Pipeline()
    pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
    pipe.add_component("prompt_builder", PromptBuilder(template=template))
    pipe.add_component("llm", OllamaGenerator(
        model="llama3",
        url="http://localhost:11434/api/generate",
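The Document Processing step above calls safe_json_loads on each ld+json script, but the helper itself is not shown in this README. A plausible sketch follows, assuming the intent is to parse each script and fall back to None when the crawled content is not valid JSON. The project's actual helper may behave differently.

```python
import json
from typing import Any, Optional


def safe_json_loads(raw: Any) -> Optional[Any]:
    """Parse a JSON string, returning None instead of raising on bad input."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        # TypeError covers non-string input such as None;
        # ValueError covers malformed JSON (json.JSONDecodeError is a subclass).
        return None


scripts = ['{"@type": "Article", "headline": "Hello"}', "{not valid json", None]
parsed = [safe_json_loads(script) for script in scripts]
print(parsed)
```

Returning None keeps one malformed script from aborting the whole import; downstream code then has to tolerate None entries in the ldJsonScripts list.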
This project is licensed under the Apache License Version 2.0. See the LICENSE file for more details.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
For inquiries or support, please contact the author on LinkedIn.