haystack-needle - MongoDB Web Crawl with Haystack and Ollama

haystack.deepset.ai, MongoDB, Python, Ollama

This project demonstrates how to set up a web crawl data processing pipeline using MongoDB, Haystack, and Ollama. The primary purpose is to fetch documents from a MongoDB replica set, process them, and retrieve information based on specific queries using the Haystack framework.

Requirements

Python 3.8+
MongoDB
Haystack
Ollama

Project Structure

main.py: The main script that sets up and runs the data processing pipeline.

Setup Instructions

Clone the Repository

```
git clone https://github.com/psenger/haystack-needle
cd haystack-needle
```

Install Dependencies

```
pip install -r requirements.txt
```

Configure MongoDB Connection

Make sure your MongoDB replica set is running and accessible. The URI in the script is set to:
```
uri = "mongodb://mongo-1:27017,mongo-2:27117,mongo-3:27217/web-crawl"  
```
Change the hosts, ports, and database name to match your MongoDB setup.
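The script passes a variable named `uri_with_options` to `MongoClient`, so the base URI is presumably extended with connection options before use. A minimal sketch of one way to do that with the standard library follows. The helper name `with_options`, the replica set name `rs0`, and the chosen options are assumptions for illustration, not values taken from the project.

```python
from urllib.parse import urlencode


def with_options(base_uri: str, **options: str) -> str:
    """Append connection-string options to a MongoDB URI (hypothetical helper)."""
    if not options:
        return base_uri
    # Use "&" if the URI already carries a query string, otherwise start one.
    separator = "&" if "?" in base_uri else "?"
    return f"{base_uri}{separator}{urlencode(options)}"


uri = "mongodb://mongo-1:27017,mongo-2:27117,mongo-3:27217/web-crawl"
# "rs0" is an assumed replica set name; use the name your deployment reports.
uri_with_options = with_options(uri, replicaSet="rs0", readPreference="primary")
print(uri_with_options)
```

Keeping the base URI and the options separate makes it easy to switch hosts without retyping the option string.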
Run the Script

Execute the main.py script to fetch documents from MongoDB, process them with Haystack, and run queries.
```
python main.py  
```

Components

MongoDB

The script connects to a MongoDB replica set without authentication. Make sure the replica set accepts unauthenticated connections from the machine running the script, or add credentials to the URI.

Haystack

Haystack is used for setting up the document store and the pipeline. Key components include:

InMemoryDocumentStore: Stores documents in memory for retrieval.
InMemoryBM25Retriever: Retrieves the documents most relevant to a query using the BM25 ranking algorithm.
PromptBuilder: Builds the prompt from the query and the retrieved documents.
OllamaGenerator: Generates responses using a model served by Ollama.
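To make the retriever's role concrete, here is a small, self-contained sketch of BM25 scoring in plain Python. It illustrates the ranking idea only and is not Haystack's implementation; the function name `bm25_scores`, the whitespace tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with a basic BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF that stays positive, as in common BM25 variants.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


pages = [
    "haystack builds search pipelines",
    "mongodb stores crawled web pages",
    "ollama serves local language models",
]
scores = bm25_scores("crawled pages", pages)
best = max(range(len(pages)), key=scores.__getitem__)
print(pages[best])
```

Documents that share no terms with the query score zero, which is why a keyword retriever like this works best when queries reuse the vocabulary of the crawled pages.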

Ollama

Ollama generates responses from the query and the context retrieved for it. The script specifies the llama3 model and calls an Ollama server running locally at http://localhost:11434.
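As a rough sketch of what a call to the local server involves, the snippet below builds (but does not send) a request for Ollama's `/api/generate` endpoint using only the standard library. The helper name `build_generate_request` and the sample prompt are assumptions; `OllamaGenerator` handles this step inside the pipeline, so the code only illustrates the shape of the exchange.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Prepare a non-streaming generate request (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("Summarize the crawled page in one sentence.")
# Sending requires a running Ollama server with the model already pulled:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["response"])
```

Setting `stream` to `False` asks the server for a single JSON object instead of a sequence of partial responses, which keeps the example simple.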

Script Overview

MongoDB Connection

Connects to MongoDB and fetches documents:

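```
from pymongo import MongoClient

# uri_with_options: the replica set URI from the setup step plus any connection options
client = MongoClient(uri_with_options)
db = client['web-crawl']
collection = db['pages']
```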

Document Processing

Processes and converts MongoDB documents into Haystack Document objects:

```
from haystack import Document

documents = [
    Document(
        # _id is a MongoDB ObjectId; Haystack document ids are strings
        id=str(doc['_id']),
        content=doc.get('content', ''),
        meta={
            'title': doc.get('title', ''),
            'url': doc.get('url', ''),
            'ldJsonScripts': [safe_json_loads(script) for script in doc.get('ldJsonScripts', [])],
            'imageUrls': doc.get('imageUrls', []),
            'pageHrefs': doc.get('pageHrefs', []),
            'linkTags': doc.get('linkTags', []),
            'metaTags': doc.get('metaTags', [])
        }
    ) for doc in collection.find({})
]
```
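The comprehension above relies on a `safe_json_loads` helper that this README does not show. A plausible minimal version, offered as an assumption about its behavior rather than the project's actual code, parses each JSON-LD script and returns `None` when the text is not valid JSON:

```python
import json
from typing import Any, Optional


def safe_json_loads(text: Any) -> Optional[Any]:
    """Parse JSON, returning None instead of raising on bad input (assumed behavior)."""
    try:
        return json.loads(text)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError covers non-string input such as None.
        return None


scripts = ['{"@type": "Article", "headline": "Hello"}', "<not json>", None]
parsed = [safe_json_loads(script) for script in scripts]
print(parsed)
```

Swallowing parse errors this way keeps one malformed script tag from aborting the whole import, at the cost of silently storing `None` for that entry.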

Pipeline Configuration

Sets up the Haystack pipeline with retriever, prompt builder, and Ollama generator:

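```
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack_integrations.components.generators.ollama import OllamaGenerator

pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OllamaGenerator(
    model="llama3",
    url="http://localhost:11434/api/generate",
))
```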
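The `template` passed to `PromptBuilder` is not shown in this README. `PromptBuilder` renders Jinja templates; as a stand-in that runs without extra dependencies, the sketch below assembles a comparable prompt with plain Python string formatting. The function name `build_prompt`, the wording of the prompt, and the sample documents are all assumptions.

```python
from typing import List


def build_prompt(question: str, contents: List[str]) -> str:
    """Combine retrieved document contents and a question into one prompt."""
    context = "\n".join(f"- {content}" for content in contents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Which database stores the crawl?",
    ["MongoDB stores the crawled pages.", "Ollama serves the llama3 model."],
)
print(prompt)
```

Telling the model to rely only on the supplied context is a common way to keep answers tied to the crawled pages rather than to whatever the model already knows.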

License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

Contact

For inquiries or support, please contact the author on LinkedIn.
