- Overview
- Features
- Installation
- Usage
- API Endpoints
- Configuration
- Project Structure
- Dependencies
- Testing
- Future Improvements
- Error Handling
- Possible Issues
## Overview

The qavanin.ir Scraper and API is a comprehensive solution for extracting, processing, and analyzing legal documents from the qavanin.ir website. It combines web scraping capabilities with natural language processing and a robust API to provide easy access to legal information.
## Features

- Web Scraping: Crawls multiple pages from qavanin.ir, extracting legal documents.
- Text Processing: Cleans HTML content and converts it to a structured Markdown format.
- Vector Embeddings: Generates vector embeddings for processed text using SentenceTransformer.
- Database Storage: Stores original text, processed text, and vector embeddings in PostgreSQL with the pgvector extension.
- FastAPI Endpoints: Provides a RESTful API for querying similar content, updating documents, and more.
- Docker Support: Easily deploy and run the application using Docker.
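
To make the embedding step concrete, here is a minimal sketch; the model name is an assumption, and the project's actual choice lives in `data_processing/vectorizer.py`.

```python
# Minimal sketch of the embedding step (model name is an assumption; the
# project configures its model in data_processing/vectorizer.py).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("cleaned legal document text")  # numpy array
print(embedding.shape)  # (384,) for this model; stored via pgvector alongside the text
```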
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/MSC72m/qavanin.ir_ve.git
  cd qavanin.ir_ve/database
  ```

- Build the Docker image:

  ```bash
  docker build -t pgvector_db .
  ```

- Run the Docker container:

  ```bash
  docker run -p 5432:5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=pg-test pgvector_db
  ```

- Create a virtual environment and activate it:

  ```bash
  # Run from the project root directory (/qavanin-ir_ve); `cd ..` if you are still in database/
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your PostgreSQL database and install the pgvector extension.

- Create a `.env` file in `/qavanin-ir_ve/database` and add your database configuration (a quick connectivity check follows this list):

  ```
  POSTGRES_USER=your_username
  POSTGRES_PASSWORD=your_password
  POSTGRES_DB=qavanin_db
  ```
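
Before moving on, you can verify that the database container is reachable and the pgvector extension loads. This is a minimal sketch assuming the credentials from the `docker run` example above; substitute your `.env` values as needed.

```python
# Connectivity check (a sketch; credentials mirror the docker run example above).
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    user="test", password="test", dbname="pg-test",
)
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector version:", cur.fetchone()[0])
conn.commit()
conn.close()
```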
## Usage

- Configure the web scraping variables in `crawler/main.py`:

  ```python
  item_in_page = 25  # Number of items per page
  start_page = 1     # First page to start scraping
  last_page = 1      # Last page to scrape
  ```

- Run the scraper. The first run downloads Chromium, the sentence-transformers models, and their CUDA dependencies, so it will be a little slow. If you hit Chromium-related errors on your first tries, just try again; the program is tested and functional, but Selenium can occasionally be flaky.

  ```bash
  # Run this command from the project root directory (/qavanin-ir_ve)
  python crawler/main.py
  ```

- Start the FastAPI server:

  ```bash
  # Run this command from the project root directory (/qavanin-ir_ve)
  uvicorn api.main:app --reload
  ```

- Access the API documentation at http://localhost:8000/docs.
## API Endpoints

### Get Closest Match

Find the closest matching documents for a given input text.

Request:

```
GET /api/get_closest_match?limit=5
```

Body:

```json
{
  "text": "Your search query here"
}
```

Response:

```json
{
  "closest_documents": [
    {
      "id": 1,
      "content": "Matched document content"
    }
  ],
  "total_documents": 100
}
```
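
For a quick sanity check, the endpoint can be called from Python; host, port, and query text below are illustrative.

```python
# Illustrative client call (assumes the local uvicorn server from the Usage section).
import requests

resp = requests.get(
    "http://localhost:8000/api/get_closest_match",
    params={"limit": 5},
    json={"text": "Your search query here"},
)
resp.raise_for_status()
for doc in resp.json()["closest_documents"]:
    print(doc["id"], doc["content"][:80])
```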
### Update Document

Update the content of a specific document.

Request:

```
PUT /api/update_document/1
```

Body:

```json
{
  "text": "Updated document content"
}
```

Response:

```json
{
  "message": "Document updated successfully",
  "document": {
    "content": "Updated document content",
    "updated_at": "2023-05-20T12:00:00Z"
  }
}
```
### Delete Document

Delete a specific document.

Request:

```
DELETE /api/delete_document/1
```

Response:

```
204 No Content
```
### Get Document

Retrieve a specific document by its ID.

Request:

```
GET /api/get_document/1
```

Response:

```json
{
  "message": "Document retrieved successfully",
  "id": 1,
  "content": "Document content"
}
```
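
The remaining endpoints can be exercised the same way. The snippet below is a sketch of a full document lifecycle; the document ID is assumed to exist.

```python
# Illustrative update/get/delete round trip (document ID 1 is an assumption).
import requests

BASE = "http://localhost:8000/api"

resp = requests.put(f"{BASE}/update_document/1", json={"text": "Updated document content"})
print(resp.json()["message"])

resp = requests.get(f"{BASE}/get_document/1")
print(resp.json()["content"])

resp = requests.delete(f"{BASE}/delete_document/1")
print(resp.status_code)  # expect 204
```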
## Configuration

- Database configuration is stored in the `.env` file.
- Web scraping parameters can be adjusted in `crawler/main.py`.
- The SentenceTransformer model can be changed in `data_processing/vectorizer.py` (see the sketch below).
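
For example, swapping in a multilingual model could look like the sketch below; it is purely illustrative, and the real structure of `data_processing/vectorizer.py` may differ.

```python
# Hypothetical sketch of pointing the vectorizer at a different model; the
# actual layout of data_processing/vectorizer.py may differ.
from sentence_transformers import SentenceTransformer

MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"  # e.g., better suited to Persian text

_model = SentenceTransformer(MODEL_NAME)

def vectorize(text: str) -> list[float]:
    """Return the embedding for one cleaned document as a plain list."""
    return _model.encode(text).tolist()
```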
The current Dockerfile only covers the database and is located at `/qavanin-ir_ve/database/Dockerfile`. The Dockerfile in the root directory is under development and is supposed to host both the DB and an API instance.
## Project Structure

```
qavanin-ir_ve/
│
├── crawler/
│   ├── web_scraper.py
│   ├── parser.py
│   └── main.py
│
├── data_processing/
│   ├── text_cleaner.py
│   └── vectorizer.py
│
├── database/
│   ├── models.py
│   ├── db_operations.py
│   └── .env
│
├── api/
│   ├── main.py
│   └── endpoints.py
│
├── tests/
│   ├── __init__.py
│   ├── test_api.py
│   ├── test_db_operations.py
│   ├── test_models.py
│   ├── test_parser.py
│   ├── test_text_cleaner.py
│   ├── test_vectorizer.py
│   └── test_web_scraper.py
│
├── requirements.txt
├── Dockerfile
└── README.md
```
## Dependencies

- FastAPI
- SQLAlchemy
- psycopg2-binary
- pgvector
- selenium
- sentence-transformers
For a complete list, refer to the requirements.txt file.
## Testing

The project uses pytest for automated testing of various components. The test suite covers different modules and functionalities to ensure reliability and correctness.
Tests are located in the `tests/` directory and follow this structure:

```
tests/
├── __init__.py
├── test_api.py
├── test_db_operations.py
├── test_models.py
├── test_parser.py
├── test_text_cleaner.py
├── test_vectorizer.py
└── test_web_scraper.py
```
To run the tests, ensure you have pytest installed:

```bash
# install pytest if it is not already installed
pip install pytest
```

Then, from the project root directory, run:

```bash
pytest
```
This command will discover and run all test files in the tests/ directory.
The test suite covers various aspects of the application:
- API functionality (test_api.py)
- Database operations (test_db_operations.py)
- Data models (test_models.py)
- HTML parsing (test_parser.py)
- Text cleaning (test_text_cleaner.py)
- Vector embedding generation (test_vectorizer.py)
- Web scraping functionality (test_web_scraper.py)
## Future Improvements

The qavanin.ir Scraper and API is designed to be extensible and scalable, with potential improvements to increase its performance, error handling, and usability. Some of the key future enhancements include:

- Multithreading or Asyncio for Faster Scraping: Implementing multithreaded or asynchronous scraping can significantly speed up the process by fetching multiple pages simultaneously. Multithreading is feasible, but it needs extra logic to prevent duplicate records and to recover when workers get out of sync; sets plus careful error handling could keep duplicates out of the database, though there has not been enough time to implement this yet. The website is Iran-access only, so proxies and CDNs cannot be used. A refactor to Playwright would enable asynchronous requesting, and running multiple scraper instances in Docker is another option.
- Improved Error Handling: Adding more robust error handling and recovery mechanisms will ensure smoother scraping under challenging network conditions or when the target website's structure changes.
- Rate Limiting: Introducing rate limiting will prevent overloading the qavanin.ir website, ensuring compliance with web scraping best practices and avoiding potential blocking by the server (a sketch follows this list).
- Performance Optimization for Document Processing: Optimizing large-scale document processing will improve response times when dealing with a high volume of legal documents. Caching mechanisms could also be introduced to speed up frequent queries.
- Parallel Processing for Text Cleaning and Vectorization: Implementing parallel processing for text cleaning and vector embedding generation can improve the overall efficiency of the data pipeline when handling large batches of documents.
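
For the rate-limiting idea, one minimal approach is to enforce a fixed pause between page requests. This is a sketch, not part of the current codebase.

```python
# Hypothetical rate limiter: enforce a minimum interval between requests.
import time

class RateLimiter:
    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval  # seconds between requests
        self._last = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the configured interval."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=2.0)
# limiter.wait()  # call before each page fetch in the crawler loop
```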
## Error Handling

- Database connection errors are caught and logged.
- Web scraping failures are handled with retries and logging.
- API endpoints include proper error responses and status codes.
- Custom exceptions such as DatabaseInitializationError are used for specific error scenarios.
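
The retry behavior can be pictured with the simplified sketch below; the project's actual handling in `crawler/web_scraper.py` may differ.

```python
# Simplified retry-with-logging sketch for scraping failures (illustrative only).
import logging
import time

logger = logging.getLogger(__name__)

def fetch_with_retries(fetch, url, retries=3, delay=5.0):
    """Call fetch(url), retrying on failure with a fixed delay between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in practice, narrow to Selenium/network exceptions
            logger.warning("attempt %d/%d for %s failed: %s", attempt, retries, url, exc)
            if attempt == retries:
                raise
            time.sleep(delay)
```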
## Possible Issues

- qavanin.ir Access: qavanin.ir is an Iran-access website, meaning it rejects any requests that are not made from an Iranian IP. This will cause problems if your IP is not from Iran.
- Cloudflare Protection: qavanin.ir sits behind Cloudflare and is protected by it. Reducing the crawl delay may trigger puzzles and other security measures that interrupt the bot.
- Database Initialization: Database initialization might fail if the database is not set up correctly. Make sure to verify this step before proceeding.
- Dependencies: Missing dependencies can cause problems. For example, Chromium drivers are required for crawling, and the sentence-transformers library pulls in heavy dependencies such as torch and CUDA libraries. When debugging, first rule out dependency-related errors (a quick check follows this list).
- Library Interference: Some libraries, if not installed correctly, might cause problems and interfere with each other. Keep this in mind.
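
To rule out Chromium and driver problems quickly, a smoke test along these lines can help; it assumes Selenium 4+, where Selenium Manager fetches a matching driver automatically.

```python
# Smoke test for the Selenium/Chromium dependency (Selenium 4+ assumed).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)  # Selenium Manager resolves the driver
print("Chromium OK:", driver.capabilities.get("browserVersion"))
driver.quit()
```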