- Overview
- Features
- Installation
- Usage
- API Endpoints
- Configuration
- Project Structure
- Dependencies
- Testing
- Future Improvements
- Error Handling
- Possible Issues
## Overview

The qavanin.ir Scraper and API is a comprehensive solution for extracting, processing, and analyzing legal documents from the qavanin.ir website. It combines web scraping capabilities with natural language processing and a robust API to provide easy access to legal information.
## Features

- Web Scraping: Crawls multiple pages from qavanin.ir, extracting legal documents.
- Text Processing: Cleans HTML content and converts it to a structured Markdown format.
- Vector Embeddings: Generates vector embeddings for processed text using SentenceTransformer.
- Database Storage: Stores original text, processed text, and vector embeddings in PostgreSQL with the pgvector extension.
- FastAPI Endpoints: Provides a RESTful API for querying similar content, updating documents, and more.
- Docker Support: Easily deploy and run the application using Docker.
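
To make the embedding step concrete, here is a minimal sketch; the model name is an assumption, and the project's actual choice lives in `data_processing/vectorizer.py`.

```python
# Minimal sketch of the embedding step (model name is an assumption; the
# project configures its model in data_processing/vectorizer.py).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("cleaned legal document text")  # numpy array
print(embedding.shape)  # (384,) for this model; stored via pgvector alongside the text
```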
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/MSC72m/qavanin.ir_ve.git
  cd qavanin.ir_ve/database
  ```

- Build the Docker image:

  ```bash
  docker build -t pgvector_db .
  ```

- Run the Docker container:

  ```bash
  docker run -p 5432:5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=pg-test pgvector_db
  ```

- Create a virtual environment and activate it:

  ```bash
  # Run from the project root directory (/qavanin-ir_ve); `cd ..` if you are still in database/
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your PostgreSQL database and install the pgvector extension.

- Create a `.env` file in `/qavanin-ir_ve/database` and add your database configuration (a quick connectivity check follows this list):

  ```
  POSTGRES_USER=your_username
  POSTGRES_PASSWORD=your_password
  POSTGRES_DB=qavanin_db
  ```
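
Before moving on, you can verify that the database container is reachable and the pgvector extension loads. This is a minimal sketch assuming the credentials from the `docker run` example above; substitute your `.env` values as needed.

```python
# Connectivity check (a sketch; credentials mirror the docker run example above).
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    user="test", password="test", dbname="pg-test",
)
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector version:", cur.fetchone()[0])
conn.commit()
conn.close()
```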
## Usage

- Configure the web scraping variables in `crawler/main.py`:

  ```python
  item_in_page = 25  # Number of items per page
  start_page = 1     # First page to start scraping
  last_page = 1      # Last page to scrape
  ```

- Run the scraper. The first run downloads Chromium, the sentence-transformers models, and their CUDA dependencies, so it will be a little slow. If you hit Chromium-related errors on your first tries, just try again; the program is tested and functional, but Selenium can occasionally be flaky.

  ```bash
  # Run this command from the project root directory (/qavanin-ir_ve)
  python crawler/main.py
  ```

- Start the FastAPI server:

  ```bash
  # Run this command from the project root directory (/qavanin-ir_ve)
  uvicorn api.main:app --reload
  ```

- Access the API documentation at http://localhost:8000/docs.
## API Endpoints

### Get Closest Match

Find the closest matching documents for a given input text.

Request:

```
GET /api/get_closest_match?limit=5
```

Body:

```json
{
  "text": "Your search query here"
}
```

Response:

```json
{
  "closest_documents": [
    {
      "id": 1,
      "content": "Matched document content"
    }
  ],
  "total_documents": 100
}
```
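
For a quick sanity check, the endpoint can be called from Python; host, port, and query text below are illustrative.

```python
# Illustrative client call (assumes the local uvicorn server from the Usage section).
import requests

resp = requests.get(
    "http://localhost:8000/api/get_closest_match",
    params={"limit": 5},
    json={"text": "Your search query here"},
)
resp.raise_for_status()
for doc in resp.json()["closest_documents"]:
    print(doc["id"], doc["content"][:80])
```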
### Update Document

Update the content of a specific document.

Request:

```
PUT /api/update_document/1
```

Body:

```json
{
  "text": "Updated document content"
}
```

Response:

```json
{
  "message": "Document updated successfully",
  "document": {
    "content": "Updated document content",
    "updated_at": "2023-05-20T12:00:00Z"
  }
}
```
### Delete Document

Delete a specific document.

Request:

```
DELETE /api/delete_document/1
```

Response:

```
204 No Content
```
### Get Document

Retrieve a specific document by its ID.

Request:

```
GET /api/get_document/1
```

Response:

```json
{
  "message": "Document retrieved successfully",
  "id": 1,
  "content": "Document content"
}
```
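
The remaining endpoints can be exercised the same way. The snippet below is a sketch of a full document lifecycle; the document ID is assumed to exist.

```python
# Illustrative update/get/delete round trip (document ID 1 is an assumption).
import requests

BASE = "http://localhost:8000/api"

resp = requests.put(f"{BASE}/update_document/1", json={"text": "Updated document content"})
print(resp.json()["message"])

resp = requests.get(f"{BASE}/get_document/1")
print(resp.json()["content"])

resp = requests.delete(f"{BASE}/delete_document/1")
print(resp.status_code)  # expect 204
```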
## Configuration

- Database configuration is stored in the `.env` file.
- Web scraping parameters can be adjusted in `crawler/main.py`.
- The SentenceTransformer model can be changed in `data_processing/vectorizer.py` (see the sketch below).
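
For example, swapping in a multilingual model could look like the sketch below; it is purely illustrative, and the real structure of `data_processing/vectorizer.py` may differ.

```python
# Hypothetical sketch of pointing the vectorizer at a different model; the
# actual layout of data_processing/vectorizer.py may differ.
from sentence_transformers import SentenceTransformer

MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"  # e.g., better suited to Persian text

_model = SentenceTransformer(MODEL_NAME)

def vectorize(text: str) -> list[float]:
    """Return the embedding for one cleaned document as a plain list."""
    return _model.encode(text).tolist()
```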
The current Dockerfile only covers the database and is located at `/qavanin-ir_ve/database/Dockerfile`. The Dockerfile in the root directory is under development and is supposed to host both the DB and an API instance.
## Project Structure

```
qavanin-ir_ve/
│
├── crawler/
│   ├── web_scraper.py
│   ├── parser.py
│   └── main.py
│
├── data_processing/
│   ├── text_cleaner.py
│   └── vectorizer.py
│
├── database/
│   ├── models.py
│   ├── db_operations.py
│   └── .env
│
├── api/
│   ├── main.py
│   └── endpoints.py
│
├── tests/
│   ├── __init__.py
│   ├── test_api.py
│   ├── test_db_operations.py
│   ├── test_models.py
│   ├── test_parser.py
│   ├── test_text_cleaner.py
│   ├── test_vectorizer.py
│   └── test_web_scraper.py
│
├── requirements.txt
├── Dockerfile
└── README.md
```
## Dependencies

- FastAPI
- SQLAlchemy
- psycopg2-binary
- pgvector
- selenium
- sentence-transformers
For a complete list, refer to the requirements.txt file.
## Testing

The project uses pytest for automated testing of various components. The test suite covers different modules and functionalities to ensure reliability and correctness.
Tests are located in the `tests/` directory and follow this structure:

```
tests/
├── __init__.py
├── test_api.py
├── test_db_operations.py
├── test_models.py
├── test_parser.py
├── test_text_cleaner.py
├── test_vectorizer.py
└── test_web_scraper.py
```
To run the tests, ensure you have pytest installed:

```bash
# install pytest if it is not already installed
pip install pytest
```

Then, from the project root directory, run:

```bash
pytest
```
This command will discover and run all test files in the tests/ directory.
The test suite covers various aspects of the application:
- API functionality (test_api.py)
- Database operations (test_db_operations.py)
- Data models (test_models.py)
- HTML parsing (test_parser.py)
- Text cleaning (test_text_cleaner.py)
- Vector embedding generation (test_vectorizer.py)
- Web scraping functionality (test_web_scraper.py)
## Future Improvements

The qavanin.ir Scraper and API is designed to be extensible and scalable, with potential improvements to increase its performance, error handling, and usability. Some of the key future enhancements include:

- Multithreading or Asyncio for Faster Scraping: Implementing multithreaded or asynchronous scraping can significantly speed up the process by fetching multiple pages simultaneously. Multithreading is feasible, but it needs extra logic to prevent duplicate records and to recover when workers get out of sync; sets plus careful error handling could keep duplicates out of the database, though there has not been enough time to implement this yet. The website is Iran-access only, so proxies and CDNs cannot be used. A refactor to Playwright would enable asynchronous requesting, and running multiple scraper instances in Docker is another option.
- Improved Error Handling: Adding more robust error handling and recovery mechanisms will ensure smoother scraping under challenging network conditions or when the target website's structure changes.
- Rate Limiting: Introducing rate limiting will prevent overloading the qavanin.ir website, ensuring compliance with web scraping best practices and avoiding potential blocking by the server (a sketch follows this list).
- Performance Optimization for Document Processing: Optimizing large-scale document processing will improve response times when dealing with a high volume of legal documents. Caching mechanisms could also be introduced to speed up frequent queries.
- Parallel Processing for Text Cleaning and Vectorization: Implementing parallel processing for text cleaning and vector embedding generation can improve the overall efficiency of the data pipeline when handling large batches of documents.
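
For the rate-limiting idea, one minimal approach is to enforce a fixed pause between page requests. This is a sketch, not part of the current codebase.

```python
# Hypothetical rate limiter: enforce a minimum interval between requests.
import time

class RateLimiter:
    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval  # seconds between requests
        self._last = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the configured interval."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=2.0)
# limiter.wait()  # call before each page fetch in the crawler loop
```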
## Error Handling

- Database connection errors are caught and logged.
- Web scraping failures are handled with retries and logging.
- API endpoints include proper error responses and status codes.
- Custom exceptions such as DatabaseInitializationError are used for specific error scenarios.
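
The retry behavior can be pictured with the simplified sketch below; the project's actual handling in `crawler/web_scraper.py` may differ.

```python
# Simplified retry-with-logging sketch for scraping failures (illustrative only).
import logging
import time

logger = logging.getLogger(__name__)

def fetch_with_retries(fetch, url, retries=3, delay=5.0):
    """Call fetch(url), retrying on failure with a fixed delay between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in practice, narrow to Selenium/network exceptions
            logger.warning("attempt %d/%d for %s failed: %s", attempt, retries, url, exc)
            if attempt == retries:
                raise
            time.sleep(delay)
```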
## Possible Issues

- qavanin.ir Access: qavanin.ir is an Iran-access website, meaning it rejects any requests that are not made from an Iranian IP. This will cause problems if your IP is not from Iran.
- Cloudflare Protection: qavanin.ir sits behind Cloudflare and is protected by it. Reducing the crawl delay may trigger puzzles and other security measures that interrupt the bot.
- Database Initialization: Database initialization might fail if the database is not set up correctly. Make sure to verify this step before proceeding.
- Dependencies: Missing dependencies can cause problems. For example, Chromium drivers are required for crawling, and the sentence-transformers library pulls in heavy dependencies such as torch and CUDA libraries. When debugging, first rule out dependency-related errors (a quick check follows this list).
- Library Interference: Some libraries, if not installed correctly, might cause problems and interfere with each other. Keep this in mind.
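
To rule out Chromium and driver problems quickly, a smoke test along these lines can help; it assumes Selenium 4+, where Selenium Manager fetches a matching driver automatically.

```python
# Smoke test for the Selenium/Chromium dependency (Selenium 4+ assumed).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)  # Selenium Manager resolves the driver
print("Chromium OK:", driver.capabilities.get("browserVersion"))
driver.quit()
```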