This project implements a security-focused language model using Retrieval-Augmented Generation (RAG) techniques. The system is built around the Qwen model family and is specifically designed for security-related tasks and analysis.
    rag_trained_security/
    ├── model_training/          # Model fine-tuning and training scripts
    │   ├── finetune_qwen.py     # Main fine-tuning script for Qwen models
    │   └── README.md            # Training-specific documentation
    ├── data_preparation/        # Data processing and preparation scripts
    │   └── README.md            # Data preparation documentation
    ├── utils/                   # Utility functions and helper scripts
    │   └── README.md            # Utilities documentation
    └── docs/                    # Detailed documentation
        ├── training.md          # Training process documentation
        ├── data_format.md       # Data format specifications
        └── model_config.md      # Model configuration details
- Support for multiple Qwen model variants (7B, 14B, Chat)
- Parameter-Efficient Fine-Tuning (PEFT) with LoRA
- 4-bit quantization for efficient training
- Flexible data input format
- Comprehensive logging and error handling
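The LoRA idea behind the PEFT support above can be illustrated with a toy example: the frozen base weights `W` are left untouched, and training learns two small matrices `A` and `B` whose product forms a low-rank update scaled by `alpha / r`. This is a minimal pure-Python sketch of that arithmetic, not the project's actual training code:

```python
# Toy illustration of a LoRA update: W' = W + (alpha / r) * B @ A.
# The base weights W stay frozen; only the small A and B matrices train.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_merge(W, A, B, r, alpha):
    """Return the effective weight matrix W + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 2x2 frozen base weights and a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in
B = [[0.5], [0.25]]   # d_out x r
print(lora_merge(W, A, B, r=1, alpha=2))  # W plus the scaled rank-1 update
```

Because only `A` and `B` (of rank `r`) are trained, the number of trainable parameters is a small fraction of the full matrix, which is what makes 4-bit-quantized fine-tuning fit in a single GPU's memory.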
- **Qwen-7B Base**
  - Standard 7B parameter model
  - Optimal for general security tasks
  - Balanced performance and resource usage
- **Qwen-14B**
  - Larger 14B parameter model
  - Enhanced reasoning capabilities
  - Suitable for complex security analysis
- **Qwen-7B Chat**
  - Conversation-optimized 7B model
  - Ideal for interactive security applications
  - Better response formatting
- **Setup Environment**

        pip install -r requirements.txt
- **Configure Environment Variables**

        # Copy the example environment file
        cp .env.example .env
        # Edit .env with your configurations
        nano .env

  Key environment variables:

  - `CHROMA_DB_PATH`: Path to ChromaDB storage
  - `CHROMA_COLLECTION_NAME`: Name of the ChromaDB collection
  - `OLLAMA_BASE_URL`: Ollama API endpoint
  - `MODEL_NAME`: Name of the model to use
  - `MAX_QUERY_RESULTS`: Number of results to return per query
  - `CHUNK_SIZE`: Size of text chunks for processing
  - `CHUNK_OVERLAP`: Overlap between text chunks
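The `CHUNK_SIZE` and `CHUNK_OVERLAP` variables control how documents are split before indexing. A minimal sketch of overlap-aware chunking driven by those variables follows; the default values and the `chunk_text` helper are illustrative assumptions, not the project's actual code:

```python
import os

# Illustrative defaults; the project's real values come from .env.
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "512"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "64"))

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into chunks of `size` characters, each starting
    `size - overlap` characters after the previous chunk begins."""
    if overlap >= size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", size=4, overlap=2))  # adjacent chunks share 2 chars
```

The overlap ensures that a fact straddling a chunk boundary still appears intact in at least one chunk, which improves retrieval recall.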
- **List Available Models**

        python model_training/finetune_qwen.py --list-models
- **Train Model**

        python model_training/finetune_qwen.py --model qwen-7b --training-file your_data.json
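The two commands above suggest a CLI along these lines. This `argparse` sketch is an assumption about the interface of `finetune_qwen.py` (including the variant names accepted by `--model`), not its actual source:

```python
import argparse

def build_parser():
    """Hypothetical CLI mirroring the commands shown above."""
    parser = argparse.ArgumentParser(description="Fine-tune a Qwen model")
    parser.add_argument("--model",
                        choices=["qwen-7b", "qwen-14b", "qwen-7b-chat"],
                        help="Model variant to fine-tune (names are illustrative)")
    parser.add_argument("--training-file",
                        help="Path to the JSON training data")
    parser.add_argument("--list-models", action="store_true",
                        help="List available model variants and exit")
    return parser

args = build_parser().parse_args(["--model", "qwen-7b",
                                  "--training-file", "your_data.json"])
print(args.model, args.training_file)
```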
- [Training Process](docs/training.md): Detailed guide on model training
- [Data Format](docs/data_format.md): Specifications for training data
- [Model Configuration](docs/model_config.md): Model-specific settings
- NVIDIA GPU with 24GB+ VRAM (RTX 4090 or better)
- CUDA 12.4+
- 64GB+ System RAM
- Python 3.10+
The system expects training data in JSON format with specific fields for security-related content. See [Data Format](docs/data_format.md) for details.
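The authoritative schema lives in the Data Format document; purely as an illustration, a record might pair a security prompt with a reference answer, and a loader could validate required fields before training. The field names below are assumptions, not the project's actual schema:

```python
import json

# Hypothetical required fields -- the real schema is in docs/data_format.md.
REQUIRED_FIELDS = {"instruction", "response"}

def load_training_records(raw_json):
    """Parse a JSON array of records, checking each has the required fields."""
    records = json.loads(raw_json)
    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record {i} missing fields: {sorted(missing)}")
    return records

sample = '[{"instruction": "Explain CVSS scoring.", "response": "CVSS assigns a 0-10 severity score..."}]'
print(len(load_training_records(sample)))
```

Validating up front turns a silent mid-training failure into an immediate, descriptive error.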
Each model variant has specific configurations for:
- LoRA parameters (rank, alpha, dropout)
- Target modules for fine-tuning
- Quantization settings
- Padding and tokenization
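Per-variant settings like these are often kept in a lookup table keyed by model name. The sketch below shows that shape; every value in it is an illustrative placeholder, not the project's actual configuration:

```python
# Hypothetical per-variant fine-tuning settings (all values are placeholders).
MODEL_CONFIGS = {
    "qwen-7b": {
        "lora": {"r": 8, "alpha": 16, "dropout": 0.05},
        "target_modules": ["q_proj", "v_proj"],
        "quantization_bits": 4,
    },
    "qwen-14b": {
        "lora": {"r": 16, "alpha": 32, "dropout": 0.05},
        "target_modules": ["q_proj", "k_proj", "v_proj"],
        "quantization_bits": 4,
    },
}

def get_config(variant):
    """Look up settings for a variant, failing loudly on unknown names."""
    try:
        return MODEL_CONFIGS[variant]
    except KeyError:
        raise ValueError(f"unknown model variant: {variant!r}") from None

print(get_config("qwen-7b")["lora"]["r"])
```

Centralizing the table makes it easy to audit how LoRA rank, target modules, and quantization differ across the 7B and 14B variants.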
- Fork the repository
- Create a feature branch
- Submit a pull request with detailed description
Cory Kujawski [email protected]
- Qwen model team
- PEFT library contributors
- Hugging Face team