- Dataset generation: CVSS scores are now extracted from GitHub security advisories.
- Trainers: Added support for roberta-base in the text classifier, with improved TrainingArguments settings.
- Validators: Added a validator for severity classification.
- Introduced a new trainer to automatically classify vulnerabilities based on their descriptions,
even when CVSS scores are unavailable.
- Added CVSS parsing to the dataset generation script.
- Refactored the project structure for better organization.
- Improved CPE parsing.
- Enhanced the dataset generation script.
- Optimized the trainer for text generation on vulnerability descriptions.
- Improved command-line argument parsing.
- Improved the process of pushing the tokenizer and trainer to Hugging Face.
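
To illustrate the CVSS parsing mentioned in the entries above, here is a minimal sketch of splitting a CVSS vector string into its version and metrics. The function name `parse_cvss_vector` and the returned dictionary layout are assumptions made for this example; the project's own parser may differ.

```python
def parse_cvss_vector(vector: str) -> dict:
    """Split a CVSS vector string into its version and metric values.

    Hypothetical helper for illustration only, not the project's parser.
    """
    parts = vector.strip().split("/")
    # The first component carries the version, e.g. "CVSS:3.1".
    prefix, _, version = parts[0].partition(":")
    if prefix != "CVSS" or not version:
        raise ValueError(f"not a prefixed CVSS vector: {vector!r}")

    metrics = {}
    for part in parts[1:]:
        # Every remaining component is a "KEY:VALUE" pair, e.g. "AV:N".
        key, sep, value = part.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"malformed metric: {part!r}")
        metrics[key] = value
    return {"version": version, "metrics": metrics}


example = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(example["version"], example["metrics"]["AV"])
```

The sketch only splits the string; it does not validate metric names or compute a base score.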
Fixed configuration module name.
Added support for a configuration file.
The dataset generation step now uses data from GitHub Advisories, and the VulnExtractor cleans the summary and details fields.
Dataset generation: allow specifying a commit message when uploading to Hugging Face.
Validation: Added a simple validation script for models optimized for text generation. The script can pull a model and send tasks via a Pipeline.
Training: added a choice of models: gpt2, distilgpt2, meta-llama/Llama-3.3-70B-Instruct, and distilbert-base-uncased.
Various improvements to the command line parsing.
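
As a rough sketch of how a restricted model choice can be exposed on the command line, the example below uses `argparse` with a `choices` list. The flag name `--base-model`, the default value, and the `build_parser` helper are assumptions made for this example and may not match the project's actual interface.

```python
import argparse

# Model identifiers taken from the changelog entry above.
MODEL_CHOICES = [
    "gpt2",
    "distilgpt2",
    "meta-llama/Llama-3.3-70B-Instruct",
    "distilbert-base-uncased",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical --base-model option."""
    parser = argparse.ArgumentParser(
        description="Train a model on vulnerability descriptions."
    )
    parser.add_argument(
        "--base-model",
        choices=MODEL_CHOICES,
        default="gpt2",  # assumed default, for illustration only
        help="Base model to train.",
    )
    return parser


args = build_parser().parse_args(["--base-model", "distilgpt2"])
print(args.base_model)
```

With `choices`, argparse rejects any value outside the list and prints the allowed values in the error message.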
Added a trainer. Experimenting with distilbert-base-uncased (AutoModelForMaskedLM) and gpt2 (AutoModelForCausalLM). The goal is to generate text.
Made various improvements to the dataset generator and added a command line parser.
First release with upload of datasets to Hugging Face.
Datasets are built from NIST data, enriched with data from FKIE and vulnrichment.