This repository contains the data and evaluation scripts for the paper "CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence", accepted at NeurIPS 2024. CTIBench is a comprehensive suite of benchmark tasks and datasets designed to evaluate Large Language Models (LLMs) in the field of Cyber Threat Intelligence (CTI).
Dataset details are available on Hugging Face: https://huggingface.co/datasets/AI4Sec/cti-bench
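A minimal sketch of loading the benchmark from the Hugging Face Hub with the `datasets` library. The subset name `"cti-mcq"` and the `split` value below are illustrative assumptions; consult the dataset page for the actual subset and split names.

```python
# Sketch: loading a CTIBench subset from the Hugging Face Hub.
# Requires the `datasets` package (pip install datasets).

DATASET_REPO = "AI4Sec/cti-bench"


def load_cti_bench(subset="cti-mcq", split="test"):
    """Load one CTIBench subset.

    The subset/split names here are assumptions for illustration;
    see the dataset card for the available configurations.
    """
    from datasets import load_dataset  # imported lazily so the module loads without it

    return load_dataset(DATASET_REPO, name=subset, split=split)


if __name__ == "__main__":
    ds = load_cti_bench()
    print(ds)
```

The lazy import keeps the module importable even when `datasets` is not installed, which can be convenient in evaluation scripts that only touch local logs.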
`evaluation`
Contains scripts to evaluate model performance, along with the responses of five LLMs: ChatGPT-3.5, ChatGPT-4, Gemini-1.5, LLAMA3-70B, and LLAMA3-8B.
`logs`
Contains the unprocessed responses from ChatGPT-3.5, ChatGPT-4, and Gemini-1.5 for the tasks.