Despite tremendous advancements in large language models (LLMs) over recent years, a notably urgent challenge for their practical deployment is the phenomenon of hallucination, where the model fabricates facts and produces non-factual statements. In response, we propose PoLLMgraph, a Polygraph for LLMs, as an effective model-based white-box detection and forecasting approach. PoLLMgraph differs distinctly from the large body of existing research that addresses such challenges through black-box evaluations. In particular, we demonstrate that hallucination can be effectively detected by analyzing the LLM's internal state transition dynamics during generation via tractable probabilistic models. Experimental results on various open-source LLMs confirm the efficacy of PoLLMgraph, outperforming state-of-the-art methods by a considerable margin, evidenced by over 20% improvement in AUC-ROC on common benchmarking datasets such as TruthfulQA. Our work paves a new way for model-based white-box analysis of LLMs, motivating the research community to further explore, understand, and refine the intricate dynamics of LLM behaviors.
- Ensure you have Python 3.8+ installed.
- Clone this repository:

  ```bash
  git clone <repository-link>
  ```

- Navigate to the project directory and set up a virtual environment:

  ```bash
  cd pollmgraph
  conda create -n env_name python=3.8
  ```

- Activate the virtual environment:

  ```bash
  conda activate env_name
  ```

- Install the necessary dependencies:

  ```bash
  pip install -r requirements.txt
  ```
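Optionally, a quick sanity check confirms the environment is active and the interpreter is on the expected version (the exact dependencies are whatever `requirements.txt` pins; the package names below are only typical examples, not a guaranteed list):

```bash
python --version                              # expect Python 3.8.x or newer
pip list | grep -i -E "numpy|scikit|torch"    # spot-check a few likely dependencies
```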
To evaluate hallucination detection effectiveness, you first need to initialize the `MetricsAppEvalCollections` class. This class is responsible for collecting the metrics used to evaluate detection performance:

```python
eval_obj = MetricsAppEvalCollections(
    state_abstract_args_obj,
    prob_args_obj,
    train_instances,
    val_instances,
    test_instances,
)
```
Where:

- `state_abstract_args_obj`: a namespace object containing arguments related to state abstraction (e.g., dataset name, block index, info type).
- `prob_args_obj`: a namespace object containing arguments related to probability calculations (e.g., dataset, PCA dimension, model type).
- `train_instances`, `val_instances`, `test_instances`: the data instances for training, validation, and testing.
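The exact fields these namespace objects carry are defined by the repository's argument parsers; as a rough illustration (the field names and values below are assumptions, not the definitive interface), they can be built with `argparse.Namespace`:

```python
from argparse import Namespace

# Hypothetical field names for illustration only; consult the repository's
# argument parsers for the actual options and their defaults.
state_abstract_args_obj = Namespace(
    dataset="truthful_qa",        # dataset name
    block_idx=12,                 # transformer block whose hidden states are abstracted
    info_type="hidden_states",    # which internal signal to collect
)
prob_args_obj = Namespace(
    dataset="truthful_qa",
    pca_dim=64,                   # PCA dimension used for state abstraction
    model_type="markov",          # tractable probabilistic model to fit
)
```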
Once the `MetricsAppEvalCollections` object is initialized, you can calculate various metrics. Here are some examples:

- Evaluating the model:

  ```python
  aucroc, accuracy, f1_score, _, _, hallucination_threshold = eval_obj.get_eval_result()
  ```

- Calculating entropy:

  ```python
  entropy = eval_obj.entropy()
  ```
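If you want to sanity-check the reported AUC-ROC against the detector's raw scores, the standard scikit-learn routine can be used (the variables `labels` and `scores` below are placeholders for whatever your pipeline produces, not outputs of this repository):

```python
from sklearn.metrics import roc_auc_score

# labels: 1 = hallucinated, 0 = truthful (ground truth for the test split)
# scores: the detector's hallucination score for each test instance
labels = [1, 0, 1, 0, 0]
scores = [0.92, 0.10, 0.73, 0.35, 0.08]
print("AUC-ROC:", roc_auc_score(labels, scores))
```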
- Purpose: TruthfulQA is designed to evaluate the truthfulness of Large Language Models (LLMs) in generating answers to questions.
- Composition: The dataset contains 817 questions across 38 categories of potential falsehoods, such as misconceptions and fiction.
- Truth Assessment: the truthfulness of answers is judged with a fine-tuned GPT-3-13B model ("GPT-judge"), which classifies each response as true or false.
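The repository ships its own data-handling code, but if you simply want to inspect the benchmark, TruthfulQA is also available on the Hugging Face hub (the snippet below assumes the `datasets` library is installed and is independent of this repo):

```python
from datasets import load_dataset

# "generation" config: open-ended questions with reference correct/incorrect answers
truthful_qa = load_dataset("truthful_qa", "generation")
print(truthful_qa["validation"][0]["question"])
```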
Before utilizing the TruthfulQA dataset, certain preparatory steps are required:
- Model Fine-Tuning:
  - Follow the guide in "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" to create GPT-JUDGE, a fine-tuned GPT-3 model.
- Dataset Preparation:
  - Run the `add_scores_to_truthful_qa.py` script to process the dataset.
  - Make sure to update the `file_name` and `file_with_score` variables in the script with the correct file paths (see the illustrative snippet below).
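For reference, the two variables are plain file paths set inside the script; an illustrative configuration (the paths below are placeholders, not defaults shipped with the repository) looks like:

```python
# Inside add_scores_to_truthful_qa.py -- placeholder paths, adjust to your setup
file_name = "data/TruthfulQA.csv"               # raw TruthfulQA questions and answers
file_with_score = "data/TruthfulQA_scored.csv"  # output annotated with GPT-judge scores
```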
Execute the following command in your terminal:

```bash
python add_scores_to_truthful_qa.py
```
Model abstraction is the process of distilling the behavior and properties of a system into a simplified representation that retains only its essential characteristics. In this framework, model abstraction is performed with abstract states and probabilistic models defined over them.
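To make the idea concrete, here is a minimal, self-contained sketch of the two ingredients: abstracting hidden-state trajectories into discrete states, and fitting a first-order Markov (transition) model over those states. This is an illustration of the technique only, not the repository's implementation; the toy data, PCA dimension, and number of abstract states are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy stand-in for per-token hidden states collected during generation:
# a list of trajectories, each of shape (num_tokens, hidden_dim).
rng = np.random.default_rng(0)
trajectories = [rng.normal(size=(20, 768)) for _ in range(10)]

# 1) State abstraction: reduce dimensionality, then discretize into abstract states.
flat = np.concatenate(trajectories)                 # (total_tokens, hidden_dim)
reduced = PCA(n_components=8).fit_transform(flat)   # (total_tokens, 8)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)

# 2) Probabilistic model: estimate a Markov transition matrix over abstract states.
num_states = 5
counts = np.ones((num_states, num_states))          # Laplace smoothing
offset = 0
for traj in trajectories:
    seq = labels[offset:offset + len(traj)]
    offset += len(traj)
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
transition = counts / counts.sum(axis=1, keepdims=True)
print(transition.round(2))
```

Roughly speaking, PoLLMgraph binds such transition statistics to truthfulness annotations so that a new generation can be scored by its state transition dynamics; see the paper for the precise formulation.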
- Purpose: provides a base for creating probabilistic models based on abstracted states.
- Usage Examples:

  ```python
  # Initialize the ProbabilisticModel
  prob_model = ProbabilisticModel(args)

  # Evaluate LLM performance on a dataset task
  prob_model.eval_llm_performance_on_dataset_task()

  # Compose scores with ground truths
  prob_model.compose_scores_with_groundtruths_pair()
  ```
- Purpose: extracts abstract states from the provided data instances.
- Usage Examples:

  ```python
  # Initialize the AbstractStateExtraction
  state_extractor = AbstractStateExtraction(args)

  # Perform PCA on the data
  state_extractor.perform_pca()

  # See the class source for additional methods.
  ```
Metrics provide a quantitative measure to evaluate the performance and characteristics of models. In our framework, metrics evaluate the quality and behavior of abstracted models.
- Purpose: acts as a central utility for metric evaluations based on state abstractions.
- Usage Examples:

  ```python
  # Initialize the MetricsAppEvalCollections
  metrics_evaluator = MetricsAppEvalCollections(
      args_obj1, args_obj2, train_data, val_data, test_data
  )

  # Retrieve evaluation results
  aucroc, accuracy, f1_score, _, _, _ = metrics_evaluator.get_eval_result()

  # Calculate the preciseness of predictions
  preciseness_mean, preciseness_max = metrics_evaluator.preciseness()
  ```
To run the whole PoLLMgraph (MM) pipeline:

```bash
python demo.py
```
@inproceedings{zhu-etal-2024-pollmgraph,
title = "{P}o{LLM}graph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics",
author = "Zhu, Derui and Chen, Dingfan and Li, Qing and Chen, Zongxiong and Ma, Lei and Grossklags, Jens and Fritz, Mario",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
publisher = "Association for Computational Linguistics",
year = "2024",
}