This repo contains a library and samples for evaluating the output of an LLM, similar to what we use in a RAG architecture. There are many parameters that affect the quality and style of answers generated by a chat app, such as the system prompt, search parameters, and model parameters.
Table of contents:
- Setting up this project
- Models to test, to evaluate and to generate data
- Generating ground truth data
- Running an evaluation
- Viewing the results
The sample console application includes several use cases:
- Generate and evaluate a QA using an LLM
- Evaluate 2 hardcoded QAs
- Evaluate a hardcoded User Story
- Batch evaluate a list of User Stories loaded from a file
- Generate and evaluate a list of QAs using an LLM
- Manually generate a QA and evaluate it
- Manually evaluate a QA
This is a sample of a user typing a QA, the test model generating the answer for the QA, and the LLMEval.Core library performing the evaluation.
In this sample, a list of User Stories is loaded from a file, and the LLMEval.Core library performs the evaluation process.
If you open this project in a Dev Container or GitHub Codespaces, it will automatically set up the environment for you. If not, then follow these steps:
- Install .NET 8 or higher
- Install Visual Studio 2022 or Visual Studio Code
It's best to use a GPT-4 model for performing the evaluation, even if your chat app uses GPT-3.5 or another model. You can use an Azure OpenAI instance, an openai.com instance, or any LLM supported by Semantic Kernel.
The KernelFactory class (src/LLMEval.Test/KernelFactory.cs) creates the necessary kernels: one to perform the evaluation, one to generate ground truth data, and one for the LLM to be tested.
The following code shows how to create the 3 different kernels:
// ========================================
// create kernels
// ========================================
var kernelEval = KernelFactory.CreateKernelEval();
var kernelTest = KernelFactory.CreateKernelEval();
var kernelGenData = KernelFactory.CreateKernelGenerateData();
You can use .NET User Secrets to store the information needed to create the kernels. For example, run these commands in bash to store the settings for an Azure OpenAI Service endpoint:
dotnet user-secrets init
dotnet user-secrets set "AZURE_OPENAI_MODEL" "Azure OpenAI Deployment Name"
dotnet user-secrets set "AZURE_OPENAI_ENDPOINT" "Azure OpenAI Deployment Endpoint"
dotnet user-secrets set "AZURE_OPENAI_KEY" "Azure OpenAI Deployment Key"
Now the KernelFactory class can create a kernel with the following code:
public static Kernel CreateKernelEval()
{
var config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
config["AZURE_OPENAI_MODEL"],
config["AZURE_OPENAI_ENDPOINT"],
config["AZURE_OPENAI_KEY"]);
return builder.Build();
}
For more details, see the Project Secrets in Development guidance
Semantic Kernel also allows access to any model that supports the OpenAI Chat Completions API.
The following example creates a kernel using a Phi-3 model deployed and hosted locally with Ollama.
public static Kernel CreateKernelEval()
{
var config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion(
modelId: "phi3",
endpoint: new Uri("http://localhost:11434"),
apiKey: "api");
return builder.Build();
}
This article explains how to host and work with local LLMs: Local Models
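The samples above target Azure OpenAI and Ollama, but the same factory pattern works for openai.com. The following is a minimal sketch, assuming the model id and API key are stored in user secrets under hypothetical names (OPENAI_MODEL, OPENAI_API_KEY) that are not part of this repo:
public static Kernel CreateKernelEval()
{
    var config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    var builder = Kernel.CreateBuilder();
    // hypothetical secret names; adjust to your own configuration
    builder.AddOpenAIChatCompletion(
        modelId: config["OPENAI_MODEL"],
        apiKey: config["OPENAI_API_KEY"]);
    return builder.Build();
}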
In order to evaluate new answers, they must be compared to "ground truth" answers: the ideal answer for a particular question. We recommend at least 200 QA pairs if possible.
There are a few ways to get this data:
- Manually curate a set of questions and answers that you consider to be ideal. This is the most accurate, but also the most time-consuming. Make sure your answers include citations in the expected format. This approach requires domain expertise in the data.
- Use the generator class to generate a set of questions and answers. This is the fastest, but may also be the least accurate.
- Use the generator class to generate a set of questions and answers, and then manually curate them, rewriting any answers that are subpar and adding missing citations. This is a good middle ground, and is what we recommend.
Additional tips for ground truth data generation
- Generate more QA pairs than you need, then prune them down manually based on quality and overlap. Remove low quality answers, and remove questions that are too similar to other questions.
- Be aware of the knowledge distribution in the document set, so you effectively sample questions across the knowledge space.
- Once your chat application is live, continually sample live user questions (in accordance with your privacy policy) to make sure you're representing the sorts of questions that users are asking.
This repo includes a set of libraries to process and generate questions and answers from a specific topic.
The following code is part of the main program in the console application and shows how to generate a QA from a topic. Each QA has fields for the question, the answer, and the topic.
// ask for the topic to generate the QAs
var topic = SpectreConsoleOutput.AskForString("Type the topic to generate the QA?");
var qa = await QALLMGenerator.GenerateQA(kernelGenData, topic);
var json = JsonSerializer.Serialize(qa, new JsonSerializerOptions
{
WriteIndented = true
});
SpectreConsoleOutput.DisplayJson(json, "Generated QA using LLM", true);
The demo code also displays the generated QA in JSON format in the console.
The generator can also generate a collection of QAs. The following code shows how to generate 3 sample QAs and convert the generated collection to JSON.
// generate a collection of QAs using llms
var llmGenQAs = await QALLMGenerator.GenerateQACollection(kernelGenData, 3);
// convert llmGenQAs to json
var json = JsonSerializer.Serialize(llmGenQAs, new JsonSerializerOptions
{
WriteIndented = true
});
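If you follow the recommended generate-then-curate workflow described above, you can persist the generated collection to disk for manual review. This is a minimal sketch that reuses the json variable from the snippet above; the file name is just an example:
// save the generated QAs so they can be curated by hand (example file name)
File.WriteAllText("ground-truth-candidates.json", json);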
This is a sample of the generated JSON QAs:
[
{
"Question": "What were the primary causes of World War I?",
"Answer": "The primary causes of World War I were militarism, alliances, imperialism, and nationalism.",
"Topic": "History"
},
{
"Question": "What is the formula to calculate the area of a circle?",
"Answer": "The formula to calculate the area of a circle is A = \u03C0r^2, where A is the area and r is the radius of the circle.",
"Topic": "Math"
},
{
"Question": "Who is the author of the book \u0027To Kill a Mockingbird\u0027?",
"Answer": "Harper Lee",
"Topic": "Literature"
}
]
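Because each entry maps directly to the QA type (Question, Answer, Topic), a curated file in this shape can be loaded back for evaluation. This is a minimal sketch, assuming the file contains a JSON array of QAs and deserializes to a List<QA>; the file name is hypothetical:
// load manually curated ground truth back into QA objects (example file name)
var curatedJson = File.ReadAllText("ground-truth-curated.json");
var curatedQAs = JsonSerializer.Deserialize<List<QA>>(curatedJson);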
Once a QA or a collection of QAs is available, the LLMEval.Core library can run the evaluation against the collection. The following image shows the console output of 2 different evaluated QAs.
This is the code needed to run the evaluation of a single QA (batchEval is the LLMEval instance whose creation is shown further below):
var qaProcessor = new QACreator.QACreator(kernelTest);
var qa = new QA
{
Question = "two plus two",
Answer = "'4' or 'four'",
Topic = "Math"
};
var processResult = await qaProcessor.Process(qa);
var results = await batchEval.ProcessSingle(processResult);
results.EvalRunName = "Harcoded QA 1";
SpectreConsoleOutput.DisplayResults(results);
The LLMEval.Core class uses the metrics specified when the object is created. These are custom metrics that we've added, based on the Python sample: Evaluating a RAG Chat App.
The following code shows how to add the 3 main prompt-based custom evaluators and a length evaluator:
- coherence
- groundedness
- relevance
- length
// ========================================
// create LLMEval and add evaluators
// ========================================
var kernelEvalFunctions = kernelEval.CreatePluginFromPromptDirectory("Prompts");
var batchEval = new Core.LLMEval();
batchEval
.AddEvaluator(new PromptScoreEval("coherence", kernelEval, kernelEvalFunctions["coherence"]))
.AddEvaluator(new PromptScoreEval("groundedness", kernelEval, kernelEvalFunctions["groundedness"]))
.AddEvaluator(new PromptScoreEval("relevance", kernelEval, kernelEvalFunctions["relevance"]))
.AddEvaluator(new LenghtEval());
batchEval.SetMeterId("phi3");
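Additional prompt-based evaluators follow the same pattern. As a sketch, if you added a new prompt folder under the Prompts directory (for example, a hypothetical fluency prompt with its own skprompt.txt), CreatePluginFromPromptDirectory would pick it up and it could be registered like the others:
// "fluency" is a hypothetical prompt folder, not part of this repo
batchEval.AddEvaluator(new PromptScoreEval("fluency", kernelEval, kernelEvalFunctions["fluency"]));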
The following metrics are implemented very similarly to the built-in metrics provided by Azure AI Studio, but they use a locally stored prompt. They're a good fit when you don't have access to the built-in metrics.
- coherence: Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. Based on src/LLMEval.Core/_prompts/coherence/skprompt.txt.
- relevance: Assesses the ability of answers to capture the key points of the context. Based on src/LLMEval.Core/_prompts/relevance/skprompt.txt.
- groundedness: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Based on src/LLMEval.Core/_prompts/groundedness/skprompt.txt.
Azure AI Studio provides a set of Built-In metrics. These metrics are calculated by sending a call to the GPT model, asking it to provide a 1-5 rating, and storing that rating.
- gpt_coherence: measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
- gpt_relevance: assesses the ability of answers to capture the key points of the context.
- gpt_groundedness: assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.
The results of each evaluation are stored in memory. There are different options to export the results to JSON or CSV. By default, the sample console application shows the output in a table.
The following code shows an example of these steps:
- define a source file that stores the sample data: assets/data-15.json
- generate a collection of User Inputs from the file data
- perform the evaluation
- show the output in the console
// ========================================
// evaluate a batch of inputs for User Stories from a file
// ========================================
SpectreConsoleOutput.DisplayTitleH2("Processing batch of User Stories");
var fileName = "assets/data-15.json";
Console.WriteLine($"Processing {fileName} ...");
Console.WriteLine("");
// load the sample data
var userStoryCreator = new UserStoryCreator.UserStoryCreator(kernelTest);
var userInputCollection = await UserStoryGenerator.FileProcessor.ProcessUserInputFile(fileName);
var modelOutputCollection = await userStoryCreator.ProcessCollection(userInputCollection);
var results = await batchEval.ProcessCollection(modelOutputCollection);
results.EvalRunName = "User Story collection from file";
SpectreConsoleOutput.DisplayResults(results);
The LLMEval.Outputs library provides the capability to export the results to JSON. The following code shows how to export the results to JSON, save them to a file, and display them in the console:
// convert results to json, save the results and display them in the console
var json = LLMEval.Outputs.ExportToJson.CreateJson(results);
LLMEval.Outputs.ExportToJson.SaveJson(results, "results.json");
SpectreConsoleOutput.DisplayJson(json, "User Story collection from file", true);
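The results can also be exported to CSV, as mentioned earlier. The exact CSV API isn't shown in this document, so the following is only a sketch that assumes a class shaped like ExportToJson; the ExportToCsv name and its methods are hypothetical:
// hypothetical CSV counterpart to ExportToJson; check LLMEval.Outputs for the actual API
var csv = LLMEval.Outputs.ExportToCsv.CreateCsv(results);
LLMEval.Outputs.ExportToCsv.SaveCsv(results, "results.csv");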