This repo contains a library and samples for evaluating the output of an LLM, similar to what we use in a RAG architecture. There are many parameters that affect the quality and style of answers generated by a chat app, such as the system prompt, search parameters, and model parameters.
Table of contents:
- Setting up this project
- Models to test, to evaluate and to generate data
- Generating ground truth data
- Running an evaluation
- Viewing the results
The sample console application includes several use cases:
- Generate and evaluate a QA using an LLM
- Evaluate 2 hardcoded QAs
- Evaluate a hardcoded User Story
- Batch evaluate a list of User Stories loaded from a file
- Generate and evaluate a list of QAs using an LLM
- Manually generate a QA and evaluate it
- Manually evaluate a QA
This is a sample of a user typing a QA, the test model generating the answer for the QA, and the LLMEval.Core library performing the evaluation.
In this sample, a list of User Stories is loaded from a file, and the LLMEval.Core library performs the evaluation process.
If you open this project in a Dev Container or GitHub Codespaces, it will automatically set up the environment for you. If not, then follow these steps:
- Install .NET 8 or higher
- Install Visual Studio 2022 or Visual Studio Code
It's best to use a GPT-4 model for performing the evaluation, even if your chat app uses GPT-3.5 or another model. You can use an Azure OpenAI instance, an openai.com instance, or any LLM supported by Semantic Kernel.
The KernelFactory class (src/LLMEval.Test/KernelFactory.cs) creates the necessary kernels: one to perform the evaluation, one to generate ground truth data, and one for the LLM to be tested.
The following code shows how to create the 3 different kernels:
// ========================================
// create kernels
// ========================================
var kernelEval = KernelFactory.CreateKernelEval();
var kernelTest = KernelFactory.CreateKernelEval();
var kernelGenData = KernelFactory.CreateKernelGenerateData();
You can use .NET User Secrets to store the information needed to create the kernels. For example, run these commands in bash to store the settings for an Azure OpenAI Service endpoint:
dotnet user-secrets init
dotnet user-secrets set "AZURE_OPENAI_MODEL" "Azure OpenAI Deployment Name"
dotnet user-secrets set "AZURE_OPENAI_ENDPOINT" "Azure OpenAI Deployment Endpoint"
dotnet user-secrets set "AZURE_OPENAI_KEY" "Azure OpenAI Deployment Key"
Now the KernelFactory class can create a kernel with the following code:
public static Kernel CreateKernelEval()
{
var config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
config["AZURE_OPENAI_MODEL"],
config["AZURE_OPENAI_ENDPOINT"],
config["AZURE_OPENAI_KEY"]);
return builder.Build();
}
For more details, see the Project Secrets in Development guidance
Semantic Kernel also allows access to any model that supports the OpenAI Chat Completions API.
The following example creates a kernel using a Phi-3 model deployed and hosted locally with Ollama.
public static Kernel CreateKernelEval()
{
var config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion(
modelId: "phi3",
endpoint: new Uri("http://localhost:11434"),
apiKey: "api");
return builder.Build();
}
This article explains how to host and work with local LLMs: Local Models
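The samples above target Azure OpenAI and Ollama, but the same factory pattern works for openai.com. The following is a minimal sketch, assuming the model id and API key are stored in user secrets under hypothetical names (OPENAI_MODEL, OPENAI_API_KEY) that are not part of this repo:
public static Kernel CreateKernelEval()
{
    var config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    var builder = Kernel.CreateBuilder();
    // hypothetical secret names; adjust to your own configuration
    builder.AddOpenAIChatCompletion(
        modelId: config["OPENAI_MODEL"],
        apiKey: config["OPENAI_API_KEY"]);
    return builder.Build();
}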
In order to evaluate new answers, they must be compared to "ground truth" answers: the ideal answer for a particular question. We recommend at least 200 QA pairs if possible.
There are a few ways to get this data:
- Manually curate a set of questions and answers that you consider to be ideal. This is the most accurate, but also the most time-consuming. Make sure your answers include citations in the expected format. This approach requires domain expertise in the data.
- Use the generator class to generate a set of questions and answers. This is the fastest, but may also be the least accurate.
- Use the generator class to generate a set of questions and answers, and then manually curate them, rewriting any answers that are subpar and adding missing citations. This is a good middle ground, and is what we recommend.
Additional tips for ground truth data generation
- Generate more QA pairs than you need, then prune them down manually based on quality and overlap. Remove low quality answers, and remove questions that are too similar to other questions.
- Be aware of the knowledge distribution in the document set, so you effectively sample questions across the knowledge space.
- Once your chat application is live, continually sample live user questions (in accordance with your privacy policy) to make sure you're representing the sorts of questions that users are asking.
This repo includes a set of libraries to process and generate questions and answers from a specific topic.
The following code is part of the main program in the console application and shows how to generate a QA from a topic. Each QA has fields for the question, the answer, and the topic.
// ask for the topic to generate the QAs
var topic = SpectreConsoleOutput.AskForString("Type the topic to generate the QA?");
var qa = await QALLMGenerator.GenerateQA(kernelGenData, topic);
var json = JsonSerializer.Serialize(qa, new JsonSerializerOptions
{
WriteIndented = true
});
SpectreConsoleOutput.DisplayJson(json, "Generated QA using LLM", true);
The demo code also displays the generated QA in JSON format in the console.
The generator can also generate a collection of QAs. The following code shows how to generate 3 sample QAs and convert the generated collection to JSON.
// generate a collection of QAs using llms
var llmGenQAs = await QALLMGenerator.GenerateQACollection(kernelGenData, 3);
// convert llmGenQAs to json
var json = JsonSerializer.Serialize(llmGenQAs, new JsonSerializerOptions
{
WriteIndented = true
});
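If you follow the recommended generate-then-curate workflow described above, you can persist the generated collection to disk for manual review. This is a minimal sketch that reuses the json variable from the snippet above; the file name is just an example:
// save the generated QAs so they can be curated by hand (example file name)
File.WriteAllText("ground-truth-candidates.json", json);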
This is a sample of the generated JSON QAs:
[
{
"Question": "What were the primary causes of World War I?",
"Answer": "The primary causes of World War I were militarism, alliances, imperialism, and nationalism.",
"Topic": "History"
},
{
"Question": "What is the formula to calculate the area of a circle?",
"Answer": "The formula to calculate the area of a circle is A = \u03C0r^2, where A is the area and r is the radius of the circle.",
"Topic": "Math"
},
{
"Question": "Who is the author of the book \u0027To Kill a Mockingbird\u0027?",
"Answer": "Harper Lee",
"Topic": "Literature"
}
]
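Because each entry maps directly to the QA type (Question, Answer, Topic), a curated file in this shape can be loaded back for evaluation. This is a minimal sketch, assuming the file contains a JSON array of QAs and deserializes to a List<QA>; the file name is hypothetical:
// load manually curated ground truth back into QA objects (example file name)
var curatedJson = File.ReadAllText("ground-truth-curated.json");
var curatedQAs = JsonSerializer.Deserialize<List<QA>>(curatedJson);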
Once a QA or a collection of QAs is available, the LLMEval.Core library can run the evaluation against the collection. The following image shows the console output of 2 different evaluated QAs.
This is the code needed to run the evaluation of a single QA (batchEval is the LLMEval instance whose creation is shown further below):
var qaProcessor = new QACreator.QACreator(kernelTest);
var qa = new QA
{
Question = "two plus two",
Answer = "'4' or 'four'",
Topic = "Math"
};
var processResult = await qaProcessor.Process(qa);
var results = await batchEval.ProcessSingle(processResult);
results.EvalRunName = "Harcoded QA 1";
SpectreConsoleOutput.DisplayResults(results);
The LLMEval.Core class uses the metrics specified when the object is created. These are custom metrics that we've added, based on the Python sample: Evaluating a RAG Chat App.
The following code shows how to add the 3 main prompt-based custom evaluators and a length evaluator:
- coherence
- groundedness
- relevance
- length
// ========================================
// create LLMEval and add evaluators
// ========================================
var kernelEvalFunctions = kernelEval.CreatePluginFromPromptDirectory("Prompts");
var batchEval = new Core.LLMEval();
batchEval
.AddEvaluator(new PromptScoreEval("coherence", kernelEval, kernelEvalFunctions["coherence"]))
.AddEvaluator(new PromptScoreEval("groundedness", kernelEval, kernelEvalFunctions["groundedness"]))
.AddEvaluator(new PromptScoreEval("relevance", kernelEval, kernelEvalFunctions["relevance"]))
.AddEvaluator(new LenghtEval());
batchEval.SetMeterId("phi3");
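Additional prompt-based evaluators follow the same pattern. As a sketch, if you added a new prompt folder under the Prompts directory (for example, a hypothetical fluency prompt with its own skprompt.txt), CreatePluginFromPromptDirectory would pick it up and it could be registered like the others:
// "fluency" is a hypothetical prompt folder, not part of this repo
batchEval.AddEvaluator(new PromptScoreEval("fluency", kernelEval, kernelEvalFunctions["fluency"]));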
The following metrics are implemented very similarly to the built-in metrics provided by Azure AI Studio, but they use a locally stored prompt. They're a good fit when you don't have access to the built-in metrics.
- coherence: Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. Based on src/LLMEval.Core/_prompts/coherence/skprompt.txt.
- relevance: Assesses the ability of answers to capture the key points of the context. Based on src/LLMEval.Core/_prompts/relevance/skprompt.txt.
- groundedness: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Based on src/LLMEval.Core/_prompts/groundedness/skprompt.txt.
Azure AI Studio provides a set of Built-In metrics. These metrics are calculated by sending a call to the GPT model, asking it to provide a 1-5 rating, and storing that rating.
- gpt_coherence: measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
- gpt_relevance: assesses the ability of answers to capture the key points of the context.
- gpt_groundedness: assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.
The results of each evaluation are stored in memory. There are different options to export the results to JSON or CSV. By default, the sample console application shows the output in a table.
The following code shows an example of these steps:
- define a source file that stores the sample data: assets/data-15.json
- generate a collection of User Inputs from the file data
- perform the evaluation
- show the output in the console
// ========================================
// evaluate a batch of inputs for User Stories from a file
// ========================================
SpectreConsoleOutput.DisplayTitleH2("Processing batch of User Stories");
var fileName = "assets/data-15.json";
Console.WriteLine($"Processing {fileName} ...");
Console.WriteLine("");
// load the sample data
var userStoryCreator = new UserStoryCreator.UserStoryCreator(kernelTest);
var userInputCollection = await UserStoryGenerator.FileProcessor.ProcessUserInputFile(fileName);
var modelOutputCollection = await userStoryCreator.ProcessCollection(userInputCollection);
var results = await batchEval.ProcessCollection(modelOutputCollection);
results.EvalRunName = "User Story collection from file";
SpectreConsoleOutput.DisplayResults(results);
The LLMEval.Outputs library provides the capability to export the results to JSON. The following code shows how to export the results to JSON, save them to a file, and display them in the console:
// convert results to json, save the results and display them in the console
var json = LLMEval.Outputs.ExportToJson.CreateJson(results);
LLMEval.Outputs.ExportToJson.SaveJson(results, "results.json");
SpectreConsoleOutput.DisplayJson(json, "User Story collection from file", true);
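The results can also be exported to CSV, as mentioned earlier. The exact CSV API isn't shown in this document, so the following is only a sketch that assumes a class shaped like ExportToJson; the ExportToCsv name and its methods are hypothetical:
// hypothetical CSV counterpart to ExportToJson; check LLMEval.Outputs for the actual API
var csv = LLMEval.Outputs.ExportToCsv.CreateCsv(results);
LLMEval.Outputs.ExportToCsv.SaveCsv(results, "results.csv");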