[Question] AI Evaluations & Testing #383

Open
gyliu513 opened this issue Jan 8, 2025 · 3 comments

gyliu513 commented Jan 8, 2025

I was doing some testing of https://docs.langtrace.ai/features/evaluations and found some problems.

  1. When I create a new dataset, I see that I need to fill in the parameters shown below, but only Input and Expected Output are required.
[Screenshot 2025-01-08 at 1 57 37 PM]
  2. Then I create a bad dataset with ID cm4lrz7tq00075jmgkdtlq6w4. The answer to "Does RTP NC US have an IBM office?" should be Yes, so this is a bad dataset.
[Screenshot 2025-01-08 at 1 58 54 PM]
  3. Then I create a sample eval program as below, in a file named example_eval.py:
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import self_critique, generate

@task
def example_eval():
    try:
        dataset = csv_dataset("langtracefs://cm4lrz7tq00075jmgkdtlq6w4")
        plan = [
            generate(),
            self_critique(model="openai/gpt-4o")
        ]
        scorer = model_graded_fact()

        return Task(
            dataset=dataset,
            plan=plan,
            scorer=scorer
        )
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

And then run the above program as follows:

export INSPECT_LOG_FORMAT=json
export OPENAI_API_KEY="sk-..."
inspect eval example_eval.py --model openai/gpt-3.5-turbo --log-dir langtracefs://cm4lrz7tq00075jmgkdtlq6w4

Here is the output:

(bedrock) gyliu513@Guangyas-MacBook-Pro langtrace % inspect eval example_eval.py --model openai/gpt-3.5-turbo --log-dir langtracefs://cm4lrz7tq00075jmgkdtlq6w4
Fetching dataset with id: cm4lrz7tq00075jmgkdtlq6w4 from Langtrace
Successfully fetched dataset with id: cm4lrz7tq00075jmgkdtlq6w4 from Langtrace
Sending results to Langtrace for dataset: cm4lrz7tq00075jmgkdtlq6w4

Results sent to Langtrace successfully.

╭─ example_eval (1 sample): openai/gpt-3.5-turbo ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ total time:                                               0:00:03                                                                                                                                     dataset: cm4lrz7tq00075jmgkdtlq6w4 │
│ openai/gpt-3.5-turbo                                      553 tokens [I: 438, O: 115]                                                                                                                                                    │
│ openai/gpt-4o                                             108 tokens [I: 100, O: 8]                                                                                                                                                      │
│                                                                                                                                                                                                                                          │
│ accuracy: 0  stderr: 0                                                                                                                                                                                                                   │
│                                                                                                                                                                                                                                          │
│ Log: langtracefs://cm4lrz7tq00075jmgkdtlq6w4/2025-01-08T13-38-44-05-00_example-eval_kTwLzEvYS3BZjxbyoFpyRs.json                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

From my understanding, openai/gpt-3.5-turbo is used to generate the answers for the dataset samples, and openai/gpt-4o is used for the self-critique step.

The detailed evaluation info is shown below:

[Screenshot 2025-01-08 at 2 03 17 PM]

[Screenshot 2025-01-08 at 2 05 32 PM]

Here are the questions:

  1. What is the dataset used for? It seems the evaluation only uses the input and does not use the expected output. From the screenshot above, the evaluation plan is generate followed by self_critique, and it does not evaluate against the dataset's expected output, is that right? As far as I can see, the dataset should be treated as the golden signal and the evaluation should be scored against it. Can you help explain what the expected output in the dataset is used for? (I sketched what I expected right after this list.)
  2. From the picture above, we can see that self_critique actually used more tokens than generate, so why do the evaluation results show more tokens for openai/gpt-3.5-turbo, which handles the generate step?
openai/gpt-3.5-turbo                                      553 tokens [I: 438, O: 115]
openai/gpt-4o                                             108 tokens [I: 100, O: 8]
  3. Do you have a list of the grade values for model_graded_fact? In my test the grade is I; what does I mean?
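To make question 1 concrete, here is a minimal sketch of what I expected: the same dataset, but scored directly against the Expected Output via the built-in match() scorer. This is only my assumption about how the target is meant to be consumed; please correct me if the Langtrace integration works differently.

from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def expected_output_eval():
    # Same Langtrace dataset as above; I assume the Expected Output column
    # becomes each sample's target.
    dataset = csv_dataset("langtracefs://cm4lrz7tq00075jmgkdtlq6w4")

    return Task(
        dataset=dataset,
        plan=[generate()],
        # match() checks the generated answer against the sample target
        # (the Expected Output) instead of asking a grader model.
        scorer=match(),
    )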

Thanks!

@karthikscale3 ^^

karthikscale3 (Contributor) commented

@gyliu513 - Thanks for the detailed question. Yes, we are aware this is a point of confusion. The expected output can be used purely for evaluation. Let me respond to you with some code samples and the exact procedure to follow in a bit.
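As a rough sketch of the idea (using the generic inspect_ai custom-scorer pattern rather than anything Langtrace-specific), the expected output surfaces as the target that is passed to every scorer, so you can grade against it directly:

from inspect_ai.scorer import (
    CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def expected_output_match():
    async def score(state: TaskState, target: Target) -> Score:
        # state.output.completion is the generated answer;
        # target.text carries the dataset's Expected Output.
        answer = state.output.completion.strip()
        is_correct = target.text.strip().lower() in answer.lower()
        return Score(
            value=CORRECT if is_correct else INCORRECT,
            answer=answer,
            explanation=f"expected: {target.text!r}",
        )

    return score

You would then pass scorer=expected_output_match() in the Task above. The built-in model_graded_fact() also receives the target, but hands it to a grader model and reports letter grades, C for correct, I for incorrect, and P for partial credit, so an I grade means the grader judged the answer incorrect against the expected output.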

github-actions bot added the stale label Jan 23, 2025

github-actions bot commented

This issue has been automatically marked as stale due to inactivity. It will be closed in 3 days if no further activity occurs.

github-actions bot commented

This issue has been automatically marked as stale due to inactivity. It will be closed in 3 days if no further activity occurs.
