[Question] AI Evaluations & Testing #383

Open
gyliu513 opened this issue Jan 8, 2025 · 3 comments

gyliu513 commented Jan 8, 2025

I was doing some testing of https://docs.langtrace.ai/features/evaluations and found some problems.

  1. When I create a new dataset, I see that I need to fill in the parameters shown below, but only Input and Expected Output are required.
[Screenshot 2025-01-08 at 1 57 37 PM]
  2. Then I create a bad dataset with ID cm4lrz7tq00075jmgkdtlq6w4. The answer to "Does RTP NC US have an IBM office?" should be Yes, so this is a bad dataset.
[Screenshot 2025-01-08 at 1 58 54 PM]
  3. Then I create a sample eval program as below, in a file named example_eval.py:
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import self_critique, generate

@task
def example_eval():
    try:
        dataset = csv_dataset("langtracefs://cm4lrz7tq00075jmgkdtlq6w4")
        plan = [
            generate(),
            self_critique(model="openai/gpt-4o")
        ]
        scorer = model_graded_fact()

        return Task(
            dataset=dataset,
            plan=plan,
            scorer=scorer
        )
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

And then run the above program as follows:

export INSPECT_LOG_FORMAT=json
export OPENAI_API_KEY="sk-..."
inspect eval example_eval.py --model openai/gpt-3.5-turbo --log-dir langtracefs://cm4lrz7tq00075jmgkdtlq6w4

Here is the output:

(bedrock) gyliu513@Guangyas-MacBook-Pro langtrace % inspect eval example_eval.py --model openai/gpt-3.5-turbo --log-dir langtracefs://cm4lrz7tq00075jmgkdtlq6w4
Fetching dataset with id: cm4lrz7tq00075jmgkdtlq6w4 from Langtrace
Successfully fetched dataset with id: cm4lrz7tq00075jmgkdtlq6w4 from Langtrace
Sending results to Langtrace for dataset: cm4lrz7tq00075jmgkdtlq6w4

Results sent to Langtrace successfully.

╭─ example_eval (1 sample): openai/gpt-3.5-turbo ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ total time:                                               0:00:03                                                                                                                                     dataset: cm4lrz7tq00075jmgkdtlq6w4 │
│ openai/gpt-3.5-turbo                                      553 tokens [I: 438, O: 115]                                                                                                                                                    │
│ openai/gpt-4o                                             108 tokens [I: 100, O: 8]                                                                                                                                                      │
│                                                                                                                                                                                                                                          │
│ accuracy: 0  stderr: 0                                                                                                                                                                                                                   │
│                                                                                                                                                                                                                                          │
│ Log: langtracefs://cm4lrz7tq00075jmgkdtlq6w4/2025-01-08T13-38-44-05-00_example-eval_kTwLzEvYS3BZjxbyoFpyRs.json                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

From my understanding, openai/gpt-3.5-turbo is used to generate the answers for the dataset samples, and openai/gpt-4o is used for the self-critique step.

The detailed evaluation info is shown below:

[Screenshot 2025-01-08 at 2 03 17 PM]

[Screenshot 2025-01-08 at 2 05 32 PM]

Here are the questions:

  1. What is the dataset used for? It seems the evaluation only uses the input and does not use the expected output. From the screenshot above, the evaluation plan is generate followed by self_critique, and it does not evaluate against the dataset's expected output, is that right? As far as I can see, the dataset should be treated as the golden signal and the evaluation should be scored against it. Can you help explain what the expected output in the dataset is used for? (I sketched what I expected right after this list.)
  2. From the picture above, we can see that self_critique actually used more tokens than generate, so why do the evaluation results show more tokens for openai/gpt-3.5-turbo, which handles the generate step?
openai/gpt-3.5-turbo                                      553 tokens [I: 438, O: 115]
openai/gpt-4o                                             108 tokens [I: 100, O: 8]
  3. Do you have a list of the grade values for model_graded_fact? In my test the grade is I; what does I mean?
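To make question 1 concrete, here is a minimal sketch of what I expected: the same dataset, but scored directly against the Expected Output via the built-in match() scorer. This is only my assumption about how the target is meant to be consumed; please correct me if the Langtrace integration works differently.

from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def expected_output_eval():
    # Same Langtrace dataset as above; I assume the Expected Output column
    # becomes each sample's target.
    dataset = csv_dataset("langtracefs://cm4lrz7tq00075jmgkdtlq6w4")

    return Task(
        dataset=dataset,
        plan=[generate()],
        # match() checks the generated answer against the sample target
        # (the Expected Output) instead of asking a grader model.
        scorer=match(),
    )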

Thanks!

@karthikscale3 ^^

karthikscale3 (Contributor) commented

@gyliu513 - Thanks for the detailed question. Yes, we are aware this is a point of confusion. The expected output can be used purely for evaluation. Let me respond to you with some code samples and the exact procedure to follow in a bit.
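As a rough sketch of the idea (using the generic inspect_ai custom-scorer pattern rather than anything Langtrace-specific), the expected output surfaces as the target that is passed to every scorer, so you can grade against it directly:

from inspect_ai.scorer import (
    CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def expected_output_match():
    async def score(state: TaskState, target: Target) -> Score:
        # state.output.completion is the generated answer;
        # target.text carries the dataset's Expected Output.
        answer = state.output.completion.strip()
        is_correct = target.text.strip().lower() in answer.lower()
        return Score(
            value=CORRECT if is_correct else INCORRECT,
            answer=answer,
            explanation=f"expected: {target.text!r}",
        )

    return score

You would then pass scorer=expected_output_match() in the Task above. The built-in model_graded_fact() also receives the target, but hands it to a grader model and reports letter grades, C for correct, I for incorrect, and P for partial credit, so an I grade means the grader judged the answer incorrect against the expected output.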

github-actions bot added the stale label Jan 23, 2025

github-actions bot commented

This issue has been automatically marked as stale due to inactivity. It will be closed in 3 days if no further activity occurs.

github-actions bot commented

This issue has been automatically marked as stale due to inactivity. It will be closed in 3 days if no further activity occurs.
