```
(bedrock) gyliu513@Guangyas-MacBook-Pro langtrace % inspect eval example_eval.py --model openai/gpt-3.5-turbo --log-dir langtracefs://cm4lrz7tq00075jmgkdtlq6w4
Fetching dataset with id: cm4lrz7tq00075jmgkdtlq6w4 from Langtrace
Successfully fetched dataset with id: cm4lrz7tq00075jmgkdtlq6w4 from Langtrace
Sending results to Langtrace for dataset: cm4lrz7tq00075jmgkdtlq6w4
Results sent to Langtrace successfully.
╭─ example_eval (1 sample): openai/gpt-3.5-turbo ──────────────────────────────╮
│ total time: 0:00:03              dataset: cm4lrz7tq00075jmgkdtlq6w4          │
│ openai/gpt-3.5-turbo             553 tokens [I: 438, O: 115]                 │
│ openai/gpt-4o                    108 tokens [I: 100, O: 8]                   │
│                                                                              │
│ accuracy: 0  stderr: 0                                                       │
│                                                                              │
│ Log: langtracefs://cm4lrz7tq00075jmgkdtlq6w4/2025-01-08T13-38-44-05-00_example-eval_kTwLzEvYS3BZjxbyoFpyRs.json │
╰──────────────────────────────────────────────────────────────────────────────╯
```
From my understanding, `openai/gpt-3.5-turbo` is used to generate the answers for the dataset, and `openai/gpt-4o` is used for the self-critique step.
The detailed evaluation info is as below:
Here are the questions:
What is the use of the dataset? It seems the evaluation only used the `input` and never used the `expected output`. From the above screenshot, we can see the evaluation plan is `generate` then `self_critique`; it does not evaluate against the dataset's `expected output`, is that right? As far as I can see, the dataset should be treated as the golden signal, and the evaluation should be based on it. Can you help explain what the `expected output` in the dataset is used for?
From the picture above, we can see `self_critique` actually used more tokens than `generate`, so why does the evaluation result show more tokens for `openai/gpt-3.5-turbo`, which handles `generate`?
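For reference while these questions get answered, here is a minimal sketch of how the `Expected Output` is typically wired up in Inspect AI: it becomes the sample's `target`, which only the scorer reads; the `generate` and `self_critique` solvers never look at it. This reflects Inspect AI's standard `Sample`/scorer behavior and is an illustration only, not Langtrace-specific code.

```python
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, self_critique

# A dataset row maps onto a Sample: `Input` -> input, `Expected Output` -> target.
sample = Sample(
    input="Does RTP NC US have an IBM office?",
    target="Yes",
)

# The solvers only operate on the input/conversation; they do not read `target`.
plan = [generate(), self_critique(model="openai/gpt-4o")]

# The scorer is what consumes `target`: model_graded_fact() asks a grader model
# whether the final answer contains the factual content of `target`, and the
# per-sample grades are aggregated into the accuracy metric shown in the report.
scorer = model_graded_fact()
```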
@gyliu513 - Thanks for the detailed question. Yes, we are aware this is a point of confusion. The expected output can be used purely for evaluation. Let me respond to you with some code samples and exact procedure to follow in a bit.
I did some testing of https://docs.langtrace.ai/features/evaluations and found some problems.
`Input` and `Expected Output` are required when creating a dataset. I created dataset `cm4lrz7tq00075jmgkdtlq6w4`. The answer to `Does RTP NC US has an IBM office` should be `Yes`. This is a bad dataset. I then created `example_eval.py` and ran it as shown in the terminal session at the top of this issue, which also shows the output.
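The actual `example_eval.py` is not included in the text above. For reference, an eval following the Langtrace evaluations docs typically looks roughly like the sketch below: the dataset id is the one from the run above, and the `generate` + `self_critique` plan and `model_graded_fact` scorer match what the run reports; everything else (including whether the argument is named `plan` or `solver`, which differs across `inspect_ai` versions) is an assumption.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, self_critique


@task
def example_eval():
    return Task(
        # The dataset (Input / Expected Output pairs) is fetched from Langtrace
        # through the langtracefs:// filesystem registered by the langtrace SDK.
        dataset=csv_dataset("langtracefs://cm4lrz7tq00075jmgkdtlq6w4"),
        # The --model passed to `inspect eval` drives generate(), while
        # self_critique() is pinned to a separate critique model here.
        plan=[generate(), self_critique(model="openai/gpt-4o")],
        # model_graded_fact() grades the final answer against the expected output.
        scorer=model_graded_fact(),
    )
```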
What does the `model_graded_fact` result mean? From my test it is `I`; what does `I` mean?
Thanks!
@karthikscale3 ^^
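For reference, Inspect AI's model-graded scorers emit single-letter grades, which the framework exposes as constants; a quick way to check them (this assumes the standard `inspect_ai.scorer` module and is independent of Langtrace):

```python
from inspect_ai.scorer import CORRECT, INCORRECT

# model_graded_fact() grades each sample with a letter: "C" for correct,
# "I" for incorrect. Accuracy is the fraction of samples graded "C".
print(CORRECT, INCORRECT)  # -> C I
```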