Firstly, thanks for the insightful paper and code! 💯
I am carefully going through the code as it's closely related to one of my current use cases. I have a few questions about some details inside `is_better_than_prob`, and I hope we can have a discussion. 😸
I'm a bit confused about `prob_C`. The prompt is simply asking the LLM to output A or B, so why would we expect it to give small probabilities to C? Is it just because C is the next byte after B?
What exactly does `prob_C` mean? Does it act as a "tie" between A and B?
Regarding the calibration part, as I understand it: `compare_result.prob_A` is the probability of selecting `id1` when it's placed first, while `compare_result_reverse.prob_B` is the probability of selecting `id1` when it's placed second. Shouldn't we calculate `raw_prob_A` as `compare_result.prob_A + compare_result_reverse.prob_B`? If `id1` is truly better, we should expect both `compare_result.prob_A` and `compare_result_reverse.prob_B` to be high. If `compare_result.prob_A` is high but `compare_result_reverse.prob_B` turns out to be unexpectedly low, then we should conclude that `id1` is not really better, right? So why does the code use `1 - compare_result_reverse.prob_B`?
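To make the question concrete, here is a small numeric sketch (the numbers are made up, and the variable names just mirror my reading of the fields above, not the repo's actual code):

```python
# Forward order: id1 is shown as option A, id2 as option B.
prob_A_forward = 0.8   # P(model picks id1 when id1 is listed first)

# Reversed order: id2 is shown as option A, id1 as option B.
prob_B_reverse = 0.7   # P(model picks id1 when id1 is listed second)

# If id1 really is better, both numbers should be high, so combining
# them directly seems like the natural calibration:
raw_prob_A = prob_A_forward + prob_B_reverse   # 1.5, i.e. 0.75 on average

# Whereas 1 - prob_B_reverse (= 0.3) is the probability of picking id2
# in the reversed order, which points the opposite way.
```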
Hi @SauceCat , thanks for your interest in this project.
`prob_C` is mostly deprecated in the current repo. It was designed to handle two cases: 1) when the LLM outputs A=B (a tie), and 2) when the output cannot be parsed as A or B. Both cases are now handled by forcing `prob_A = prob_B = 0.5`. If you would like to customize this behavior, that also makes sense.
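For anyone reading along, here is a rough sketch of the kind of fallback being described; the function name and the shape of the logprob input are illustrative assumptions, not the repo's actual API:

```python
import math

def choice_probs(top_logprobs: dict) -> tuple:
    """Convert first-token logprobs into (prob_A, prob_B).

    top_logprobs maps candidate first tokens (e.g. "A", "B", "C") to
    log-probabilities, as returned by a logprobs-enabled completion call.
    """
    prob_A = math.exp(top_logprobs["A"]) if "A" in top_logprobs else 0.0
    prob_B = math.exp(top_logprobs["B"]) if "B" in top_logprobs else 0.0

    # Tie (A=B) or unparseable output: fall back to a 50/50 split,
    # i.e. the prob_A = prob_B = 0.5 behavior described above.
    if prob_A == 0.0 and prob_B == 0.0:
        return 0.5, 0.5

    # Otherwise renormalize over the two valid choices.
    total = prob_A + prob_B
    return prob_A / total, prob_B / total
```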
Thank you for spotting this. The calibrated probability should be `compare_result.prob_A + compare_result_reverse.prob_B`. The calibration is basically averaging the probabilities across the two presentation orders. I guess we introduced this bug when we reorganized the code base. I will fix this ASAP.
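For reference, the fixed calibration would then look roughly like this (illustrative names, not the exact ones in the repo):

```python
def calibrated_prob_id1(compare_result, compare_result_reverse) -> float:
    """Average P(prefer id1) over both presentation orders.

    compare_result:          id1 shown first  -> prob_A is P(pick id1)
    compare_result_reverse:  id1 shown second -> prob_B is P(pick id1)
    """
    return (compare_result.prob_A + compare_result_reverse.prob_B) / 2
```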
I'm also developing an improved version of the PairS-beam algorithm, which will be released within a month or two.
If you would like to discuss anything related to this project, you are very welcome to! Just drop me an email and we can start from there.
That makes sense to me. I think it might be worth trying a more intuitive setting: prompting the model to output A, B, or Tie explicitly. Comparing log probabilities makes it almost impossible to produce A = B, I guess.
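A minimal sketch of that alternative, purely as an assumption of how the prompt and parsing could look:

```python
TIE_PROMPT = (
    "Which response is better, A or B? "
    "If they are equally good, answer Tie. "
    "Answer with exactly one word: A, B, or Tie."
)

def parse_verdict(output: str) -> tuple:
    """Map an explicit A/B/Tie verdict to (prob_A, prob_B)."""
    verdict = output.strip().lower()
    if verdict.startswith("a"):
        return 1.0, 0.0
    if verdict.startswith("b"):
        return 0.0, 1.0
    # "Tie" or anything unparseable -> split evenly, matching the
    # current 0.5/0.5 fallback.
    return 0.5, 0.5
```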
LGTM!
That would be great! In fact, the existing PairS-beam algorithm looks a bit complicated to me. Looking forward to the improved version! 😸
I also read the ZEPO paper as well as the Batch Calibration paper and briefly went through the code. I indeed have some thoughts derived from real-world use cases to share and discuss. Let me draft the email.