
Chinese test results for chatglm3-6b-32k are far below the benchmark in the README #59

Closed
Strivin0311 opened this issue Mar 7, 2024 · 5 comments


Strivin0311 commented Mar 7, 2024

I tested chatglm3-6b-32k on the five Chinese tasks in LongBench, using the default model-loading code and the default generation_config parameters, and I also tried greedy-search parameters, but the results are far below the benchmark recorded in the README (scores below). May I ask what generation_config you used when evaluating?

| task | my score (default sampling) | my score (greedy search) | benchmark score |
|---|---|---|---|
| vcsum | 0.165 | 0.167 | 0.178 |
| multifieldqa_zh | 0.537 | 0.545 | 0.623 |
| dureader | 0.388 | 0.415 | 0.448 |
| lsht | 0.181 | 0.281 | 0.420 |
| passage_retrieval_zh | 0.400 | 0.345 | 0.940 |
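For reference, a minimal sketch of the two decoding setups being compared, assuming a Hugging Face transformers-style `model.generate` API (the exact LongBench harness arguments may differ, and the sampling values here are illustrative placeholders, not the official settings):

```python
# Hypothetical decoding configurations for comparison; not the official
# LongBench settings. With transformers-style generation, greedy search
# is selected by do_sample=False, which makes top_p/temperature moot.

# Default sampling (placeholder values for illustration only)
sampling_kwargs = {
    "do_sample": True,
    "top_p": 0.8,
    "temperature": 0.8,
    "max_new_tokens": 128,
}

# Greedy search
greedy_kwargs = {
    "do_sample": False,
    "num_beams": 1,
    "max_new_tokens": 128,
}

# outputs = model.generate(**inputs, **greedy_kwargs)  # model/inputs omitted here
print(greedy_kwargs["do_sample"])  # False
```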
@Strivin0311 (Author)
After switching to the new version of the code, my scores match the official ones. The problem was in the build_chat step for chatglm3.
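For context, the evaluation harness wraps each raw task prompt in the model's chat format before generation; feeding chatglm3-6b-32k the raw prompt instead of its chat-formatted version can tank the scores. A rough sketch of the idea, where the `<|user|>`/`<|assistant|>` markers only approximate ChatGLM3's template (in practice these are special tokens inserted by the model's own tokenizer, e.g. via its `build_chat_input` helper, so treat this string template as illustrative):

```python
def build_chat(prompt: str, model_name: str) -> str:
    """Wrap a raw task prompt in the model's expected chat format.

    The "<|user|>"/"<|assistant|>" markers approximate ChatGLM3's
    conversation template; the real harness relies on the tokenizer's
    own chat-input builder rather than string formatting.
    """
    if "chatglm3" in model_name:
        return f"<|user|>\n{prompt}<|assistant|>\n"
    # Models without a chat template receive the raw prompt unchanged.
    return prompt

wrapped = build_chat("Summarize the meeting transcript.", "chatglm3-6b-32k")
print(wrapped.startswith("<|user|>"))  # True
```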


bys0318 (Member) commented Mar 7, 2024

Yes, that's right.


BeautyCJ commented Jun 18, 2024

May I ask how each model was decoded for the officially released benchmark? Greedy search (top_p=0, temperature=1)? @bys0318

@BeautyCJ

> After switching to the new version of the code, my scores match the official ones. The problem was in the build_chat step for chatglm3.

Was greedy search decoding used here? Is the difference large if you run with the generation_config parameters instead?


bys0318 (Member) commented Jun 18, 2024

> May I ask how each model was decoded for the officially released benchmark? Greedy search (top_p=1, temperature=1)? @bys0318

Yes.
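A note on the top_p value in the question: with a transformers-style `generate`, setting `do_sample=False` selects greedy search, and `top_p`/`temperature` are then ignored, so the exact top_p value makes no difference. Per step, greedy search simply takes the argmax token, as this small self-contained sketch shows:

```python
def greedy_next_token(logits):
    """Greedy search step: pick the single highest-scoring token id.

    No top_p truncation or temperature scaling is applied, which is
    why those values are irrelevant once sampling is disabled.
    """
    best_id, best_score = 0, float("-inf")
    for token_id, score in enumerate(logits):
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id

print(greedy_next_token([0.1, 2.5, -1.0, 2.4]))  # 1
```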
