
Chinese test results for chatglm3-6b-32k are far below the benchmark in the README #59

Closed
Strivin0311 opened this issue Mar 7, 2024 · 5 comments


Strivin0311 commented Mar 7, 2024

I tested chatglm3-6b-32k on the five Chinese tasks in LongBench, using the default model-loading code and the default generation_config parameters, and I also tried greedy-search parameters, but the results are far below the benchmark recorded in the README (scores below). May I ask what generation_config you used when evaluating?

| task | my score (default sampling) | my score (greedy search) | benchmark score |
|---|---|---|---|
| vcsum | 0.165 | 0.167 | 0.178 |
| multifieldqa_zh | 0.537 | 0.545 | 0.623 |
| dureader | 0.388 | 0.415 | 0.448 |
| lsht | 0.181 | 0.281 | 0.420 |
| passage_retrieval_zh | 0.400 | 0.345 | 0.940 |
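For reference, a minimal sketch of the two decoding setups being compared, assuming a Hugging Face transformers-style `model.generate` API (the exact LongBench harness arguments may differ, and the sampling values here are illustrative placeholders, not the official settings):

```python
# Hypothetical decoding configurations for comparison; not the official
# LongBench settings. With transformers-style generation, greedy search
# is selected by do_sample=False, which makes top_p/temperature moot.

# Default sampling (placeholder values for illustration only)
sampling_kwargs = {
    "do_sample": True,
    "top_p": 0.8,
    "temperature": 0.8,
    "max_new_tokens": 128,
}

# Greedy search
greedy_kwargs = {
    "do_sample": False,
    "num_beams": 1,
    "max_new_tokens": 128,
}

# outputs = model.generate(**inputs, **greedy_kwargs)  # model/inputs omitted here
print(greedy_kwargs["do_sample"])  # False
```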
@Strivin0311 (Author)
After switching to the new version of the code, my scores match the official ones. The problem was in the build_chat step for chatglm3.
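For context, the evaluation harness wraps each raw task prompt in the model's chat format before generation; feeding chatglm3-6b-32k the raw prompt instead of its chat-formatted version can tank the scores. A rough sketch of the idea, where the `<|user|>`/`<|assistant|>` markers only approximate ChatGLM3's template (in practice these are special tokens inserted by the model's own tokenizer, e.g. via its `build_chat_input` helper, so treat this string template as illustrative):

```python
def build_chat(prompt: str, model_name: str) -> str:
    """Wrap a raw task prompt in the model's expected chat format.

    The "<|user|>"/"<|assistant|>" markers approximate ChatGLM3's
    conversation template; the real harness relies on the tokenizer's
    own chat-input builder rather than string formatting.
    """
    if "chatglm3" in model_name:
        return f"<|user|>\n{prompt}<|assistant|>\n"
    # Models without a chat template receive the raw prompt unchanged.
    return prompt

wrapped = build_chat("Summarize the meeting transcript.", "chatglm3-6b-32k")
print(wrapped.startswith("<|user|>"))  # True
```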


bys0318 (Member) commented Mar 7, 2024

Yes, that's right.


BeautyCJ commented Jun 18, 2024

May I ask how each model was decoded for the officially released benchmark? Greedy search (top_p=0, temperature=1)? @bys0318

@BeautyCJ

> After switching to the new version of the code, my scores match the official ones. The problem was in the build_chat step for chatglm3.

Was greedy search decoding used here? Is the difference large if you run with the generation_config parameters instead?


bys0318 (Member) commented Jun 18, 2024

> May I ask how each model was decoded for the officially released benchmark? Greedy search (top_p=1, temperature=1)? @bys0318

Yes.
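A note on the top_p value in the question: with a transformers-style `generate`, setting `do_sample=False` selects greedy search, and `top_p`/`temperature` are then ignored, so the exact top_p value makes no difference. Per step, greedy search simply takes the argmax token, as this small self-contained sketch shows:

```python
def greedy_next_token(logits):
    """Greedy search step: pick the single highest-scoring token id.

    No top_p truncation or temperature scaling is applied, which is
    why those values are irrelevant once sampling is disabled.
    """
    best_id, best_score = 0, float("-inf")
    for token_id, score in enumerate(logits):
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id

print(greedy_next_token([0.1, 2.5, -1.0, 2.4]))  # 1
```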
