Llama2-7B-chat-4k test results come out different #66

Closed
PengWenChen opened this issue Jun 6, 2024 · 2 comments
Comments

@PengWenChen

PengWenChen commented Jun 6, 2024

Reopening issue #55
@bys0318 (@slatter666) Hello~
I tried running Llama2-7B-chat-4k (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), and my results also differ from the scores on your leaderboard, by quite a lot.
My machine has no internet access, so the only difference from your pred.py is that I load the data, model, and tokenizer from local paths.
Could you tell me why the scores differ so much? Thanks.
My scores (using the original seed 42 from pred.py):
{
"narrativeqa": 14.57,
"qasper": 6.6,
"multifieldqa_en": 3.65,
"multifieldqa_zh": 4.29,
"hotpotqa": 4.27,
"2wikimqa": 5.67,
"musique": 1.3,
"dureader": 15.71,
"gov_report": 24.53,
"qmsum": 16.13,
"multi_news": 2.41,
"vcsum": 0.03,
"trec": 68.0,
"triviaqa": 88.59,
"samsum": 41.38,
"lsht": 19.75,
"passage_count": 0.5,
"passage_retrieval_en": 3.0,
"passage_retrieval_zh": 0.0,
"lcc": 66.64,
"repobench-p": 60.06
}

Results copied from the GitHub leaderboard:
{
"narrativeqa": 18.7,
"qasper": 19.2,
"multifieldqa_en": 36.8,
"multifieldqa_zh": 11.9,
"hotpotqa": 25.4,
"2wikimqa": 32.8,
"musique": 9.4,
"dureader": 5.2,
"gov_report": 27.3,
"qmsum": 20.8,
"multi_news": 25.8,
"vcsum": 0.2,
"trec": 61.5,
"triviaqa": 77.8,
"samsum": 40.7,
"lsht": 19.8,
"passage_count": 2.1,
"passage_retrieval_en": 9.8,
"passage_retrieval_zh": 0.5,
"lcc": 52.4,
"repobench-p": 43.8
}
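
To make the gap concrete, the per-task differences can be computed directly from the two score dicts above; a small sketch (the dict names and the truncated entries are just for illustration):

# Compare my run against the leaderboard numbers, task by task.
my_scores = {"narrativeqa": 14.57, "qasper": 6.6, "multifieldqa_en": 3.65}    # truncated
leaderboard = {"narrativeqa": 18.7, "qasper": 19.2, "multifieldqa_en": 36.8}  # truncated
for task, mine in my_scores.items():
    delta = mine - leaderboard[task]
    print(f"{task:20s} mine={mine:6.2f} leaderboard={leaderboard[task]:6.2f} delta={delta:+6.2f}")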

My data was downloaded from the link provided in your README: https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip
I load the data with:
import json
data = [json.loads(line) for line in open(path, "r", encoding="utf-8")]
and load the tokenizer and model with:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
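
For completeness, the seeding is left exactly as in pred.py; a minimal sketch of what I mean by "the original seed 42", assuming the usual Python/NumPy/PyTorch seeding:

import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Pin every RNG source so repeated runs generate the same outputs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)  # same value as the original seed in pred.py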

@BlackieMia

I reproduced similar results to yours as well.

@PengWenChen
Author

Sorry, I just found out that I had accidentally used the Llama2-7B model instead of the Llama2-7B-chat model.
The scores I get with the chat version are:
{
"narrativeqa": 18.82,
"qasper": 23.65,
"multifieldqa_en": 36.52,
"multifieldqa_zh": 10.59,
"hotpotqa": 26.4,
"2wikimqa": 31.85,
"musique": 7.76,
"dureader": 5.2,
"gov_report": 26.56,
"qmsum": 21.28,
"multi_news": 26.3,
"vcsum": 0.18,
"trec": 65.0,
"triviaqa": 83.17,
"samsum": 41.0,
"lsht": 18.75,
"passage_count": 1.57,
"passage_retrieval_en": 7.5,
"passage_retrieval_zh": 9.5,
"lcc": 59.04,
"repobench-p": 52.91
}
I think these are pretty close to the numbers on the leaderboard.
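
In hindsight the gap makes sense: if I read pred.py correctly, prompts for "llama2" models are wrapped in Llama-2's chat instruction format before generation, and only the chat checkpoint was trained on that format. A rough sketch of that wrapping (the helper name and exact template are my paraphrase, not a verbatim copy of pred.py):

def build_chat(prompt: str, model_name: str) -> str:
    # Wrap the raw prompt in Llama-2's [INST] chat format for chat checkpoints.
    # The base Llama-2 model never saw this wrapper during training, which is
    # presumably why feeding it the same wrapped prompts hurts its scores.
    if "llama2" in model_name:
        prompt = f"[INST]{prompt}[/INST]"
    return prompt

wrapped = build_chat("Summarize the following document: ...", "llama2-7b-chat-4k")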
