Llama2-7B-chat-4k test results come out different #66

Closed
PengWenChen opened this issue Jun 6, 2024 · 2 comments
Comments

@PengWenChen

PengWenChen commented Jun 6, 2024

Reopening issue #55
@bys0318 (@slatter666) Hello~
I tried running Llama2-7B-chat-4k (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), and my results also differ from the scores on your leaderboard, by quite a lot.
My machine has no internet access, so the only difference from your pred.py is that I load the data, model, and tokenizer from local paths.
Could you tell me why the scores differ so much? Thanks.
My scores (using the original seed 42 from pred.py):
{
"narrativeqa": 14.57,
"qasper": 6.6,
"multifieldqa_en": 3.65,
"multifieldqa_zh": 4.29,
"hotpotqa": 4.27,
"2wikimqa": 5.67,
"musique": 1.3,
"dureader": 15.71,
"gov_report": 24.53,
"qmsum": 16.13,
"multi_news": 2.41,
"vcsum": 0.03,
"trec": 68.0,
"triviaqa": 88.59,
"samsum": 41.38,
"lsht": 19.75,
"passage_count": 0.5,
"passage_retrieval_en": 3.0,
"passage_retrieval_zh": 0.0,
"lcc": 66.64,
"repobench-p": 60.06
}

Results copied from the GitHub leaderboard:
{
"narrativeqa": 18.7,
"qasper": 19.2,
"multifieldqa_en": 36.8,
"multifieldqa_zh": 11.9,
"hotpotqa": 25.4,
"2wikimqa": 32.8,
"musique": 9.4,
"dureader": 5.2,
"gov_report": 27.3,
"qmsum": 20.8,
"multi_news": 25.8,
"vcsum": 0.2,
"trec": 61.5,
"triviaqa": 77.8,
"samsum": 40.7,
"lsht": 19.8,
"passage_count": 2.1,
"passage_retrieval_en": 9.8,
"passage_retrieval_zh": 0.5,
"lcc": 52.4,
"repobench-p": 43.8
}
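
To make the gap concrete, the per-task differences can be computed directly from the two score dicts above; a small sketch (the dict names and the truncated entries are just for illustration):

# Compare my run against the leaderboard numbers, task by task.
my_scores = {"narrativeqa": 14.57, "qasper": 6.6, "multifieldqa_en": 3.65}    # truncated
leaderboard = {"narrativeqa": 18.7, "qasper": 19.2, "multifieldqa_en": 36.8}  # truncated
for task, mine in my_scores.items():
    delta = mine - leaderboard[task]
    print(f"{task:20s} mine={mine:6.2f} leaderboard={leaderboard[task]:6.2f} delta={delta:+6.2f}")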

My data was downloaded from the link provided in your README: https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip
I load the data with:
import json
data = [json.loads(line) for line in open(path, "r", encoding="utf-8")]
and load the tokenizer and model with:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
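
For completeness, the seeding is left exactly as in pred.py; a minimal sketch of what I mean by "the original seed 42", assuming the usual Python/NumPy/PyTorch seeding:

import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Pin every RNG source so repeated runs generate the same outputs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)  # same value as the original seed in pred.py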

@BlackieMia

I reproduced similar results to yours as well.

@PengWenChen
Author

Sorry, I just found out that I had accidentally used the Llama2-7B model instead of the Llama2-7B-chat model.
The scores I get with the chat version are:
{
"narrativeqa": 18.82,
"qasper": 23.65,
"multifieldqa_en": 36.52,
"multifieldqa_zh": 10.59,
"hotpotqa": 26.4,
"2wikimqa": 31.85,
"musique": 7.76,
"dureader": 5.2,
"gov_report": 26.56,
"qmsum": 21.28,
"multi_news": 26.3,
"vcsum": 0.18,
"trec": 65.0,
"triviaqa": 83.17,
"samsum": 41.0,
"lsht": 18.75,
"passage_count": 1.57,
"passage_retrieval_en": 7.5,
"passage_retrieval_zh": 9.5,
"lcc": 59.04,
"repobench-p": 52.91
}
I think these are pretty close to the numbers on the leaderboard.
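
In hindsight the gap makes sense: if I read pred.py correctly, prompts for "llama2" models are wrapped in Llama-2's chat instruction format before generation, and only the chat checkpoint was trained on that format. A rough sketch of that wrapping (the helper name and exact template are my paraphrase, not a verbatim copy of pred.py):

def build_chat(prompt: str, model_name: str) -> str:
    # Wrap the raw prompt in Llama-2's [INST] chat format for chat checkpoints.
    # The base Llama-2 model never saw this wrapper during training, which is
    # presumably why feeding it the same wrapped prompts hurts its scores.
    if "llama2" in model_name:
        prompt = f"[INST]{prompt}[/INST]"
    return prompt

wrapped = build_chat("Summarize the following document: ...", "llama2-7b-chat-4k")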
