HumanEval benchmarks? #43

Open
philipturner opened this issue Mar 16, 2023 · 10 comments

Comments

@philipturner

Hi, I was wondering whether this model can achieve GPT-4 level performance on the HumanEval benchmark, a proxy for effectiveness at code generation. I'm fine if I have to train or transfer-learn, but I only have a single GPU. Massive transformers require way too much compute. I'd also like your opinion on how well the model might work in tandem with InCoder and LLaMa/Alpaca in a model ensemble. Sort of like Stable Diffusion, which has multiple different models that each specialize in a different task. Thanks!

@BlinkDL
Owner

BlinkDL commented Mar 17, 2023

Yeah a code model finetuned from existing ones will be great :)

@philipturner
Author

philipturner commented Mar 17, 2023

Who generated the table of benchmarks posted on the README? I'd like to know if there's an easy way to evaluate new benchmarks, or if whoever generated those benchmarks is willing to run one extra. I just want to see performance of the non-fine-tuned RNN. If this is too much to ask, that's fine.
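For reference, HumanEval results such as HumanEval@1 and HumanEval@100 are conventionally reported as pass@k: the probability that at least one of k sampled completions passes a problem's unit tests. Below is a minimal sketch of the usual unbiased estimator, 1 - C(n-c, k)/C(n, k). It assumes you already have, for each problem, the number of samples drawn (`n`) and the number that passed (`c`). The function names are hypothetical and not part of any RWKV tooling.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no pass.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; `results` holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical per-problem counts, for illustration only (not real model results).
example = [(10, 5), (10, 0), (10, 10)]
print(benchmark_pass_at_k(example, k=1))
```

Generating the samples and running the unit tests in a sandbox is the expensive part; the scoring itself is just this arithmetic.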

@alreadydone

IIUC the Alpaca repo doesn't contain the model (because LLaMA isn't openly available except for the leak) but contains the code to fine-tune to make output resemble ChatGPT. Eliezer Yudkowsky seems to be enthusiastic about this. Is it planned to use Alpaca data to fine-tune RWKV-LM? However, the GPT-4 paper also revealed that instruction fine-tuning doesn't improve the model's performance on exams:

Note that the model’s capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it). But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions.

@philipturner
Author

Thanks for that insight. I really just want to have GPT-4 leak so I can run it on my personal computer. Someone reproduced Stanford's methods and open-sourced the weights with Alpaca-LoRA, and people theorized ways to fine-tune LLaMa-30B after 4-bit quantization (I have a 32 GB GPU). The quantized model is posted here. It seems useful as a primary "reasoning model" in a model ensemble, much better than 7B. It could then delegate other processes to specialized models, like an AGI controlling ANI.

I'm not completely sure what RWKV's strong suits are, or where it might stand in the ensemble. I'm just researching all the options to make the most informed choice. For reference, I'm trying to make an AI-powered transpiler that converts massive CUDA code bases to Metal.

> Is it planned to use Alpaca data to fine-tune RWKV-LM?

No idea yet. I need to see RWKV's out-of-the-box performance before knowing whether it's worthwhile.
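The arithmetic behind fitting a 4-bit LLaMa-30B on a 32 GB GPU can be sketched as follows. This is a rough estimate that counts weight storage only; it ignores activations, the attention cache, and the per-group scale factors that quantization schemes store, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

# 30 billion parameters at common precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(30e9, bits):.0f} GB")
```

At 4 bits the weights come to about 15 GB, which leaves headroom on a 32 GB device; at 16 bits they need about 60 GB and do not fit.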

@alreadydone

alreadydone commented Mar 18, 2023

Thanks a lot for the info! I didn't know about Alpaca-LoRA. More is going on around LLaMA than I realized! AFAIK, being an RNN, RWKV is less resource-intensive than transformers; although llama.cpp and nolano.org make LLaMA run on consumer hardware, 4-bit/8-bit inference is also now available for RWKV at https://bellard.org/ts_server/, so I guess RWKV keeps this advantage.

> Is it planned to use Alpaca data to fine-tune RWKV-LM?

I was actually asking the repo owner, since he's been training an instruction-finetuned model at https://github.com/BlinkDL/ChatRWKV, so it's natural to wonder if he plans to use this new dataset :)

@philipturner
Author

philipturner commented Mar 18, 2023

Regarding orders of magnitude, here are the resources required to train each of the high-performing codegen models. @BlinkDL can your RNN absorb >100B tokens of knowledge on consumer hardware in a single GPU-day?

| Model | Parameters | HumanEval@1 | HumanEval@100 | MBPP@1 | Training Tokens |
| --- | --- | --- | --- | --- | --- |
| GPT-4 family | ~350B-2.5T | 67.0% | ~98% | - | ~22-400 trillion |
| GPT-3.5 | 175B | 48.1% | ~92% | - | ~8-20 trillion |
| LLaMa-30B | 30B | 21.7% | 70.7% | 30.2% | 1.4 trillion |
| InCoder-6.7B | 6.7B | 15.2% | 47.0% | 19.4% | ~60 billion |
| RWKV-LM | 14B | - | - | - | - |

| Model | GPUs | GPU-Hours | GPU Type | per-GPU BW | sqrt(arithmetic intensity) |
| --- | --- | --- | --- | --- | --- |
| GPT-4 family | 10000 | 5-90 million | A100 | 2039 GB/s | 12.37 |
| GPT-3.5 | 10000 | 5-90 million | A100 | 2039 GB/s | 12.37 |
| LLaMa-30B | 2048 | 530,432 | A100 | 2039 GB/s | 12.37 |
| InCoder-6.7B | 248 | 142,848 | V100 | 900 GB/s | 11.79 |
| Me | 1 | 24 | M1 Max | 400 GB/s | 4.59 |
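To make the orders of magnitude concrete, here is a small sketch that works only from the rows with exact figures in the tables above. It assumes that "arithmetic intensity" means peak FLOP/s divided by memory bandwidth in bytes/s; that reading is an interpretation, not something stated in the thread. The helper names are hypothetical.

```python
# Rows copied from the tables above:
# (GPUs, GPU-hours, per-GPU bandwidth in GB/s, sqrt(arithmetic intensity)).
ROWS = {
    "LLaMa-30B": (2048, 530_432, 2039, 12.37),
    "InCoder-6.7B": (248, 142_848, 900, 11.79),
    "Me": (1, 24, 400, 4.59),
}

def wall_clock_days(gpus: int, gpu_hours: float) -> float:
    """Days of training if all GPUs run in parallel for the whole job."""
    return gpu_hours / gpus / 24

def implied_peak_tflops(bandwidth_gbs: float, sqrt_intensity: float) -> float:
    """Peak TFLOP/s implied by a row, ASSUMING intensity = (FLOP/s) / (bytes/s)."""
    return sqrt_intensity ** 2 * bandwidth_gbs / 1000

budget = ROWS["Me"][1]  # the single GPU-day: 24 GPU-hours
for name, (gpus, gpu_hours, bw, sqrt_ai) in ROWS.items():
    print(f"{name}: {wall_clock_days(gpus, gpu_hours):.1f} days wall-clock, "
          f"{gpu_hours / budget:,.0f}x the single-GPU-day budget, "
          f"~{implied_peak_tflops(bw, sqrt_ai):.1f} TFLOP/s implied peak")
```

Under that assumption, the InCoder row corresponds to 24 days on 248 GPUs, and the LLaMa-30B run used roughly 22,000 times the GPU-hours of a single GPU-day.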

@BlinkDL
Owner

BlinkDL commented Mar 18, 2023

RWKV needs fewer FLOPs than GPT in both training and inference, so it's simply faster and saves VRAM.
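One way to picture the VRAM part of this claim is to compare the inference-time memory that grows with context length. A standard transformer keeps a key and a value vector per layer for every past token, while a recurrent model carries a fixed-size state. The sketch below is illustrative only: the layer count, width, and `vectors_per_layer` are assumed values, not RWKV's documented configuration, and the transformer formula assumes plain multi-head attention with 16-bit values.

```python
def kv_cache_bytes(n_layers: int, d_model: int, n_tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Transformer KV cache: one key and one value vector per layer per token."""
    return 2 * n_layers * d_model * n_tokens * bytes_per_value

def recurrent_state_bytes(n_layers: int, d_model: int,
                          vectors_per_layer: int = 5,  # assumed, illustrative
                          bytes_per_value: int = 2) -> int:
    """Recurrent state: a fixed number of vectors per layer, independent of context."""
    return n_layers * vectors_per_layer * d_model * bytes_per_value

# Hypothetical model shape, for illustration only.
layers, width = 40, 5120
for tokens in (1024, 8192):
    kv_mb = kv_cache_bytes(layers, width, tokens) / 1e6
    state_mb = recurrent_state_bytes(layers, width) / 1e6
    print(f"{tokens} tokens: KV cache {kv_mb:.0f} MB vs recurrent state {state_mb:.1f} MB")
```

The cache grows linearly with the number of tokens, while the recurrent state stays constant. Weight memory is the same order of magnitude for both architectures at equal parameter count, so this difference matters most at long context lengths.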

@alreadydone

alreadydone commented Mar 18, 2023

https://twitter.com/piesposi_to/status/1636780485597708290 seems useful to finetune ChatRWKV for Chinese instruction following.

@alreadydone

FYI there's this recent work that achieves 88% at HumanEval@1.

@philipturner
Author

That’s nice to hear, but Reflexion is only worth the effort if you’re using it to complement GPT-4. It can’t close the gap between LLaMA and GPT-4, though it might close the gap between GPT-3.5 and GPT-4. The core issue is that GPT-4 has advanced reasoning capabilities limited only by its narrow-context, linear-thought architecture. Reflexion sort of compensates for that architectural limitation. Other LLMs don’t have nearly enough reasoning capability to begin with.

Meanwhile, I’ve realized I can accomplish my original goal with a different approach. Bing GPT-4 can plan out how to tackle a hard programming task, removing the psychological barrier to me doing the task myself. That’s helped me revive tedious projects like FP64 emulation and could very well let me port medium-sized CUDA code bases (e.g. oxDNA). The AI does not need to scan all the code files, which was my original idea with a locally hosted LLM.

We live in a very different world than one month ago…


3 participants