HumanEval benchmarks? #43
Yeah, a code model fine-tuned from existing ones would be great :)
Who generated the table of benchmarks posted in the README? I'd like to know whether there's an easy way to evaluate new benchmarks, or whether whoever generated them would be willing to run one more. I just want to see the performance of the non-fine-tuned RNN. If that's too much to ask, that's fine.
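For anyone who wants to run HumanEval themselves, here's a minimal sketch using OpenAI's human-eval harness (https://github.com/openai/human-eval); `generate` is a hypothetical stand-in for whatever checkpoint you want to score, e.g. the non-fine-tuned RNN:

```python
# Sketch: score a model on HumanEval with OpenAI's human-eval harness.
from human_eval.data import read_problems, write_jsonl

def generate(prompt: str) -> str:
    """Hypothetical: return a code completion for `prompt` from your model."""
    raise NotImplementedError

problems = read_problems()  # task_id -> {"prompt", "test", "entry_point", ...}
samples = [
    dict(task_id=task_id, completion=generate(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Then compute pass@k from the shell:
#   evaluate_functional_correctness samples.jsonl
```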
IIUC the Alpaca repo doesn't contain the model (since LLaMA isn't openly available except via the leak), but it does contain the code to fine-tune LLaMA so its output resembles ChatGPT's. Eliezer Yudkowsky seems enthusiastic about it. Is it planned to use the Alpaca data to fine-tune RWKV-LM? That said, the GPT-4 paper also revealed that instruction fine-tuning doesn't improve the model's performance on exams.
Thanks for that insight. I really just want GPT-4 to leak so I can run it on my personal computer. Someone reproduced Stanford's method and open-sourced the weights as Alpaca-LoRA, and people have theorized ways to fine-tune LLaMA-30B after 4-bit quantization (I have a 32 GB GPU). The quantized model is posted here. It seems useful as the primary "reasoning model" in a model ensemble, much better than 7B; it could then delegate other processes to specialized models, like an AGI controlling ANIs. I'm not completely sure what RWKV's strong suits are or where it might stand in the ensemble; I'm just researching all the options to make the most informed choice. For reference, I'm trying to build an AI-powered transpiler that converts massive CUDA code bases to Metal.
No idea yet. I need to see RWKV's out-of-the-box performance before I know whether it's worthwhile.
Thanks a lot for the info! I didn't know about Alpaca-LoRA; more is going on around LLaMA than I realized! AFAIK, being an RNN, RWKV is less resource-intensive than transformers. And although llama.cpp and nolano.org already make LLaMA run on consumer hardware, 4-bit/8-bit inference is now available for RWKV too at https://bellard.org/ts_server/, so I guess RWKV keeps this advantage.
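For intuition on where the savings come from, here's a toy sketch of 8-bit weight storage. Real 4-bit/8-bit backends (ts_server, llama.cpp) use per-block scales and fused kernels rather than this naive per-tensor version, but the memory arithmetic is the same: one byte per weight instead of four.

```python
# Toy symmetric int8 quantization: store int8 weights plus one scale,
# dequantize on the fly. Illustrates the 4x VRAM saving vs float32.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0          # per-tensor scale (real systems: per-block)
    q = np.round(w / scale).astype(np.int8)  # 1 byte/weight instead of 4
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes / q.nbytes)                 # 4.0x smaller
print(np.abs(w - dequantize(q, s)).max())  # small round-trip error
```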
I was actually asking the repo owner, since he's been training an instruction-fine-tuned model at https://github.com/BlinkDL/ChatRWKV, so it's natural to wonder whether he plans to use this new dataset :)
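For concreteness, the Alpaca release is just instruction/input/output JSON records, so feeding it to any fine-tune is mostly a formatting step. A rough sketch of flattening the records into training text, loosely following Stanford's prompt template (whether ChatRWKV would use this exact template is an open question):

```python
# Sketch: turn Alpaca-style records into plain training text.
import json

TEMPLATE = (
    "Below is an instruction that describes a task{ctx}. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n{input_block}### Response:\n{output}"
)

def format_record(rec: dict) -> str:
    has_input = bool(rec.get("input"))
    return TEMPLATE.format(
        ctx=", paired with an input that provides further context" if has_input else "",
        instruction=rec["instruction"],
        input_block=f"### Input:\n{rec['input']}\n\n" if has_input else "",
        output=rec["output"],
    )

with open("alpaca_data.json") as f:  # filename as in the Alpaca repo
    texts = [format_record(r) for r in json.load(f)]
```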
Regarding orders of magnitude, here are the resources required to train each of the high-performing codegen models. @BlinkDL, can your RNN absorb >100B tokens of knowledge on consumer hardware in a single GPU-day?
RWKV needs fewer FLOPs than GPT in both training and inference, so it's simply faster and saves VRAM.
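A toy illustration of the asymptotics (not RWKV's actual time-mix/channel-mix math): an RNN updates a fixed-size state per token, while a transformer attends over its whole growing key/value cache.

```python
# Sketch: constant per-token cost (RNN) vs. context-length-dependent
# cost (attention). Toy numbers only.
import numpy as np

d, T = 1024, 2048
Wx, Wh = np.random.randn(d, d) * 0.01, np.random.randn(d, d) * 0.01

state = np.zeros(d)
def rnn_step(x):
    """RNN-style: O(d^2) per token, O(d) state, regardless of position."""
    global state
    state = np.tanh(Wx @ x + Wh @ state)
    return state

K, V = np.random.randn(T, d), np.random.randn(T, d)
def attn_step(q, t):
    """Transformer-style: attend to all t cached tokens, O(t*d) per token."""
    scores = K[:t + 1] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V[:t + 1]

x = np.random.randn(d)
for t in range(3):
    h = rnn_step(x)      # same cost at t=0 and t=2047
    y = attn_step(x, t)  # cost grows with t
```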
https://twitter.com/piesposi_to/status/1636780485597708290 seems useful for fine-tuning ChatRWKV on Chinese instruction following.
FYI, there's this recent work that achieves 88% pass@1 on HumanEval.
That's nice to hear, but Reflexion is only worth the effort if you're using it to complement GPT-4. It doesn't close the gap between LLaMA and GPT-4, though it might close the gap with GPT-3.5. The core issue is that GPT-4 has advanced reasoning capabilities limited only by its narrow-context, linear-thought architecture, and Reflexion partly compensates for that architectural limitation; other LLMs don't have nearly enough reasoning capability to begin with. Meanwhile, I've realized I can accomplish my original goal with a different approach. Bing's GPT-4 can plan out how to tackle a hard programming task, removing the psychological barrier to doing the task myself. That has helped me revive tedious projects like FP64 emulation and could very well let me port medium-sized CUDA code bases (e.g. oxDNA). The AI doesn't need to scan all the code files, which was my original idea with a locally hosted LLM. We live in a very different world than one month ago…
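For reference, the Reflexion-style loop under discussion is roughly: generate code, run the unit tests, feed the failure back as a self-reflection, retry. A hedged sketch, with `llm` as a hypothetical completion function over whatever strong model you pair it with:

```python
# Sketch of a Reflexion-style self-repair loop (simplified from the paper).
import traceback

def llm(prompt: str) -> str:
    """Hypothetical call into GPT-4 or another strong code model."""
    raise NotImplementedError

def reflexion(task: str, tests: str, max_tries: int = 4) -> str:
    memory = ""  # accumulated self-reflections across attempts
    for _ in range(max_tries):
        code = llm(f"{task}\n{memory}\nWrite the solution:")
        try:
            exec(code + "\n" + tests, {})  # assumes tests are assert statements
            return code                    # all tests passed
        except Exception:
            failure = traceback.format_exc()
            memory += "\n" + llm(
                f"The code:\n{code}\nfailed with:\n{failure}\n"
                "Reflect briefly on what to fix in the next attempt:"
            )
    return code  # best effort after max_tries
```

The loop only pays off when the underlying model can actually reason about the failure, which is the point above: it amplifies GPT-4-class reasoning rather than substituting for it.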
Hi, I was wondering whether this model can achieve GPT-4-level performance on the HumanEval benchmark, a proxy for effectiveness at code generation. I'm fine with training or transfer learning, but I only have a single GPU; massive transformers require far too much compute. I'd also like your opinion on how well the model might work in tandem with InCoder and LLaMA/Alpaca in a model ensemble, sort of like Stable Diffusion, which has multiple models that each specialize in a different task. Thanks!