
Small question about eval dataset #2

Closed
KevinZhoutianyi opened this issue Aug 30, 2024 · 3 comments

Comments

@KevinZhoutianyi

Great work!
A small question here: I was wondering whether your eval script tests on the training data as well?
For pretraining, you split the dataset into train and test sets, and you use the test data to get the validation loss.

You then generate another dataset with different number lengths and use arithmetic_eval_quicker.py to evaluate accuracy.
Isn't there an overlap between your train and test data for arithmetic_eval_quicker.py?

@mcleish7
Owner

Thank you for your interest and reading our code so diligently!

> I was wondering if your eval script tests on the train data as well?

No, unfortunately we didn't; we only ran evaluation on the test sets.

> Isn't there an overlap between your test and train for arithmetic_eval_quicker.py?

Yes, there is a tiny overlap when the sample space is not large enough to sample the training and test data disjointly, but this is only true for very small numbers, which make up a tiny fraction of our evaluation. To be precise, less than 0.078% of the test data appears in the 20x20 training data, and this overlap occurs only for small-digit additions.

The high-level reasons:

  1. The sample space here is huge: for an addition of lengths n and m, we have $10^{n+m}$ possible samples, which becomes very big very fast.
  2. The evaluation and training data are generated in slightly different ways. For the training data, we generate a single sample of length (n,m), then a single sample of length (n,m+1), and so on. For the evaluation data, we generate 100 samples of length (n,m) one after another, meaning our randomness is sampled differently; so we (more often than not) end up in a different place in the sample space than where we sampled the training data.
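
To illustrate point 1, here is a minimal sketch (not the repo's code; it assumes operands are drawn uniformly and independently) of a birthday-style estimate for how likely a train/test collision is in a sample space of $10^{n+m}$:

```python
# Assumption: uniform, independent sampling of addition problems.
# For operand lengths n and m there are roughly 10**(n + m) possible
# problems, so the expected number of collisions between N_train
# training samples and N_test test samples is about N_train * N_test / 10**(n+m).

def sample_space(n: int, m: int) -> int:
    """Approximate number of distinct addition problems with n- and m-digit operands."""
    return 10 ** (n + m)

def expected_collisions(n: int, m: int, n_train: int, n_test: int) -> float:
    """Birthday-style estimate of expected train/test overlap."""
    return n_train * n_test / sample_space(n, m)

# The space grows very fast: with 1+1 digits (only ~100 problems) overlap
# is nearly unavoidable, while by 10+10 digits it is vanishingly unlikely.
for n, m in [(1, 1), (2, 2), (3, 3), (10, 10)]:
    print((n, m), expected_collisions(n, m, n_train=100, n_test=100))
```

This is why the overlap concentrates entirely in the small-digit corner of the grid.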

We checked this to be sure. As you can see, in the areas where the sample space is too small we do have overlap (this is often because the training data becomes close to exhaustive there). I think we can safely say the overlap is tiny when combined with the extensive evaluation completed outside of the training data and the strong performance shown by the models on extreme out-of-distribution data.
[Figure: train_test_overlap_add_rev]
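For anyone wanting to reproduce this kind of check, a minimal sketch (hypothetical helper names, not the repo's actual script): serialize each problem canonically and intersect the train and test sets.

```python
# Sketch of a train/test overlap check via set intersection.
# `canonical` and `overlap_fraction` are illustrative names, not project code.

def canonical(a: int, b: int) -> str:
    """Canonical string key for an addition problem."""
    return f"{a}+{b}"

def overlap_fraction(train, test) -> float:
    """Fraction of test problems that also appear in the training data."""
    train_set = {canonical(a, b) for a, b in train}
    hits = sum(canonical(a, b) in train_set for a, b in test)
    return hits / len(test)

train = [(1, 2), (3, 4), (5, 6)]
test = [(1, 2), (7, 8), (9, 9), (5, 6)]
print(overlap_fraction(train, test))  # → 0.5
```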

Of course, if you add a feature to remove this overlap (I would be interested to see how the models learn without any of these small-digit additions included in the training data), please open a pull request. Please feel free to reopen the issue if you have further questions, or drop me an email.

Thanks,
Sean

@KevinZhoutianyi
Author

Thanks for your detailed reply!

@MartialTerran

Hi. I read your paper with interest. Because of the notorious failure of LLMs to compute accurately, I have been designing and training (on a gaming PC with one NVIDIA GPU) very small GPT-2 models from scratch in PyTorch to do arithmetic on fixed operand sizes (and without using any special positional embedding), including included_operations=["+", "-", "*", "∧", "~", "∨", "¬∨", "⊕", "≡", "="]. I then came upon your paper https://huggingface.co/papers/2405.17399 serendipitously via https://huggingface.co/collections/The-Great-Genius/skynet-66366061cc7af105efb7e0ca (I was looking for terminators). Speaking for myself, I am dissatisfied if my model makes any errors within the domain of the training set.
My focus has been on achieving perfect accuracy within the training domain, and I'm curious about the level of accuracy your models achieve within the 20x20-digit training domain.

My models do not seem to know what addition or subtraction means; they seem to just memorize all the training examples without any generalizing. This is typically accomplished using a 1-layer or 2-layer Transformer (so I do not have to wait long for one GPU to drive the loss down to an acceptable level). I probably do not train my models the same way you trained yours. I generally train down to an epoch-average loss below 0.04; at that point, every species of arithmetic computation will generally have a per-sample loss below 0.05, while the typical lowest loss for a species of arithmetic will be less than 0.0001.

I have noticed that the first batch in every epoch has the lowest loss, and the per-batch loss increases monotonically (e.g., almost linearly) as the batch number increases within the epoch. I suspect this phenomenon is independent of the order of presentation of operations and operands. I have not done any shuffling/randomization within my synthetic dataset; every epoch is the same sequence of operations and operands. Still, I am pleased to see in my small models that there are ways to obtain some reliably correct numerical computations out of the Transformer architecture.

I would not be satisfied with 99.999% accuracy on the training set, so I guess my interest is complementary to Kevin's: I want to see validation/testing performed with 100% overlap of the training domain (20x20). What I do not clearly see in your paper is a graph or a definite statement indicating that, within the 20x20 (operand size) training set, your models achieved 100% accuracy every time, for all trainings and computations, given operands of up to 20 digits. It would be illuminating if you drew a boundary line in your graphs and stated "100% accuracy in all computations and at all times within this zone".
20-digit computation is well beyond the common math burden of most science and engineering, so if you could certify that your models are able to produce reliable computations (add, subtract, multiply, ...) of up to 20x20 digits, that would itself be a valuable practical feature in LLMs that should be replicated in the commercial API models and in any open-source models. Are you pursuing further research into extending your Abacus embedding method into the domain of large textual LLMs? Are you pursuing patents? I have been a Patent Agent in the field of BSEE; given that I understand your method, I would be very able to help you write a patent application. [email protected]
P.S. I do not have the compute resources to train/reproduce your trained models, but I would like to examine them further. Could you publish to Hugging Face the checkpoints (parameters) of your best models, vocab.json, config.json, and a standalone model.py (model.eval) written in PyTorch with a checkpoint-loading feature, so that the pretrained models can be evaluated to measure how accurately they calculate within the 20x20-digit training domain?
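For what it's worth, the kind of in-domain exact-match measurement described above can be sketched in a few lines. This is a hedged illustration only: `model_answer` is a stand-in for a real model's decode step (here a perfect oracle), not the project's actual inference code.

```python
# Sketch: exact-match accuracy on uniformly sampled n-digit x m-digit additions.
# `model_answer` is a placeholder oracle; swap in real model inference to use this.
import random

def model_answer(a: int, b: int) -> str:
    return str(a + b)  # placeholder for a model's decoded output

def exact_match_accuracy(n_digits: int, m_digits: int, n_samples: int, seed: int = 0) -> float:
    """Fraction of sampled problems the model answers exactly right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        a = rng.randrange(10 ** (n_digits - 1), 10 ** n_digits)
        b = rng.randrange(10 ** (m_digits - 1), 10 ** m_digits)
        correct += model_answer(a, b) == str(a + b)
    return correct / n_samples

print(exact_match_accuracy(20, 20, 1000))  # oracle scores 1.0 by construction
```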
