Dataset #1

WH0226 · 2024-06-03T13:02:18Z

Hello, do the provided dataset links contain only tokenized data? Is the original (untokenized) data available as well?

mcleish7 · 2024-06-03T13:16:27Z

There should be a file named something like +_n_20_m_20_examples_20000000.txt inside the data folder, at the same level as the automated plots and the hf_tokenized_dataset folder. Each line of this .txt file is one sample of the dataset.

WH0226 · 2024-06-03T14:41:57Z

I found it, thanks a lot!

WH0226 · 2024-06-04T13:07:27Z

I have a small question. The additions in the +_n_20_m_20_examples_20000000.txt file seem to be written in reverse digit order.

For example, the first sample is:

10063+787995583888172117=887536583888172117

In standard digit order, I would expect:

36,001+711,271,888,385,599,787=711,271,888,385,635,788

Is there any reason for this design?

mcleish7 · 2024-06-04T13:35:40Z

All the data and code are formatted as needed for the experiments shown in the paper. In Section 3 of the paper we detail that we use a least-significant-digit-first (reversed) format for all numbers in addition. Please see generate_and_tokenize_data.sh and create_data_split.py for examples of how you might generate your own datasets if you require more specialised data.
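
To make the reversed format concrete, here is a minimal sketch of converting one sample line back to standard digit order. It is not taken from the repository: the helper names parse_sample, to_standard_order and to_reversed are hypothetical, and it assumes every line has the plain a+b=c shape of the sample quoted above.

```python
def parse_sample(line):
    """Split one 'a+b=c' line into its three digit strings (still reversed)."""
    lhs, result = line.strip().split("=")
    a, b = lhs.split("+")
    return a, b, result


def to_standard_order(digits):
    """Reverse a least-significant-digit-first string and read it as an int."""
    return int(digits[::-1])


def to_reversed(value):
    """Write a non-negative int with its least significant digit first."""
    return str(value)[::-1]


# The first sample quoted in this thread.
sample = "10063+787995583888172117=887536583888172117"

a, b, c = parse_sample(sample)
x, y, z = (to_standard_order(s) for s in (a, b, c))

# Once the digits are reversed back, the sum is ordinary addition.
assert x + y == z
print(f"{x:,}+{y:,}={z:,}")
# prints 36,001+711,271,888,385,599,787=711,271,888,385,635,788
```

Going the other way, to_reversed(36001) gives "10063", which matches the first operand in the file.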

mcleish7 closed this as completed Jun 3, 2024
