Dataset #1

Closed
WH0226 opened this issue Jun 3, 2024 · 4 comments

Comments

WH0226 commented Jun 3, 2024

Hello, do the provided dataset links contain only tokenized data? Is the original (untokenized) data available as well?

Owner

mcleish7 commented Jun 3, 2024

There should be a file named something like +_n_20_m_20_examples_20000000.txt inside the data folder, at the same level as the automated plots and the hf_tokenized_dataset folder. Each line of this .txt file is one sample of the dataset.
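
As a rough illustration of reading that file one sample per line, here is a minimal sketch. The helper name `read_samples` and the path in the usage comment are assumptions for illustration only; they are not part of the repository, and the real file name and location may differ.

```python
from itertools import islice
from pathlib import Path


def read_samples(path, limit=None):
    """Yield samples from a raw dataset file, one per non-empty line.

    `limit` caps how many samples are returned (None means all of them),
    which is handy because the file may hold millions of lines.
    """
    with Path(path).open("r", encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        non_empty = (line for line in stripped if line)
        yield from islice(non_empty, limit)


# Hypothetical usage; adjust the path to wherever the data folder lives:
# for sample in read_samples("data/+_n_20_m_20_examples_20000000.txt", limit=3):
#     print(sample)
```

Reading lazily with a generator avoids loading the whole file into memory at once.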

mcleish7 closed this as completed Jun 3, 2024

Author

WH0226 commented Jun 3, 2024

I found it, thanks a lot!

Author

WH0226 commented Jun 4, 2024

I have a small question. The numbers in the +_n_20_m_20_examples_20000000.txt file seem to be written with their digits in reverse order?

For example, the first sample is:

10063+787995583888172117=887536583888172117

In normal digit order, it would be:

36,001+711,271,888,385,599,787=711,271,888,385,635,788

Is there any reason for this design?

Owner

mcleish7 commented Jun 4, 2024

All the data and code are formatted as needed for the experiments shown in the paper. In Section 3 of the paper we detail that we use a least-significant-digit-first (reversed) format for all numbers in addition. Please see generate_and_tokenize_data.sh and create_data_split.py for examples of how you might generate your own datasets if you require more specialised data.
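
For anyone who wants to read the raw samples in conventional digit order, a minimal sketch follows. It assumes each sample has the plain form a+b=c with no padding or extra tokens, as in the example quoted above; the function names are hypothetical and do not come from the repository's scripts.

```python
def to_standard_order(sample: str) -> str:
    """Reverse the digits of each number in 'a+b=c', keeping '+' and '=' in place."""
    left, result = sample.strip().split("=")
    a, b = left.split("+")
    return f"{a[::-1]}+{b[::-1]}={result[::-1]}"


def is_correct_sum(sample: str) -> bool:
    """Check that a least-significant-digit-first sample encodes a valid addition."""
    left, result = to_standard_order(sample).split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(result)


sample = "10063+787995583888172117=887536583888172117"
print(to_standard_order(sample))  # 36001+711271888385599787=711271888385635788
print(is_correct_sum(sample))     # True
```

Because reversing a string twice returns the original, the same digit reversal also converts a standard-order sample back into the least-significant-digit-first format.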
