- To read and write datasets in the `jsonl` format (e.g., the Pile), run `cargo build --release --bin jsonl` to compile and generate the binary executable. Alternatively, run `cargo run --bin jsonl [args]` to compile and immediately run the executable.
- To read and write datasets in the `json.gz` format, use the default format or pass `--bin json-gz` instead.
- Accelerate deduplication across multiple files by passing `--threads`.
Troubleshooting:
- If `cargo` isn't installed on your system, run
  sudo snap install rustup --classic  # install rustup
  rustup default stable               # set the default toolchain
- If `cmake` isn't installed, run
  sudo apt update
  sudo apt install cmake
Commands for the Pile deduplication:
target/release/jsonl \
--bloom-filter-file filter.bff \
--bloom-filter-size 2147483648 \
--expected-ngram-count 1000000000 \
--output-directory deduped_Github/ \
--threads 32 \
$PATH_TO_PILE/Github/*.jsonl
Note: the suggested Bloom filter size for 1000000000 expected ngrams is 2147483648. I am not sure how to set it properly to achieve a good accuracy-efficiency trade-off.
A Bloom filter is a probabilistic data structure designed to test whether an element is a member of a set. It is very efficient in terms of space and time, but it allows some possibility of false positives (indicating that an element is in the set when it is not).
A Bloom filter starts as an array of bits, all set to 0, together with several hash functions. When you add an item, each hash function maps it to an index into the bit array, and the bits at those indices are set to 1. To check whether an item is in the set, process it with the same hash functions: if all the bits at the resulting indices are 1, the item is presumed to be in the set; if any bit is 0, the item is definitely not in the set.
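To make the mechanics concrete, here is a minimal, illustrative Bloom filter in Rust (a sketch, not the implementation in this repository); the filter size and number of hash functions are arbitrary, and the k hash functions are simulated by hashing each item under k different seeds with `DefaultHasher`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A toy Bloom filter: a bit array of `m` positions plus `k` hash functions.
/// Real implementations pack the bits tightly; a `Vec<bool>` keeps the sketch simple.
struct BloomFilter {
    bits: Vec<bool>,
    k: u64,
}

impl BloomFilter {
    fn new(m: usize, k: u64) -> Self {
        BloomFilter { bits: vec![false; m], k }
    }

    /// Simulate `k` hash functions by hashing (seed, item) pairs.
    fn indices<T: Hash>(&self, item: &T) -> Vec<usize> {
        (0..self.k)
            .map(|seed| {
                let mut h = DefaultHasher::new();
                seed.hash(&mut h);
                item.hash(&mut h);
                (h.finish() as usize) % self.bits.len()
            })
            .collect()
    }

    /// Adding an item sets the bit at each of its `k` indices to 1.
    fn insert<T: Hash>(&mut self, item: &T) {
        for i in self.indices(item) {
            self.bits[i] = true;
        }
    }

    /// If every bit is 1, the item is *probably* present (false positives possible);
    /// if any bit is 0, the item is definitely not present.
    fn contains<T: Hash>(&self, item: &T) -> bool {
        self.indices(item).iter().all(|&i| self.bits[i])
    }
}

fn main() {
    let mut filter = BloomFilter::new(1 << 20, 7); // 2^20 bits, 7 hash functions (arbitrary)
    filter.insert(&"the quick brown fox");
    assert!(filter.contains(&"the quick brown fox"));            // always true after insertion
    println!("{}", filter.contains(&"an ngram never inserted")); // false, barring a rare false positive
}
```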
The key is to balance the trade-off between the space requirement and the false positive rate. There are three main factors:
- $n$: the number of ngrams expected to be stored.
- $m$: the size of the Bloom filter in bits.
- $P$: the false positive probability; 0.01-0.02 is usually acceptable.
The number of hash functions $k$ and the false positive probability $P$ can be calculated as follows:

$$k = \frac{m}{n} \ln 2, \qquad P \approx \left(1 - e^{-kn/m}\right)^{k}$$

The optimal filter size $m$ for a target false positive probability $P$ is

$$m = -\frac{n \ln P}{(\ln 2)^2}$$
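The optimal size and hash count can be computed directly from these formulas; the following sketch (illustrative only, not part of the repository) reproduces the PubMed row of the lookup table below:

```rust
/// Plug n and P into the formulas above to get the filter size m (in bits)
/// and the number of hash functions k.
fn main() {
    let n: f64 = 6_500_000_000.0; // expected number of ngrams
    let p: f64 = 0.02;            // acceptable false positive probability

    let ln2 = std::f64::consts::LN_2;
    let m_bits = -(n * p.ln()) / (ln2 * ln2); // optimal filter size in bits
    let k = (m_bits / n) * ln2;               // optimal number of hash functions

    println!("m ~ {:.0} bits (~{:.1} GB)", m_bits, m_bits / 8.0 / 1e9);
    println!("k ~ {:.1} hash functions", k);
    // Prints roughly: m ~ 52925361687 bits (~6.6 GB), k ~ 5.6
}
```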
Here is a quick lookup table used for the scaling experiments:

| Domain | Num. Ngrams | False Positive Rate | Bloom Filter Size (bits) | Bloom Filter Size (GB) |
|---|---|---|---|---|
| Full | 1400000000000 | 0.02 | 11399309000000 | 1424 |
| PubMed | 6500000000 | 0.02 | 52925361687 | 6.6 |
Example script:
SECONDS=0
mkdir deduped_pubmed
target/release/jsonl \
--bloom-filter-file filter.bff \
--bloom-filter-size 52925361687 \
--expected-ngram-count 6500000000 \
--output-directory deduped_pubmed/ \
--threads 64 \
/mnt/md-256k/massive_ds_data/full/pubmed/*.jsonl
echo "The script took $SECONDS seconds to execute."
The big friendly filter 😁
- Install Rust on your machine:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Add `~/.cargo/bin` to your `PATH` environment variable.
- Run `cargo build --release`. It places the binary at `target/release/bff`.
- Run `./target/release/bff --help` to see the available options.
This is how you deduplicate a file against itself:
target/release/bff \
--bloom-filter-file filter.bff \
--bloom-filter-size 274877906944 \
--expected-ngram-count 1000000000 \
--output-directory deduped/ \
input.json.gz
This creates the filter at filter.bff
, with a size of 256 GB.
This size should be a little smaller than the amount of main memory you have.
It calculates the optimal setup for the filter based on the expected number of ngrams.
Getting that number right is very important.
If in doubt, guess high; it is safer to overestimate than to underestimate.
The filter will be created in memory, and only written to disk at the end of the job.
To get a lot of speed out of bff
, you have to process multiple files at once:
target/release/bff \
--bloom-filter-file filter.bff \
--bloom-filter-size 274877906944 \
--expected-ngram-count 1000000000 \
--output-directory deduped/ \
*.json.gz
Each input file will run in its own thread, and the filter will be shared between them. In the end, as before, the filter will be written to disk.
You can stick ngrams into the filter ahead of time, for example if you want to decontaminate your dataset:
target/release/bff \
--bloom-filter-file decontaminating_filter.bff \
--bloom-filter-size 274877906944 \
--expected-ngram-count 1000000000 \
--output-directory deduped/ \
--filtering-threshold 1.0 \
my_test_set.json.gz
This will copy the output unchanged to the deduped/
directory, but it will also produce a filter that you can use afterwards.
It is important that you still take a good guess at the ngram count you expect to see when you do the actual
deduplication.
The parameters of the Bloom filter are baked in when you first create the file, so you have to guess right the first time.
If you only want to decontaminate, but not deduplicate against itself, you can do that by using the filter you just created in the previous step:
target/release/bff \
--bloom-filter-file decontaminating_filter.bff \
--bloom-filter-size 274877906944 \
--expected-ngram-count 1000000000 \
--output-directory deduped/ \
--update-bloom-filter false \
*.json.gz
If you are using the filter this way, you can use the number of ngrams in the decontamination set for the
--expected-ngram-count
parameter.
Since this is usually much smaller, it might make the filter run faster.