Command line tools to manipulate the data from these multi-billion-password collections.
The full processing takes a couple of days and generates a file structure that can be queried in almost O(1) time.
$ <query> [email protected]
[email protected]:toto123
The total number of unique records in the final dataset (Collections 1 to 5 + AntiPublic + Breach Compilation) is around 3.37 billion (3,372,591,561 to be precise).
Create a virtual environment and install the package.
virtualenv -p python3 venv
source venv/bin/activate
make install
mkdir -p extracted
find "$1" -name '*.tar.gz' -exec tar -xzvf '{}' -C extracted \;
find . -name "*.rar" -exec unrar x -o+ {} \;
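The two extraction commands above can be wrapped into one small script. This is only a sketch: it assumes unrar is installed and that everything should land in ./extracted:

```shell
#!/usr/bin/env bash
# Extract every .tar.gz and .rar found under $1 (default: current
# directory) into ./extracted.
set -euo pipefail
src="${1:-.}"
mkdir -p extracted
find "$src" -name '*.tar.gz' -exec tar -xzvf '{}' -C extracted \;
find "$src" -name '*.rar' -exec unrar x -o+ '{}' extracted/ \;
```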
Processing Collection 1 is much faster than Collections 2-5. The estimates below are for Collections 2-5.
The parsing took around 20 hours on my server (i7-8700K CPU, 32GB of memory). I didn't have a large enough SSD to store all the temporary computations, so everything was done on a standard HDD. A faster disk would definitely speed up the processing.
The sorting/removing duplicates step took 15 hours in total.
The splitting into smaller files (this file structure is what makes every query almost instantaneous) took a couple of hours at most.
In total, expect around 2 days to process Collections 2-5.
breach parse --path /path/to/extracted --success_file success.txt --failure_file failure.txt --cython_acceleration
rm -rf tmp && mkdir tmp # you need around 750GB free in tmp/. By default /tmp is not large enough for this!
cat success.txt | pv -cN cut | sort -T tmp -u -S 90% --parallel=12 | pv -cN cut > success_sorted.txt
breach split --file success_sorted.txt --out data
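In the pipeline above, sort -u is what removes the duplicates; pv only displays progress and can be dropped, while -T tmp, -S 90% and --parallel=12 are tuning flags (temporary directory, memory budget, worker count). A toy run of the same deduplication on a three-line input:

```shell
# Same dedup logic as the pipeline above, on a tiny input:
# `sort -u` sorts the records and drops exact duplicates in one pass.
mkdir -p tmp
printf 'b@y.com:pw2\na@x.com:pw1\na@x.com:pw1\n' > success_small.txt
sort -T tmp -u success_small.txt > success_small_sorted.txt
cat success_small_sorted.txt
# a@x.com:pw1
# b@y.com:pw2
```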
The dataset is available here: https://github.com/philipperemy/tensorflow-1.4-billion-password-analysis
It's easy to convert the large BreachCompilation dataset to this format by running these commands. Expect them to take some time (less than a day).
find /path/to/BreachCompilation/ -type f -exec cat {} + > breach_compilation.txt
rm -rf tmp && mkdir tmp # By default /tmp/ is not enough for this!
cat breach_compilation.txt | pv -cN cut | sort -T tmp -u -S 90% --parallel=12 | pv -cN cut > breach_compilation_sorted.txt
breach split --file breach_compilation_sorted.txt --out data_breach_compilation_sorted
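Before feeding the sorted file to breach split, a cheap sanity check can catch a failed sort early. sort -c verifies the order without re-sorting; this is just an illustration, not part of the pipeline:

```shell
# Verify the file is sorted and duplicate-free before splitting.
f="breach_compilation_sorted.txt"
if [ -f "$f" ]; then
  sort -u -c "$f" && echo "$f: sorted and unique"
  echo "records: $(wc -l < "$f")"
fi
```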
From there, a simple breach merge will be enough to merge it into the Collections 1 and 2-5 datasets.
Run Collection 1 and Collections 2-5 through the processing step described above. You will end up with two directories: /path/to/collections1_data and /path/to/collections2_5_data. Additionally, if you have the BreachCompilation dataset, you will have a third directory, /path/to/data_breach_compilation_sorted, generated by the step above.
The merge is destructive, so first create a copy of one dataset to serve as the output, then merge each remaining dataset into it.
cp -rf /path/to/collections1_data /path/to/big_dataset
breach merge --src /path/to/collections2_5_data --dest /path/to/big_dataset
breach merge --src /path/to/data_breach_compilation_sorted --dest /path/to/big_dataset
The manual of the command line tool can be printed by running breach dumphelp.
Usage: cli [OPTIONS] COMMAND [ARGS]...
Options:
--debug / --no-debug
--help Show this message and exit.
Commands:
chunk chunk large TXT files into smaller files.
clean Cleans a query friendly folder PATH. Move incorrect records and
sort the files.
dumphelp
evaluate Evaluates some metrics such as precision/recall (e.g. is OLD into
NEW).
merge Merges dataset SRC into dataset DEST.
parse Parses an unstructured folder PATH of many files and generates two
files: SUCCESS_FILE and FAILURE_FILE. All valid email:password
will go to SUCCESS_FILE.
sort Sorts a query friendly folder PATH. Target is itself.
split Converts a large FILE to a query friendly folder OUT (e.g. a/b/c).
Use RESTART_FROM to resume from the i-th line.
test Infers passwords of a list of emails defined in FILE with a query
friendly folder DATASET.
Usage: cli dumphelp [OPTIONS]
Options:
--help Show this message and exit.
Usage: cli split [OPTIONS]
Options:
--file FILE [required]
--out DIRECTORY [required]
--restart_from INTEGER [default: 0]
--help Show this message and exit.
Usage: cli chunk [OPTIONS]
Options:
--path DIRECTORY [required]
--size INTEGER [default: 50]
--help Show this message and exit.
Usage: cli sort [OPTIONS]
Options:
--path DIRECTORY [required]
--help Show this message and exit.
Usage: cli clean [OPTIONS]
Options:
--path DIRECTORY [required]
--help Show this message and exit.
Usage: cli test [OPTIONS]
Options:
--file FILE [required]
--dataset [breach_compilation|collections_1|collections_2_5|all]
[required]
--help Show this message and exit.
Usage: cli parse [OPTIONS]
Options:
--path DIRECTORY [required]
--success_file FILE [required]
--failure_file FILE [required]
--cython_acceleration / --no-cython_acceleration
--help Show this message and exit.
Usage: cli merge [OPTIONS]
Options:
--src DIRECTORY [required]
--dest DIRECTORY [required]
--help Show this message and exit.
Usage: cli evaluate [OPTIONS]
Options:
--old DIRECTORY [required]
--new DIRECTORY [required]
--help Show this message and exit.