Deduplication via content addressable object store #15086

Open
hexylena opened this issue Nov 30, 2022 · 21 comments

@hexylena
Member

hexylena commented Nov 30, 2022

We've already got the infrastructure for storing datasets by UUID (and making the associated subdirectories). If, instead of the UUID, we stored by SHA-256 checksum, we could have instant space deduplication.

It'd be neat to add this as a backend option for the Object Store that one could choose to use for storage. Before datasets went to this backend they'd get hashed (SHA-256? Something else? Multiple hashes combined?) and stored at a path based on that hash, just like how the UUID backend works. It could even leverage the dataset_hash table (though I'm not sure how that gets populated? Is there a flag I can enable somewhere?)

A first pass wouldn't need to be particularly smart (e.g. it wouldn't need to reject a file before transfer just because the hash matched something already stored); it could accept the file, decide internally that it was already stored, and just update the reference to it.
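
Roughly, the write path could look something like this (a minimal sketch, not Galaxy's actual object store API; the helper names are made up, and the sharded /x/y/z/dataset_*.dat layout mirrors what the UUID backend already does):

```python
import hashlib
import os
import shutil


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cas_path(root, sha256_hex):
    """Shard by the first characters of the hash, mirroring the existing
    /x/y/z/dataset_<uuid>.dat layout."""
    return os.path.join(root, sha256_hex[0], sha256_hex[1], sha256_hex[2],
                        f"dataset_{sha256_hex}.dat")


def store(root, incoming_file):
    """Accept the file, then decide internally whether it was already stored."""
    digest = sha256_of_file(incoming_file)
    target = cas_path(root, digest)
    if os.path.exists(target):
        os.remove(incoming_file)  # duplicate content: keep the existing copy
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(incoming_file, target)
    return digest, target
```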

(I'd suggest IPFS, but the performance numbers I've seen are staggeringly abysmal, and all I really want is the CAS portion.)

This issue arises because I have a user who has re-run Trim Galore! across the same dataset multiple times (it's part of the workflow!), which generates giant files that eat through our storage. I end up with a situation where all of these outputs are bit-for-bit identical:

bfc4cab41e2341844d73b247473e3a61  /data/galaxy/5/c/3/dataset_5c3cae0b-b6a0-4fb5-b2fc-b7f30baec184.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/9/0/f/dataset_90f24ac2-6fca-4af0-b36c-0deeffb7a351.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/7/3/c/dataset_73c4d8d5-db25-46e8-a1bb-6bbcbbe2e322.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/f/6/b/dataset_f6b552d0-1221-464e-a9df-c11445ed23fb.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/4/f/5/dataset_4f5b0ac7-7f10-4642-b509-7f2bc71b7624.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/0/6/a/dataset_06a56572-2d5b-4551-9cc2-e7a55ce37a25.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/1/b/b/dataset_1bbe30e8-ddb4-45d2-b8fe-f44d79b99622.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/a/c/3/dataset_ac3b0d84-4f9e-4c73-bf5a-8dbd149af17d.dat

If they'd been stored by hash, we'd still have wasted the compute time, but we wouldn't have wasted the storage space (more precious in this scenario). To avoid wasting the compute time as well, we'd need improvements from #6887.

@natefoo
Member

natefoo commented Dec 13, 2022

Related, though maybe not the right issue: we could get a big win from using the precomputed hashes of input sources that have them, like S3. E.g. when importing a file from a bucket using the S3 file source plugin, four hashes are precomputed (including CRC32 and SHA-256), so we could skip the entire download if the data are already imported.
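
Something along these lines (a sketch assuming boto3; S3 only exposes ChecksumSHA256 if the uploader computed it, and for multipart uploads it may be a composite rather than a whole-file hash; known_hashes here is a hypothetical stand-in for a lookup against Galaxy's dataset_hash table):

```python
import base64

import boto3

s3 = boto3.client("s3")


def precomputed_sha256(bucket, key):
    """Return the object's SHA-256 as hex if S3 has one stored, else None."""
    head = s3.head_object(Bucket=bucket, Key=key, ChecksumMode="ENABLED")
    b64 = head.get("ChecksumSHA256")
    return base64.b64decode(b64).hex() if b64 else None


def import_from_bucket(bucket, key, known_hashes):
    """known_hashes: hash -> dataset, standing in for a dataset_hash query."""
    remote = precomputed_sha256(bucket, key)
    if remote and remote in known_hashes:
        return known_hashes[remote]  # content already imported: skip the download
    # ...otherwise fall back to downloading and hashing locally
```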

@hexylena
Member Author

ah that's fantastic. another great win. (same for TUS, we can get hashes from that too, right?)

@nsoranzo
Member

I think when we discussed this a few weeks ago at a Backend WG meeting we didn't reach a conclusion, but there were concerns about having to calculate hashes for big files as part of dataset upload/creation. It was also suggested that this may be better left to the file system layer (if using one that supports it).
For datasets that are tool outputs, I think the job cache is the way to go, but for uploads where the hash is calculated anyway (like the S3 file source plugin and/or TUS mentioned above), that would indeed be a nice feature.

@hexylena
Member Author

Not sure what all was discussed in that WG, but as an admin I think I'm fine with my users waiting even an extra 30 minutes to get a checksum on the large file they just spent 5 hours uploading, given the trade-offs it would give me without having to rebuild my infra on a checksumming/deduplicating filesystem.

I think the job cache is the way to go,

Yeah, agreed, but I guess that will maybe never support cross-user-account "deduplication", right? Even in a perfect world of identical hashes, we'd probably still limit it to per-user just to prevent unintentional data leaks, right?

@natefoo
Member

natefoo commented Sep 20, 2023

We don't even have to wait - unchecksummed data can go into a temporary "unhashed" object store and then be moved once the checksum/hash is complete and nothing is using it as an input.

I still think this would be an enormous win for the big servers.
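
A sketch of what that promotion pass might look like (hypothetical helper names, nothing here is existing Galaxy code; active_inputs is assumed to come from a query for running jobs):

```python
import os
import shutil


def promote_unhashed(unhashed_root, cas_root, active_inputs, sha256_of_file, cas_path):
    """One pass of a hypothetical background task: hash anything sitting in the
    temporary "unhashed" store and move it to its content-addressed path, but
    only if no running job is using it as an input."""
    for name in os.listdir(unhashed_root):
        src = os.path.join(unhashed_root, name)
        if src in active_inputs:
            continue  # leave in-use datasets alone until those jobs finish
        digest = sha256_of_file(src)
        dst = cas_path(cas_root, digest)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if os.path.exists(dst):
            os.remove(src)  # content already present: deduplicate
        else:
            shutil.move(src, dst)
        # ...then record `digest` on the dataset so later lookups resolve to dst
```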

@hexylena
Member Author

Agreed. And yeah, that would make migrating to a CAS easier, if things could be renamed once hashed.

@natefoo
Member

natefoo commented Oct 10, 2023

Just to put some numbers on this, I wrote a script to find all duplicate data in a directory so we could get some kind of idea as to how much space we would save with it. Unfortunately, the script got interrupted at around 19% complete (and it's just a big find so there's no way to resume) walking the ~1.75 PB corral4 object store backend on .org. At the point it died, it was estimating we'd save about 133 TB from deduplication. That number was going up every time I recalculated.

cascalc2.sh
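
Conceptually the scan boils down to something like this (a rough Python equivalent; the actual cascalc2.sh is built around a big find and may differ in the details):

```python
import hashlib
import os
from collections import defaultdict


def wasted_bytes(root):
    """Group files by content hash and count every byte beyond the first copy."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()  # md5 is plenty for a space estimate
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    return sum(os.path.getsize(paths[0]) * (len(paths) - 1)
               for paths in by_hash.values() if len(paths) > 1)
```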

@hexylena
Member Author

Just to check my understanding: the lower bound is 133 TB (saved) / 1750 TB (assuming the full object store had been processed), so already roughly 8% saved by moving to a CAS? And knowing it'd be higher, since we didn't scan the full 1750 TB?

@natefoo
Member

natefoo commented Oct 11, 2023

That's correct, yes.

@hexylena
Member Author

Crikey, that's a massive number.

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

Unpopular opinion: if it's not at least a 50% saving, it's not worth the headache? We wouldn't do something like this for speedups if it made the architecture harder.

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

(but I am all for calculating hashes for new data / if the object store supports it)

@hexylena
Member Author

That might indeed be an unpopular opinion, especially among folks paying for storage 😆

Your concern is that it makes the architecture harder? Could you elaborate on that? It would be really interesting (as an admin) to hear what you think it would take / how else it would impact the system!

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

Could you elaborate on that,

We create the datasets up front, so we don't have the hash. We can't rely on the object store hash if we don't have it, so it's effectively something that needs to be coordinated out of band... which is maybe what you'd want to do? Use Nate's script and create e.g. hardlinks?
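
For the out-of-band route, the hardlink step could be as simple as this (a sketch; it assumes everything is on one filesystem and that nothing is writing the files while they're swapped):

```python
import os


def hardlink_duplicates(groups):
    """groups: lists of paths with identical content (e.g. from a scan like
    cascalc2.sh above). Replace each duplicate with a hard link to the first
    copy, so the bytes are stored once but every UUID path keeps working."""
    for paths in groups:
        canonical = paths[0]
        for dup in paths[1:]:
            tmp = dup + ".linktmp"
            os.link(canonical, tmp)  # hard links require the same filesystem
            os.replace(tmp, dup)     # atomically swap the duplicate for the link
```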

@natefoo
Member

natefoo commented Oct 13, 2023

Per our offline conversations about this, the main hurdle is the timing of moving the data to its CAS path in a manner that doesn't break anything. You can mostly do it safely if it's not a job input, although we don't have any locking to avoid race conditions. It would also be an issue with downloading, where we really don't know when the file is being read.

The issues aren't insurmountable but might be easier to solve in my head than in reality.

EDIT: That said, if we want to calculate a hash for everything anyway, then there is even less reason not to do this.

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

You can mostly safely do it if it's not a job input, although we don't have any locking to avoid race conditions

It's a similar problem to cleaning up the object store cache. IIRC someone had a WIP PR that was at least excluding active jobs.

@hexylena
Member Author

thanks for the elaborations! appreciate it

@natefoo
Member

natefoo commented Oct 13, 2023

It could even be as simple as "hardlink the hash path, which will be used for all subsequent inputs/downloads, and then remove the UUID link after a month, out of band." How often is that going to cause problems?

@natefoo
Member

natefoo commented Oct 13, 2023

Actually... in theory we don't even need to remove the UUID path if it's a hard link.

@hexylena
Member Author

Actually... in theory we don't even need to remove the UUID path if it's a hard link.

I mean, yeah, that's really what we want: the backing store is a CAS, with a front end that looks like a normal object store and presents the usual UUID-named files as hardlinks into the CAS proper. Just a process of cp and then atomically swapping that copy with a hard link.
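
A sketch of that adoption step (hypothetical function, assuming one filesystem; cas_path would be the same hash-sharding helper sketched earlier in this thread):

```python
import os


def adopt_into_cas(uuid_path, digest, cas_root, cas_path):
    """After a dataset at its usual UUID path has been hashed: make sure the
    content exists under the CAS path, then leave the UUID path in place as a
    hard link to it. Readers keep using the UUID path; duplicates share one
    inode, so the bytes are only stored once."""
    target = cas_path(cas_root, digest)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not os.path.exists(target):
        os.link(uuid_path, target)  # first copy of this content: publish it
    elif not os.path.samefile(uuid_path, target):
        tmp = uuid_path + ".castmp"
        os.link(target, tmp)        # duplicate content: build the link first
        os.replace(tmp, uuid_path)  # atomic swap, so readers never see a gap
```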

@hexylena
Member Author

I was told of some issues the IRIDA folks faced with "how to manage large datasets being repeatedly copied into Galaxy by different users from an external system, to analyse in Galaxy", and just want to note that use case here: it would be completely solved by a CAS. They could repeatedly upload (or maybe not even upload, if hashing is implemented in a nice way and they could provide dataset hashes) and not worry about the storage usage on the Galaxy side.

But they're moving to Nextflow for the next version, so maybe it isn't relevant anymore.
