Deduplication via content addressable object store #15086

Open
hexylena opened this issue Nov 30, 2022 · 21 comments

@hexylena
Member

hexylena commented Nov 30, 2022

We've already got the infrastructure for storing datasets by UUID (and making the associated subdirectories). If, instead of the UUID, we stored by SHA-256 checksum, we could have instant space deduplication.

It'd be neat to add this as a backend option for the Object Store that one could choose to use for storage. Before datasets went to this backend they'd get hashed (SHA-256? Something else? Multiple hashes combined?) and stored at a path based on that hash, just like how the UUID backend works. It could even leverage the dataset_hash table (though I'm not sure how that gets populated? Is there a flag I can enable somewhere?)

A first pass wouldn't need to be particularly smart (e.g. it wouldn't need to reject a file before transfer just because the hash matched something already stored); it could accept the file, decide internally that it was already stored, and just update the reference to it.
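
Roughly, the write path could look something like this (a minimal sketch, not Galaxy's actual object store API; the helper names are made up, and the sharded /x/y/z/dataset_*.dat layout mirrors what the UUID backend already does):

```python
import hashlib
import os
import shutil


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cas_path(root, sha256_hex):
    """Shard by the first characters of the hash, mirroring the existing
    /x/y/z/dataset_<uuid>.dat layout."""
    return os.path.join(root, sha256_hex[0], sha256_hex[1], sha256_hex[2],
                        f"dataset_{sha256_hex}.dat")


def store(root, incoming_file):
    """Accept the file, then decide internally whether it was already stored."""
    digest = sha256_of_file(incoming_file)
    target = cas_path(root, digest)
    if os.path.exists(target):
        os.remove(incoming_file)  # duplicate content: keep the existing copy
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(incoming_file, target)
    return digest, target
```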

(I'd suggest IPFS, but the performance numbers I've seen are staggeringly abysmal, and all I really want is the CAS portion.)

This issue arises because I have a user who has re-run Trim Galore! across the same dataset multiple times (it's part of the workflow!), which generates giant files that eat through our storage. I end up with a situation where all of these outputs are bit-for-bit identical:

bfc4cab41e2341844d73b247473e3a61  /data/galaxy/5/c/3/dataset_5c3cae0b-b6a0-4fb5-b2fc-b7f30baec184.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/9/0/f/dataset_90f24ac2-6fca-4af0-b36c-0deeffb7a351.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/7/3/c/dataset_73c4d8d5-db25-46e8-a1bb-6bbcbbe2e322.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/f/6/b/dataset_f6b552d0-1221-464e-a9df-c11445ed23fb.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/4/f/5/dataset_4f5b0ac7-7f10-4642-b509-7f2bc71b7624.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/0/6/a/dataset_06a56572-2d5b-4551-9cc2-e7a55ce37a25.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/1/b/b/dataset_1bbe30e8-ddb4-45d2-b8fe-f44d79b99622.dat
bfc4cab41e2341844d73b247473e3a61  /data/galaxy/a/c/3/dataset_ac3b0d84-4f9e-4c73-bf5a-8dbd149af17d.dat

If they'd been stored by hash, we'd still have wasted the compute time, but we wouldn't have wasted the storage space (more precious in this scenario). To avoid wasting the compute time as well, we'd need improvements from #6887.

@natefoo
Member

natefoo commented Dec 13, 2022

Related, though maybe not the right issue: we could get a big win from using the precomputed hashes of input sources that have them, like S3. E.g. when importing a file from a bucket using the S3 file source plugin, four hashes are precomputed (including CRC32 and SHA-256), so we could skip the entire download if the data are already imported.
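
Something along these lines (a sketch assuming boto3; S3 only exposes ChecksumSHA256 if the uploader computed it, and for multipart uploads it may be a composite rather than a whole-file hash; known_hashes here is a hypothetical stand-in for a lookup against Galaxy's dataset_hash table):

```python
import base64

import boto3

s3 = boto3.client("s3")


def precomputed_sha256(bucket, key):
    """Return the object's SHA-256 as hex if S3 has one stored, else None."""
    head = s3.head_object(Bucket=bucket, Key=key, ChecksumMode="ENABLED")
    b64 = head.get("ChecksumSHA256")
    return base64.b64decode(b64).hex() if b64 else None


def import_from_bucket(bucket, key, known_hashes):
    """known_hashes: hash -> dataset, standing in for a dataset_hash query."""
    remote = precomputed_sha256(bucket, key)
    if remote and remote in known_hashes:
        return known_hashes[remote]  # content already imported: skip the download
    # ...otherwise fall back to downloading and hashing locally
```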

@hexylena
Member Author

ah that's fantastic. another great win. (same for TUS, we can get hashes from that too, right?)

@nsoranzo
Member

I think when we discussed this a few weeks ago at a Backend WG meeting we didn't reach a conclusion, but there were concerns about having to calculate hashes for big files as part of dataset upload/creation. It was also suggested that this may be better left to the file system layer (if using one that supports it).
For datasets that are tool outputs, I think the job cache is the way to go, but for uploads where the hash is calculated anyway (like the S3 file source plugin and/or TUS mentioned above), that would indeed be a nice feature.

@hexylena
Member Author

Not sure what all was discussed in that WG, but as an admin I think I'm fine with my users waiting even an extra 30 minutes to get a checksum on the large file they just spent 5 hours uploading, given the trade-offs it would give me without having to rebuild my infra on a checksumming/deduplicating filesystem.

I think the job cache is the way to go,

Yeah, agreed, but I guess that will maybe never support cross-user-account "deduplication", right? Even in a perfect world of identical hashes, we'd probably still limit it to per-user just to prevent unintentional data leaks, right?

@natefoo
Member

natefoo commented Sep 20, 2023

We don't even have to wait - unchecksummed data can go into a temporary "unhashed" object store and then be moved once the checksum/hash is complete and nothing is using it as an input.

I still think this would be an enormous win for the big servers.
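
A sketch of what that promotion pass might look like (hypothetical helper names, nothing here is existing Galaxy code; active_inputs is assumed to come from a query for running jobs):

```python
import os
import shutil


def promote_unhashed(unhashed_root, cas_root, active_inputs, sha256_of_file, cas_path):
    """One pass of a hypothetical background task: hash anything sitting in the
    temporary "unhashed" store and move it to its content-addressed path, but
    only if no running job is using it as an input."""
    for name in os.listdir(unhashed_root):
        src = os.path.join(unhashed_root, name)
        if src in active_inputs:
            continue  # leave in-use datasets alone until those jobs finish
        digest = sha256_of_file(src)
        dst = cas_path(cas_root, digest)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if os.path.exists(dst):
            os.remove(src)  # content already present: deduplicate
        else:
            shutil.move(src, dst)
        # ...then record `digest` on the dataset so later lookups resolve to dst
```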

@hexylena
Member Author

Agreed. And yeah, that would make migrating to a CAS easier, if things could be renamed once hashed.

@natefoo
Member

natefoo commented Oct 10, 2023

Just to put some numbers on this, I wrote a script to find all duplicate data in a directory so we could get some kind of idea as to how much space we would save with it. Unfortunately, the script got interrupted at around 19% complete (and it's just a big find so there's no way to resume) walking the ~1.75 PB corral4 object store backend on .org. At the point it died, it was estimating we'd save about 133 TB from deduplication. That number was going up every time I recalculated.

cascalc2.sh
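
Conceptually the scan boils down to something like this (a rough Python equivalent; the actual cascalc2.sh is built around a big find and may differ in the details):

```python
import hashlib
import os
from collections import defaultdict


def wasted_bytes(root):
    """Group files by content hash and count every byte beyond the first copy."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()  # md5 is plenty for a space estimate
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    return sum(os.path.getsize(paths[0]) * (len(paths) - 1)
               for paths in by_hash.values() if len(paths) > 1)
```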

@hexylena
Member Author

Just to check my understanding: the lower bound is 133 TB (saved) / 1750 TB (assuming the full object store had been processed), so already roughly 8% saved by moving to a CAS? And knowing it'd be higher, since we didn't scan the full 1750 TB?

@natefoo
Member

natefoo commented Oct 11, 2023

That's correct, yes.

@hexylena
Member Author

Crikey, that's a massive number.

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

Unpopular opinion: if it's not at least a 50% saving, it's not worth the headache? We wouldn't do something like this for speedups if it made the architecture harder.

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

(but I am all for calculating hashes for new data / if the object store supports it)

@hexylena
Member Author

That might indeed be an unpopular opinion, especially among folks paying for storage 😆

Your concern is that it makes the architecture harder? Could you elaborate on that? It would be really interesting (as an admin) to hear what you think it would take / how else it would impact the system!

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

Could you elaborate on that,

We create the datasets up front, so we don't have the hash. We can't rely on the object store hash if we don't have it, so it's effectively something that needs to be coordinated out of band... which is maybe what you'd want to do? Use Nate's script and create e.g. hardlinks?
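
For the out-of-band route, the hardlink step could be as simple as this (a sketch; it assumes everything is on one filesystem and that nothing is writing the files while they're swapped):

```python
import os


def hardlink_duplicates(groups):
    """groups: lists of paths with identical content (e.g. from a scan like
    cascalc2.sh above). Replace each duplicate with a hard link to the first
    copy, so the bytes are stored once but every UUID path keeps working."""
    for paths in groups:
        canonical = paths[0]
        for dup in paths[1:]:
            tmp = dup + ".linktmp"
            os.link(canonical, tmp)  # hard links require the same filesystem
            os.replace(tmp, dup)     # atomically swap the duplicate for the link
```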

@natefoo
Member

natefoo commented Oct 13, 2023

Per our offline conversations about this, the main hurdle is the timing of moving the data to its CAS path in a manner that doesn't break anything. You can mostly do it safely if it's not a job input, although we don't have any locking to avoid race conditions. It would also be an issue with downloading, where we really don't know when the file is being read.

The issues aren't insurmountable but might be easier to solve in my head than in reality.

EDIT: That said, if we want to calculate a hash for everything anyway, then there is even less reason not to do this.

@mvdbeek
Member

mvdbeek commented Oct 13, 2023

You can mostly safely do it if it's not a job input, although we don't have any locking to avoid race conditions

It's a similar problem to cleaning up the object store cache. IIRC someone had a WIP PR that was at least excluding active jobs.

@hexylena
Member Author

thanks for the elaborations! appreciate it

@natefoo
Member

natefoo commented Oct 13, 2023

It could even be as simple as "hardlink the hash path, which will be used for all subsequent inputs/downloads, and then remove the UUID link after a month, out of band." How often is that going to cause problems?

@natefoo
Member

natefoo commented Oct 13, 2023

Actually... in theory we don't even need to remove the UUID path if it's a hard link.

@hexylena
Member Author

Actually... in theory we don't even need to remove the UUID path if it's a hard link.

I mean, yeah, that's really what we want: the backing store is a CAS, with a front end that looks like a normal object store and presents the usual UUID-named files as hardlinks into the CAS proper. Just a process of cp and then atomically swapping that copy with a hard link.
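
A sketch of that adoption step (hypothetical function, assuming one filesystem; cas_path would be the same hash-sharding helper sketched earlier in this thread):

```python
import os


def adopt_into_cas(uuid_path, digest, cas_root, cas_path):
    """After a dataset at its usual UUID path has been hashed: make sure the
    content exists under the CAS path, then leave the UUID path in place as a
    hard link to it. Readers keep using the UUID path; duplicates share one
    inode, so the bytes are only stored once."""
    target = cas_path(cas_root, digest)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not os.path.exists(target):
        os.link(uuid_path, target)  # first copy of this content: publish it
    elif not os.path.samefile(uuid_path, target):
        tmp = uuid_path + ".castmp"
        os.link(target, tmp)        # duplicate content: build the link first
        os.replace(tmp, uuid_path)  # atomic swap, so readers never see a gap
```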

@hexylena
Member Author

I was told of some issues the IRIDA folks faced with "how to manage large datasets being repeatedly copied into Galaxy by different users from an external system, to analyse in Galaxy", and just want to note that use case here: it would be completely solved by a CAS. They could repeatedly upload (or maybe not even upload, if hashing is implemented in a nice way and they could provide dataset hashes) and not worry about the storage usage on the Galaxy side.

But they're moving to Nextflow for the next version, so maybe it isn't relevant anymore.
