-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
blank nodes reuse memory addresses, causing problems for persistent stores #418
Comments
Yes, indeed, the offending code is here: Line 83 in 2394451
If you're simply serializing with the default IDs, then, yes, that could be a problem.
It's common for parsers to use such a pattern to distinguish the blank nodes read from those created, even though they may use the same identifiers, they become different objects. For example: https://github.com/ruby-rdf/rdf-turtle/blob/develop/lib/rdf/turtle/reader.rb#L229-L233
No, not really, skolemized BNodes are intended to be used in serializations, which are IRIs, that can be turned back into the original BNodes. It makes sense for persistent stores, where there is some row identifier that can be retrieved. They are also not generic, but specific to a given server instance. I don't think they're used too much, and screw up things like lists, which depend on their being BNodes and not URIs. identifiers that are passed along to the persistent store, and translated back on the way back out. ... but yes, sort of, in the way you describe using them. This mapping would be kept around for the duration of the process. As for the guaranteed-unique identifiers, the answer is of course UUIDs (which indeed the current
Maybe it makes sense to just change the default naming scheme to do something like this, and get away from object identifiers altogether. Would that obviate the need for a separate module? Otherwise, then sure, a module with defined behavior that could be included in a Repository/DataStore implementation might be useful. |
I thought about that, but the user can always define the bnode identifier to be whatever they want. If that gets passed through as-is to a persistent store then that can cause problems (e.g. I'm just thinking it's a common pattern just like you highlighted from the Turtle reader; it would merit an |
Please go ahead and do a PR for this, then. @no-reply may want to chime in on it. |
Here's another good one: I created
rdf-lmdb
a while ago but only recently noticed, incidentally when I had a structure containing a lot ofRDF::List
s, that blank node IDs got derived from memory addresses that would reliably repeat across runs on Linux (although only occasionally on macOS which is why it took me so long to notice). The net effect is that I would get bnode structures in persistent storage that were all snarled up together.Now, I should probably go and make some kind of bnode mapping table in
rdf-lmdb
, but I'm also wondering if it would make sense to generalize such a mapping table, since it's a common thing. That, or at least have some way to overrideRDF::Node
's ID generating function in the singleton class (which is what I ended up doing to solve my short-term problem).Anyway, I think what I am proposing is a module that one can
include
into, for example, a persistentRDF::Repository
, that will automatically map whatever bnodes get thrown at it to guaranteed-unique (does that count as skolemized?) identifiers that are passed along to the persistent store, and translated back on the way back out. This mapping would be kept around for the duration of the process. As for the guaranteed-unique identifiers, the answer is of course UUIDs (which indeed the currentRDF::Node
implementation is already capable of), though because they are shorter, I propose the using theuuid-ncname
representation I designed some time ago for just this purpose. See my monkey-patched implementation for what that looks like:The text was updated successfully, but these errors were encountered: