blank nodes reuse memory addresses, causing problems for persistent stores #418

doriantaylor · 2020-07-11T20:19:22Z

Here's another good one: I created rdf-lmdb a while ago but only recently noticed, incidentally when I had a structure containing a lot of RDF::Lists, that blank node IDs got derived from memory addresses that would reliably repeat across runs on Linux (although only occasionally on macOS which is why it took me so long to notice). The net effect is that I would get bnode structures in persistent storage that were all snarled up together.

Now, I should probably go and make some kind of bnode mapping table in rdf-lmdb, but I'm also wondering if it would make sense to generalize such a mapping table, since it's a common thing. That, or at least have some way to override RDF::Node's ID generating function in the singleton class (which is what I ended up doing to solve my short-term problem).

Anyway, I think what I am proposing is a module that one can include into, for example, a persistent RDF::Repository, that will automatically map whatever bnodes get thrown at it to guaranteed-unique (does that count as skolemized?) identifiers that are passed along to the persistent store, and translated back on the way back out. This mapping would be kept around for the duration of the process. As for the guaranteed-unique identifiers, the answer is of course UUIDs (which indeed the current RDF::Node implementation is already capable of), though because they are shorter, I propose the using the uuid-ncname representation I designed some time ago for just this purpose. See my monkey-patched implementation for what that looks like:

[8] pry(main)> n = RDF::Node.new
=> #<RDF::Node:0x3e8(_:Enc7Jm1BRnK9y9jKSMCzBJ)>
[9] pry(main)> n.id
=> "Enc7Jm1BRnK9y9jKSMCzBJ"
[10] pry(main)> UUID::NCName.from_ncname n.id, version: 1
=> "9dcec99b-5051-49ca-9f72-f63292302cc1"

The text was updated successfully, but these errors were encountered:

gkellogg · 2020-07-11T21:14:46Z

Here's another good one: I created rdf-lmdb a while ago but only recently noticed, incidentally when I had a structure containing a lot of RDF::Lists, that blank node IDs got derived from memory addresses that would reliably repeat across runs on Linux (although only occasionally on macOS which is why it took me so long to notice).

Yes, indeed, the offending code is here:

rdf/lib/rdf/model/node.rb

Line 83 in 2394451

@id = (id || "g#{__id__.to_i.abs}").to_s.freeze

The net effect is that I would get bnode structures in persistent storage that were all snarled up together.

If you're simply serializing with the default IDs, then, yes, that could be a problem.

Now, I should probably go and make some kind of bnode mapping table in rdf-lmdb, but I'm also wondering if it would make sense to generalize such a mapping table, since it's a common thing. That, or at least have some way to override RDF::Node's ID generating function in the singleton class (which is what I ended up doing to solve my short-term problem).

It's common for parsers to use such a pattern to distinguish the blank nodes read from those created, even though they may use the same identifiers, they become different objects. For example:

https://github.com/ruby-rdf/rdf-turtle/blob/develop/lib/rdf/turtle/reader.rb#L229-L233

def bnode(value = nil)
  return RDF::Node.new unless value
  @bnode_cache ||= {}
  @bnode_cache[value.to_s] ||= RDF::Node.new(value)
end

Anyway, I think what I am proposing is a module that one can include into, for example, a persistent RDF::Repository, that will automatically map whatever bnodes get thrown at it to guaranteed-unique (does that count as skolemized?)

No, not really, skolemized BNodes are intended to be used in serializations, which are IRIs, that can be turned back into the original BNodes. It makes sense for persistent stores, where there is some row identifier that can be retrieved. They are also not generic, but specific to a given server instance. I don't think they're used too much, and screw up things like lists, which depend on their being BNodes and not URIs.

identifiers that are passed along to the persistent store, and translated back on the way back out.

... but yes, sort of, in the way you describe using them.

This mapping would be kept around for the duration of the process. As for the guaranteed-unique identifiers, the answer is of course UUIDs (which indeed the current RDF::Node implementation is already capable of), though because they are shorter, I propose the using the uuid-ncname representation I designed some time ago for just this purpose. See my monkey-patched implementation for what that looks like:

[8] pry(main)> n = RDF::Node.new
=> #<RDF::Node:0x3e8(_:Enc7Jm1BRnK9y9jKSMCzBJ)>
[9] pry(main)> n.id
=> "Enc7Jm1BRnK9y9jKSMCzBJ"
[10] pry(main)> UUID::NCName.from_ncname n.id, version: 1
=> "9dcec99b-5051-49ca-9f72-f63292302cc1"

Maybe it makes sense to just change the default naming scheme to do something like this, and get away from object identifiers altogether. Would that obviate the need for a separate module?

Otherwise, then sure, a module with defined behavior that could be included in a Repository/DataStore implementation might be useful.

doriantaylor · 2020-07-11T22:15:54Z

I thought about that, but the user can always define the bnode identifier to be whatever they want. If that gets passed through as-is to a persistent store then that can cause problems (e.g. _:genid123).

I'm just thinking it's a common pattern just like you highlighted from the Turtle reader; it would merit an include-able utility module that just did the translation to something guaranteed-unique if the node ID wasn't already detectably that way.

gkellogg · 2020-07-11T22:19:07Z

Please go ahead and do a PR for this, then. @no-reply may want to chime in on it.

gkellogg added the help wanted label Oct 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

blank nodes reuse memory addresses, causing problems for persistent stores #418

blank nodes reuse memory addresses, causing problems for persistent stores #418

doriantaylor commented Jul 11, 2020

gkellogg commented Jul 11, 2020

doriantaylor commented Jul 11, 2020

gkellogg commented Jul 11, 2020

blank nodes reuse memory addresses, causing problems for persistent stores #418

blank nodes reuse memory addresses, causing problems for persistent stores #418

Comments

doriantaylor commented Jul 11, 2020

gkellogg commented Jul 11, 2020

doriantaylor commented Jul 11, 2020

gkellogg commented Jul 11, 2020