Fix/1369 custom resolver #1385

alexdutton · 2021-08-04T13:47:38Z

Proposed Changes

This adds a new Resolver framework, and updates the JSON-LD parser to use it.

As a result, implementors have much more control over which URLs referenced in JSON-LD documents will be resolved.

This PR implements a default-deny approach to URL resolution, which will need explaining in release notes and documentation. Implementors can revert to the old behaviour via environment variable if they are happy with the risks.

This also provides a base for future extensibility, e.g. caching or using locally-held copies of external resources.

No tests or documentation yet, but all existing tests pass when using the new PermissiveResolver.

This adds a new Resolver class which can be overridden with a call to `set_default_resolver()`. This is the start of resolving RDFLib#1369, and we'll add more functionality to the Resolver class in subsequent commits.

This adds individual methods for resolving file: and http[s]: URLs, and a mechanism to extend for additional schemes and behaviour in subclasses.

To maintain the existing behaviour of create_input_source, we add a `trust` parameter which bypasses the new `is_resolution_allowed()` check. Our default otherwise is to say no, with other behaviour currently configurable using `set_default_resolver()`. At the moment there is no code that uses resolvers directly, except for `create_input_source()`. Soon, we'll update the JSON-LD parser to use it directly (i.e. with `trust=False`), at which point the new resolution restrictions will take effect.

These provide, respectively: 1. a straightforward way to allow-list schemes and URLs by environment variable or instantiation parameter, as the default resolver. 2. a way to allow all resolution, which was the pre-6.0.x behaviour.

People deploying can now use e.g. `RDFLIB_DEFAULT_RESOLVER_CLASS=rdflib.resolver.PermissiveResolver` to change resolution behaviour without having to resort to code. It doesn't provide a way to specify instantiation arguments for the given resolver, but that's fine. Setting the resolver is now performed while holding a `threading.Lock()`. We find the default resolver lazily so to avoid any import loops where a custom resolver (necessarily) imports the `rdflib.resolver` module.

We don't need to return anything other than the InputSource. We ensure that the system ID is set, as that should contain the absolute_location. The previous return values can be reconstructed in the internal `_create_input_source_from_location` function.

We pass an explicit resolver around, defaulting to `get_default_resolver()` if not specified. `source_to_json()` has been removed as it uses `create_input_source()`, which trusts all locations passed to it. Instead, we parse json explicitly in the two places where it was called. Resolves RDFLib#1369.

nicholascar · 2021-08-28T06:50:33Z

Hi @alexdutton I am watching this PR with interest - I think it's a great idea to have "everything" available to RDFlib users, and, if scary, off by default - so look forward to you bringing it out of draft.

Just a couple of questions:

Has this work of yours been motivated by the JSON-LD parser & serializer now being in RDFlib core?
Do you have your own code like this in action already, just not in RDFlib?
Can you please add some notes to the top of the new file rdflib/resolver.py to explains what it's doing, perhaps something like rdflib/compare.py (though doesn't have to be as long!) This just helps with people and file browsing

alexdutton · 2021-09-14T11:34:38Z

Hi @nicholascar

I didn't realise the JSON-LD support had been moved into core when I first realised this was an issue, but still saw the fix as being in core as that's where the resolution behaviour was.
No, I don't have similar code elsewhere. This was precipitated by needing to implement a Linked Data Notifications inbox for the COAR Notify initiative, which involves receiving JSON-LD documents to a publicly accessible HTTP endpoint.
Yep, I'll do that now :-)

This is currently the main documentation for the resolver functionality.

Before this, there were two different parameter names across the `is_resolution_allowed` and `resolve_*` methods. They are now both called `url`, and are now explicitly both passed as `str` objects instead of `URIRef` objects.

nicholascar · 2021-11-24T13:49:50Z

@alexdutton this PR is still in draft. Are you planning on taking it out of draft soon so we can review it formally?

ramcaat · 2022-12-07T15:52:50Z

This is a critical vulnerability. Will this be fixed in 6.3 release? Do we have a date for it?

aucampia · 2022-12-11T19:42:00Z

This is a critical vulnerability. Will this be fixed in 6.3 release? Do we have a date for it?

The plan is to fix in in 6.3, but sadly I have very little time to work on this. I will be having some days off work later this month and I hope to finish it then.

alexdutton force-pushed the fix/1369-custom-resolver branch 2 times, most recently from 6ecbac9 to f511df9 Compare August 4, 2021 13:53

alexdutton added 9 commits August 4, 2021 15:07

Move _create_input_source_from_location into overrideable class

2d6f955

This adds a new Resolver class which can be overridden with a call to `set_default_resolver()`. This is the start of resolving RDFLib#1369, and we'll add more functionality to the Resolver class in subsequent commits.

Add two new exceptions for URL resolution errors

a3eaf9e

Factor out URL resolvers to provide extensibility/customisation

73920ea

This adds individual methods for resolving file: and http[s]: URLs, and a mechanism to extend for additional schemes and behaviour in subclasses.

Allow explicit resolvers to be passed to parsers.

90254ec

alexdutton force-pushed the fix/1369-custom-resolver branch from f511df9 to 9455370 Compare August 4, 2021 14:11

nicholascar requested review from nicholascar, ashleysommer and white-gecko and removed request for nicholascar August 28, 2021 06:39

alexdutton added 3 commits September 14, 2021 17:30

Add a module docstring to rdflib.resolver

c5737fe

This is currently the main documentation for the resolver functionality.

fixup! Add two new exceptions for URL resolution errors

4a513c8

Standardise location and url to a url str in resolvers

3cb4c4c

Before this, there were two different parameter names across the `is_resolution_allowed` and `resolve_*` methods. They are now both called `url`, and are now explicitly both passed as `str` objects instead of `URIRef` objects.

aucampia linked an issue Apr 18, 2022 that may be closed by this pull request

URLInputSource can be abused to retrieve arbitrary documents if used naïvely #1844

Closed

This was referenced Jun 7, 2022

RDFLib vulnerability CasperWA/turtle-canon#71

Closed

RDFLib vulnerability emmo-repo/CIF-ontology#109

Closed

RDFLib vulnerability emmo-repo/EMMOntoPy#424

Open

ghost mentioned this pull request Jun 20, 2022

rebase of Alex Dutton's Draft PR #1385 #1995

Closed

8 tasks

aucampia mentioned this pull request Jul 29, 2022

URLInputSource can be abused to retrieve arbitrary documents if used naïvely #1844

Closed

ramcaat mentioned this pull request Dec 5, 2022

BUILD: 2.3 morph-kgc/morph-kgc#116

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/1369 custom resolver #1385

Fix/1369 custom resolver #1385

alexdutton commented Aug 4, 2021

nicholascar commented Aug 28, 2021

alexdutton commented Sep 14, 2021

nicholascar commented Nov 24, 2021

ramcaat commented Dec 7, 2022

aucampia commented Dec 11, 2022

Fix/1369 custom resolver #1385

Are you sure you want to change the base?

Fix/1369 custom resolver #1385

Conversation

alexdutton commented Aug 4, 2021

Proposed Changes

nicholascar commented Aug 28, 2021

alexdutton commented Sep 14, 2021

nicholascar commented Nov 24, 2021

ramcaat commented Dec 7, 2022

aucampia commented Dec 11, 2022