Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More type hints for rdflib.graph and related #1853

Merged

Conversation

aucampia
Copy link
Member

@aucampia aucampia commented Apr 18, 2022

Summary of changes

This patch primarily adds more type hints for rdflib.graph, but also
adds type hints to some related modules in order to work with the new
type hints for rdflib.graph.

I'm mainly doing this as a baseline for adding type hints to
rdflib.store.

I have created type aliases to make it easier to type everything
cosnsitently and to make type hints easier easier to change in the
future. The type aliases are private however (i.e. _-prefixed) and
should be kept as such for now.

This patch only contains typing changes and does not change runtime
behaviour.

Broken off from #1850

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Checked that all tests and type checking passes.
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@aucampia aucampia changed the title More typing for rdflib.graph More type hints for rdflib.graph Apr 18, 2022
@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch from 16455d3 to 575d4df Compare April 18, 2022 21:02
@aucampia aucampia changed the title More type hints for rdflib.graph More type hints for rdflib.graph and related Apr 18, 2022
@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch from 575d4df to e8e90fb Compare April 18, 2022 21:08
@aucampia aucampia changed the title More type hints for rdflib.graph and related More type hints for rdflib.graph and related Apr 18, 2022
@aucampia aucampia marked this pull request as ready for review April 18, 2022 21:29
@aucampia
Copy link
Member Author

@ajnelson-nist please have a look if you can spare some time and feel comfortable doing a review.

@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch from e8e90fb to 71a1446 Compare April 19, 2022 07:17
@ajnelson-nist
Copy link
Contributor

It seems a shame the _TriplePatternType type isn't used in the one spot I most expected it to be used: The definition of Graph.triples. At least it did get used in the first @overload signature for Graph.triples.

I'm also surprised so many of the type references need to be quoted strings, but I'll trust this was an effect of some need of the type-checking system.

Overall, this looks like good work in the right direction. I'm happier with the distinct type names for the subject, predicate, and object positions.

@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch from b8161a2 to faf95a9 Compare April 19, 2022 19:02
@aucampia
Copy link
Member Author

It seems a shame the _TriplePatternType type isn't used in the one spot I most expected it to be used: The definition of Graph.triples. At least it did get used in the first @overload signature for Graph.triples.

I double checked now, and the second overload was actually wrong and I fixed it in faf95a9, overloads are now as follow:

rdflib/rdflib/graph.py

Lines 473 to 498 in faf95a9

@overload
def triples(
self,
triple: "_TriplePatternType",
) -> Generator["_TripleType", None, None]:
...
@overload
def triples(
self,
triple: Tuple[Optional["_SubjectType"], Path, Optional["_ObjectType"]],
) -> Generator[Tuple["_SubjectType", Path, "_ObjectType"], None, None]:
...
@overload
def triples(
self,
triple: Tuple[
Optional["_SubjectType"],
Union[None, Path, "_PredicateType"],
Optional["_ObjectType"],
],
) -> Generator[
Tuple["_SubjectType", Union["_PredicateType", Path], "_ObjectType"], None, None
]:
...

The third overload is identical to the implementation, and a catch all for cases where it receives an undifferentiated predicate which is needed for the implementations of Graph.{subjects,predicates,objects,...} which passes predicate of type Union[None, Path, "_PredicateType"].

I'm also surprised so many of the type references need to be quoted strings, but I'll trust this was an effect of some need of the type-checking system.

They don't really, but I had to move it in and out of if TYPE_CHECKING: while trying to figure out how to get sphinx to still correctly document types with aliases (see #1854) guards a couple of times, and I think I will likely have to do it in places in future again also so I figured it is best to just leave it in quotes as it is safer anyway. We can (should?) maybe look at switching to from __future__ import annotations in which case it is not needed, but to me it is about the same, this does not really affect users of RDFLib though.

Overall, this looks like good work in the right direction. I'm happier with the distinct type names for the subject, predicate, and object positions.

In part I went for this because I could not really decide what the best restrictions are for them. Currently it is:

rdflib/rdflib/graph.py

Lines 49 to 58 in 71a1446

_SubjectType = IdentifiedNode
_PredicateType = IdentifiedNode
_ObjectType = Identifier
_TripleType = Tuple["_SubjectType", "_PredicateType", "_ObjectType"]
_QuadType = Tuple["_SubjectType", "_PredicateType", "_ObjectType", "Graph"]
_TriplePatternType = Tuple[
Optional["_SubjectType"], Optional["_PredicateType"], Optional["_ObjectType"]
]
_GraphT = TypeVar("_GraphT", bound="Graph")

However, _PredicateType should technically be URIRef - as per the spec - but I think N3 actually does allow blank node predicates, and then _ObjectType should technically be Union[Literal, URIRef, BNode] - because variables should not be used in graphs I think - but this makes things needlessly fiddly also. Using type aliases makes it easy to change without having to go adjust each place.

Thanks for having a look.

@aucampia
Copy link
Member Author

@eggplants any feedback from you would also be appreciated.

@ghost
Copy link

ghost commented Apr 19, 2022

N3 goes well beyond the RDF spec:

def test_triple_types():
    from rdflib import _typing
    from rdflib.graph import QuotedGraph
    from rdflib.paths import SequencePath
    from rdflib.term import Variable

    # In Notation3, predicates can be one of a variety of Path subclasses

    g = Graph()

    g = g.parse(data='''
@prefix : <ex:> .

:a :p1 :c ; :p2 :f .
:c :p2 :e ; :p3 :g .
:g :p3 :h ; :p2 :j .
:h :p3 :a ; :p2 :g .

:q :px :q .

''', format='n3')

    e = Namespace('ex:')

    triple = (e.a, e.p1/e.p2, e.e)

    assert triple in g

    with pytest.raises(AssertionError):
        assert type(triple) == _typing._TripleType

    assert type(triple[1]) == SequencePath


    # In Notation3 subjects and objects can be QuotedGraphs

    g = Graph()
    g.parse(data="""
@prefix log: <http://www.w3.org/2000/10/swap/log#> .
:bob :hasmother :alice .
:john :hasmother :alice .

@forAll :x, :y, :z .
{ :x :hasmother :z . :y :hasmother :z } log:implies { :x :issibling :y } .
""", format="n3")

    triple = list(g.triples(
        (None, URIRef("http://www.w3.org/2000/10/swap/log#implies"), None)))[0]

    with pytest.raises(AssertionError):
        assert type(triple) == _typing._TripleType

    assert type(triple[0]) == QuotedGraph


    # In QuotedGraphs, subjects and objects can be Variable

    triple = list(sorted([t for t in triple[0]]))[0]

    with pytest.raises(AssertionError):
        assert type(triple) == _typing._TripleType

    assert type(triple[0]) == Variable

@aucampia
Copy link
Member Author

@gjhiggins good point, I was actually just noticing this when working on #1701

I think we have to expand the types as follow to accommodate N3:

 _SubjectType = Union[Identifier, "QuotedGraph"]
 _PredicateType = Union[Identifier, "QuotedGraph"] 
 _ObjectType = Union[Identifier, "QuotedGraph"] 

@aucampia
Copy link
Member Author

A bit of a pity that there is no abstract syntax spec for N3's graph model.

@aucampia
Copy link
Member Author

aucampia commented Apr 19, 2022

 _SubjectType = Union[Identifier, "QuotedGraph"]
 _PredicateType = Union[Identifier, "QuotedGraph"] 
 _ObjectType = Union[Identifier, "QuotedGraph"] 

To clarify why:

AFAIU, variables can be parts of graphs in N3, and Literals and BNodes can be any of subject, predicate or object. Possibly we could create a helper Protocol that exposes an interface which only allows RDF 1.1 compliant triples but as Graph can contain much more we have to keep the typing broad enough.

@aucampia
Copy link
Member Author

Marking this as draft, will be still be good to get some more input here but I will add some more tests for quotedgraph first and ensure that the typing actually accommodates this.

@aucampia aucampia marked this pull request as draft April 19, 2022 21:58
@eggplants
Copy link
Contributor

LTGM. I can't wait to get type annotations on all of the rdflib!
By the way, from __future__ import annotations is a good way to reduce the amount of description by eliminating the cumbersome imports and quotations related to types. Only recently have I begun to adopt it.

tips: I recommend pyupgrade to help you rewrite it.

@ghost
Copy link

ghost commented Apr 20, 2022

A bit of a pity that there is no abstract syntax spec for N3's graph model.

There sort of is. There's an EBNF expression in the more extensive description of Notation3 which references a downloadable ebnf file.

IIRC you mentioned the possibility of using the Lark parser, fwiw I took a pass by pymantic's use of the Lark parser (in which they feed lark with a string containing the EBNF expression) in the context of the EBNF descriptions provided in the RDF Star spec.

_ObjectType = Identifier

_TripleType = Tuple["_SubjectType", "_PredicateType", "_ObjectType"]
_QuadType = Tuple["_SubjectType", "_PredicateType", "_ObjectType", "Graph"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may have to be updated if @gjhiggins PR's come through!

@aucampia
Copy link
Member Author

I will republish this after #1858 is merged, likely the types will become:

_SubjectType = Node
_PredicateType = Node
_ObjectType = Node

This is mainly because N3 allows a much broader range of subjects, predicates and objects.

@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch 2 times, most recently from 4a7ea04 to 5cb10c6 Compare May 21, 2022 15:30
@aucampia aucampia marked this pull request as ready for review May 21, 2022 15:36
@aucampia
Copy link
Member Author

#1858 is merged and I updated this to support N3, the typing is now broader.

The highlight is likely this:

rdflib/rdflib/graph.py

Lines 43 to 63 in 5cb10c6

_SubjectType = Node
_PredicateType = Node
_ObjectType = Node
_TripleType = Tuple["_SubjectType", "_PredicateType", "_ObjectType"]
_QuadType = Tuple["_SubjectType", "_PredicateType", "_ObjectType", "Graph"]
_OptionalQuadType = Tuple[
"_SubjectType", "_PredicateType", "_ObjectType", Optional["Graph"]
]
_OptionalIdentifiedQuadType = Tuple[
"_SubjectType", "_PredicateType", "_ObjectType", Optional["Node"]
]
_TriplePatternType = Tuple[
Optional["_SubjectType"], Optional["_PredicateType"], Optional["_ObjectType"]
]
_QuadPatternType = Tuple[
Optional["_SubjectType"],
Optional["_PredicateType"],
Optional["_ObjectType"],
Optional["Graph"],
]

@aucampia aucampia requested a review from a team May 21, 2022 15:37
@aucampia aucampia added the review wanted This indicates that the PR is ready for review label May 21, 2022
@coveralls
Copy link

coveralls commented May 21, 2022

Coverage Status

Coverage increased (+0.007%) to 88.521% when pulling a79650d on aucampia:iwana-20220418T1811-more_graph_typing into 958b9a1 on RDFLib:master.

@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch from 5cb10c6 to 7d0da30 Compare May 21, 2022 21:33
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

aucampia added 3 commits May 25, 2022 20:22
This patch primarily adds more type hints for `rdflib.graph`, but also
adds type hints to some related modules in order to work with the new
type hints for `rdflib.graph`.

I'm mainly doing this as a baseline for adding type hints to
`rdflib.store`.

I have created type aliases to make it easier to type everything
cosnsitently and to make type hints easier easier to change in the
future. The type aliases are private however (i.e. `_`-prefixed) and
should be kept as such for now.

This patch only contains typing changes and does not change runtime
behaviour.

Broken off from RDFLib#1850
- Fixed second overload of `_TripleType` to return `Path` instead of
  `_PredicateType`.
Changed type hints based on more information discovered in mean time.
@aucampia aucampia force-pushed the iwana-20220418T1811-more_graph_typing branch from 7d0da30 to a79650d Compare May 25, 2022 18:22
@ajnelson-nist
Copy link
Contributor

#1858 is merged and I updated this to support N3, the typing is now broader.

The highlight is likely this:

rdflib/rdflib/graph.py

Lines 43 to 63 in 5cb10c6

_SubjectType = Node
_PredicateType = Node
_ObjectType = Node
_TripleType = Tuple["_SubjectType", "_PredicateType", "_ObjectType"]
_QuadType = Tuple["_SubjectType", "_PredicateType", "_ObjectType", "Graph"]
_OptionalQuadType = Tuple[
"_SubjectType", "_PredicateType", "_ObjectType", Optional["Graph"]
]
_OptionalIdentifiedQuadType = Tuple[
"_SubjectType", "_PredicateType", "_ObjectType", Optional["Node"]
]
_TriplePatternType = Tuple[
Optional["_SubjectType"], Optional["_PredicateType"], Optional["_ObjectType"]
]
_QuadPatternType = Tuple[
Optional["_SubjectType"],
Optional["_PredicateType"],
Optional["_ObjectType"],
Optional["Graph"],
]

I'm a little disappointed (I hope this doesn't come across as being stern! or upset! Just bummed.) with this line:

_SubjectType = Node

My original goal with nudging type review was to assist with catching thing-or-string errors, particularly easy to do through various context dictionary and/or typo scenarios in JSON-LD.

Moving _SubjectType to Node once again permits a Literal to be a subject. I appreciate rdflib's need to provide broad support, and N3 makes this requirement.

Meanwhile, I've thought about this remark earlier, and I'm not sure it will work:

AFAIU, variables can be parts of graphs in N3, and Literals and BNodes can be any of subject, predicate or object. Possibly we could create a helper Protocol that exposes an interface which only allows RDF 1.1 compliant triples but as Graph can contain much more we have to keep the typing broad enough.

I'm not so sure about the typing.Protocol strategy because I don't know how hard it would be to provide just alternative signatures for things like the Graph.triples() methods....but, I don't have a lot of experience implementing typing.Protocol.

However, I am aware that subclassing Graph to e.g. Graph11 (starter name for a Graph meeting RDF 1.1 requirements) and tightening the function signatures with e.g. Graph11.triples(s: IdentifiedNode, p: IdentifiedNode, o: Node) will fail mypy with a violation of the Liskov Substitution Principle. If Graph says the subject parameter can be any Node in the type signature, all Graph subclasses' type signatures also have to support any Node for the subject parameter.

So, there's potentially a significant (to the type system) downstream consequence to supporting Graph.triples(s: Node, p: Node, o: Node). I'm aware that assert isinstance() can implement runtime narrower type assurances in subclasses, but that's not the same as catching type errors with static review before code deployment.

One solution to this (caveat - given not too much thought) might be implementing everything but the iterators in an abstract base class _Graph, and requiring users to instantiate a subclass, GraphN3 or Graph11, both of which handle the appropriate iterators' type signatures. My reflex on writing that sentence is it feels like a significant intrusion of the type system on user experience. I hope I've overthought this but missed something that makes it a non-issue.

@aucampia
Copy link
Member Author

I'm not so sure about the typing.Protocol strategy because I don't know how hard it would be to provide just alternative signatures for things like the Graph.triples() methods....but, I don't have a lot of experience implementing typing.Protocol.

Thinking about it, I'm not so sure it will work either without an error, but it could still work using cast(...) - but I will have to test it.

However, I am aware that subclassing Graph to e.g. Graph11 (starter name for a Graph meeting RDF 1.1 requirements) and tightening the function signatures with e.g. Graph11.triples(s: IdentifiedNode, p: IdentifiedNode, o: Node) will fail mypy with a violation of the Liskov Substitution Principle.

On this, there are quite a lot of Liskov Substitution Principle violations already, we should not add more, but fixing them will take time, my primary goal with contributions at the moment is to get things into the best state we can have it with the current design and interface, and have as much validation as possible, both static and dynamic, and only then start re-architecting.

Your suggestion of having a more narrow/constained Graph base class makes a lot of sense, and we should try and head towards that so we can fix all the type violations. If we could move towards that while maintaining backwards compatibility it will be ideal.

There may also be some options with generics, something like this:

RDFGraph = Graph[IdentifiedNode, URIRef, Identifier]
N3Graph = Graph[Node, Node, Node]

Ideally in such cases the Graph should do some strict validation also, but I can't say I have the perfect answer as it is now.

In short term I think the easiest will be a Protocol that one can cast to, but even if I make this I will only keep it in rdflib._experimental as a start, because I'm not sure we want to commit ourselves to that route.

For now I have little choice but to go with it as is, but it also does not sit right with me, as this significantly reduces the level of static validation offered to users.

@aucampia aucampia merged commit 21e737d into RDFLib:master May 26, 2022
@ghost ghost removed the review wanted This indicates that the PR is ready for review label Jun 4, 2022
@aucampia aucampia deleted the iwana-20220418T1811-more_graph_typing branch June 4, 2022 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants