Merge pull request numpy#12174 from shoyer/nep-16-abstract-array
NEP 16 abstract arrays: rebased and marked as "Withdrawn"
Showing 3 changed files with 386 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,328 @@ | ||
============================================================= | ||
NEP 16 — An abstract base class for identifying "duck arrays" | ||
============================================================= | ||
|
||
:Author: Nathaniel J. Smith <[email protected]> | ||
:Status: Withdrawn | ||
:Type: Standards Track | ||
:Created: 2018-03-06 | ||
:Resolution: https://github.com/numpy/numpy/pull/12174 | ||
|
||
.. note:: | ||
|
||
This NEP has been withdrawn in favor of the protocol based approach | ||
described in | ||
`NEP 22 <http://www.numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html>`__. | ||
|
||
Abstract | ||
-------- | ||
|
||
We propose to add an abstract base class ``AbstractArray`` so that | ||
third-party classes can declare their ability to "quack like" an | ||
``ndarray``, and an ``asabstractarray`` function that performs | ||
similarly to ``asarray`` except that it passes through | ||
``AbstractArray`` instances unchanged. | ||
|
||
|
||
Detailed description | ||
-------------------- | ||
|
||
Many functions, in NumPy and in third-party packages, start with some | ||
code like:: | ||
|
||
    def myfunc(a, b): | ||
        a = np.asarray(a) | ||
        b = np.asarray(b) | ||
        ... | ||
|
||
This ensures that ``a`` and ``b`` are ``np.ndarray`` objects, so | ||
``myfunc`` can carry on assuming that they'll act like ndarrays both | ||
semantically (at the Python level), and also in terms of how they're | ||
stored in memory (at the C level). But many of these functions only | ||
work with arrays at the Python level, which means that they don't | ||
actually need ``ndarray`` objects *per se*: they could work just as | ||
well with any Python object that "quacks like" an ndarray, such as | ||
sparse arrays, dask's lazy arrays, or xarray's labeled arrays. | ||
|
||
However, currently, there's no way for these libraries to express that | ||
their objects can quack like an ndarray, and there's no way for | ||
functions like ``myfunc`` to express that they'd be happy with | ||
anything that quacks like an ndarray. The purpose of this NEP is to | ||
provide those two features. | ||
|
||
Sometimes people suggest using ``np.asanyarray`` for this purpose, but | ||
unfortunately its semantics are exactly backwards: it guarantees that | ||
the object it returns uses the same memory layout as an ``ndarray``, | ||
but tells you nothing at all about its semantics, which makes it | ||
essentially impossible to use safely in practice. Indeed, the two | ||
``ndarray`` subclasses distributed with NumPy – ``np.matrix`` and | ||
``np.ma.masked_array`` – do have incompatible semantics, and if they | ||
were passed to a function like ``myfunc`` that doesn't check for them | ||
as a special case, then it may silently return incorrect results. | ||
|
||
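To make this failure mode concrete, here is a minimal sketch. The helper name ``multiply_elementwise`` is invented for illustration; it uses ``np.asanyarray`` and assumes that ``*`` is elementwise. ``np.matrix`` passes through unchanged and quietly changes what ``*`` means:

```python
import warnings

import numpy as np


def multiply_elementwise(a, b):
    # Hypothetical helper, written assuming ndarray semantics for ``*``.
    a = np.asanyarray(a)
    b = np.asanyarray(b)
    return a * b


data = [[1, 2], [3, 4]]
with warnings.catch_warnings():
    # Recent NumPy releases emit a PendingDeprecationWarning for np.matrix.
    warnings.simplefilter("ignore", PendingDeprecationWarning)
    m = np.matrix(data)

from_ndarray = multiply_elementwise(np.array(data), np.array(data))
from_matrix = multiply_elementwise(m, m)

print(from_ndarray.tolist())  # elementwise product
print(from_matrix.tolist())   # matrix product, because np.matrix redefines *
```

Both calls succeed without any error or warning from ``multiply_elementwise`` itself, which is what makes the mismatch easy to miss.
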
|
||
Declaring that an object can quack like an array | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
There are two basic approaches we could use for checking whether an | ||
object quacks like an array. We could check for a special attribute on | ||
the class:: | ||
|
||
    def quacks_like_array(obj): | ||
        return bool(getattr(type(obj), "__quacks_like_array__", False)) | ||
|
||
Or, we could define an `abstract base class (ABC) | ||
<https://docs.python.org/3/library/collections.abc.html>`__:: | ||
|
||
    def quacks_like_array(obj): | ||
        return isinstance(obj, AbstractArray) | ||
|
||
If you look at how ABCs work, this is essentially equivalent to | ||
keeping a global set of types that have been declared to implement the | ||
``AbstractArray`` interface, and then checking it for membership. | ||
|
||
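The two approaches can be compared side by side in a small runnable sketch. ``AbstractArray`` here is a bare stand-in defined locally (the proposed ``np.AbstractArray`` does not exist), and the three example classes are hypothetical:

```python
import abc


class AbstractArray(abc.ABC):
    """Local stand-in for the proposed ``np.AbstractArray``."""


def quacks_by_attribute(obj):
    return bool(getattr(type(obj), "__quacks_like_array__", False))


def quacks_by_abc(obj):
    return isinstance(obj, AbstractArray)


class AttrDuck:
    # Opts in through a class attribute.
    __quacks_like_array__ = True


class InheritingDuck(AbstractArray):
    # Opts in by subclassing the ABC.
    pass


class RegisteredDuck:
    # Opts in without inheriting, via register() below.
    pass


AbstractArray.register(RegisteredDuck)

results = {
    "attribute": quacks_by_attribute(AttrDuck()),
    "inherit": quacks_by_abc(InheritingDuck()),
    "register": quacks_by_abc(RegisteredDuck()),
    "plain list": quacks_by_abc([1, 2, 3]),
}
print(results)
```
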
Between these, the ABC approach seems to have a number of advantages: | ||
|
||
* It's Python's standard, "one obvious way" of doing this. | ||
|
||
* ABCs can be introspected (e.g. ``help(np.AbstractArray)`` does | ||
something useful). | ||
|
||
* ABCs can provide useful mixin methods. | ||
|
||
* ABCs integrate with other features like mypy type-checking, | ||
``functools.singledispatch``, etc. | ||
|
||
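As one illustration of the last point, ``functools.singledispatch`` can dispatch on an ABC, including classes that were only registered with it. This is a sketch using a locally defined ``AbstractArray`` stand-in and a hypothetical ``SparseLike`` class:

```python
import abc
import functools


class AbstractArray(abc.ABC):
    """Local stand-in for the proposed ``np.AbstractArray``."""


class SparseLike:
    """Hypothetical third-party array type."""


AbstractArray.register(SparseLike)


@functools.singledispatch
def describe(obj):
    # Fallback for anything that has not declared itself a duck array.
    return "not a duck array"


@describe.register(AbstractArray)
def _describe_duck(obj):
    return "duck array: " + type(obj).__name__


labels = [describe(SparseLike()), describe(42)]
print(labels)
```

An attribute-based check would need separate glue code to get the same dispatch behaviour.
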
One obvious thing to check is whether this choice affects speed. Using | ||
the attached benchmark script on a CPython 3.7 prerelease (revision | ||
c4d77a661138d, self-compiled, no PGO), on a Thinkpad T450s running | ||
Linux, we find:: | ||
|
||
    np.asarray(ndarray_obj)      330 ns | ||
    np.asarray([])              1400 ns | ||
|
||
    Attribute check, success      80 ns | ||
    Attribute check, failure      80 ns | ||
|
||
    ABC, success via subclass    340 ns | ||
    ABC, success via register()  700 ns | ||
    ABC, failure                 370 ns | ||
|
||
Notes: | ||
|
||
* The first two lines are included to put the other lines in context. | ||
|
||
* This used 3.7 because both ``getattr`` and ABCs are receiving | ||
substantial optimizations in this release, and it's more | ||
representative of the long-term future of Python. (Failed | ||
``getattr`` doesn't necessarily construct an exception object | ||
anymore, and ABCs were reimplemented in C.) | ||
|
||
* The "success" lines refer to cases where ``quacks_like_array`` would | ||
return True. The "failure" lines are cases where it would return | ||
False. | ||
|
||
* The first measurement for ABCs is subclasses defined like:: | ||
|
||
      class MyArray(AbstractArray): | ||
          ... | ||
|
||
  The second is for subclasses defined like:: | ||
|
||
      class MyArray: | ||
          ... | ||
|
||
      AbstractArray.register(MyArray) | ||
|
||
I don't know why there's such a large difference between these. | ||
|
||
In practice, either way we'd only do the full test after first | ||
checking for well-known types like ``ndarray``, ``list``, etc. `This | ||
is how NumPy currently checks for other double-underscore attributes | ||
<https://github.com/numpy/numpy/blob/master/numpy/core/src/private/get_attr_string.h>`__ | ||
and the same idea applies here to either approach. So these numbers | ||
won't affect the common case, just the case where we actually have an | ||
``AbstractArray``, or else another third-party object that will end up | ||
going through ``__array__`` or ``__array_interface__`` or end up as an | ||
object array. | ||
|
||
So in summary, using an ABC will be slightly slower than using an | ||
attribute, but this doesn't affect the most common paths, and the | ||
magnitude of slowdown is fairly small (~250 ns on an operation that | ||
already takes longer than that). Furthermore, we can potentially | ||
optimize this further (e.g. by keeping a tiny LRU cache of types that | ||
are known to be AbstractArray subclasses, on the assumption that most | ||
code will only use one or two of these types at a time), and it's very | ||
unclear that this even matters – if the speed of ``asarray`` no-op | ||
pass-throughs were a bottleneck that showed up in profiles, then | ||
probably we would have made them faster already! (It would be trivial | ||
to fast-path this, but we don't.) | ||
|
||
Given the semantic and usability advantages of ABCs, this seems like | ||
an acceptable trade-off. | ||
|
||
.. | ||
   CPython 3.6 (from Debian):: | ||
       Attribute check, success      110 ns | ||
       Attribute check, failure      370 ns | ||
|
||
       ABC, success via subclass     690 ns | ||
       ABC, success via register()   690 ns | ||
       ABC, failure                 1220 ns | ||
|
||
|
||
Specification of ``asabstractarray`` | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Given ``AbstractArray``, the definition of ``asabstractarray`` is simple:: | ||
|
||
    def asabstractarray(a, dtype=None): | ||
        if isinstance(a, AbstractArray): | ||
            if dtype is not None and dtype != a.dtype: | ||
                return a.astype(dtype) | ||
            return a | ||
        return asarray(a, dtype=dtype) | ||
|
||
Things to note: | ||
|
||
* ``asarray`` also accepts an ``order=`` argument, but we don't | ||
include that here because it's about details of memory | ||
representation, and the whole point of this function is that you use | ||
it to declare that you don't care about details of memory | ||
representation. | ||
|
||
* Using the ``astype`` method allows the ``a`` object to decide how to | ||
implement casting for its particular type. | ||
|
||
* For strict compatibility with ``asarray``, we skip calling | ||
``astype`` when the dtype is already correct. Compare:: | ||
|
||
      >>> a = np.arange(10) | ||
|
||
      # astype() returns a new array by default: | ||
      >>> a.astype(a.dtype) is a | ||
      False | ||
|
||
      # asarray() returns the original object if possible: | ||
      >>> np.asarray(a, dtype=a.dtype) is a | ||
      True | ||
|
||
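The definition above can be exercised end to end with a toy duck array. In this sketch ``AbstractArray`` is a bare local ABC and ``WrappedArray`` is a hypothetical class that implements only what ``asabstractarray`` touches (``dtype`` and ``astype``):

```python
import abc

import numpy as np


class AbstractArray(abc.ABC):
    """Local stand-in for the proposed ``np.AbstractArray``."""


def asabstractarray(a, dtype=None):
    if isinstance(a, AbstractArray):
        if dtype is not None and dtype != a.dtype:
            return a.astype(dtype)
        return a
    return np.asarray(a, dtype=dtype)


class WrappedArray(AbstractArray):
    """Hypothetical duck array that wraps an ndarray."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def dtype(self):
        return self._data.dtype

    def astype(self, dtype):
        # The duck array decides for itself how casting works.
        return WrappedArray(self._data.astype(dtype))


duck = WrappedArray([1, 2, 3])
passed_through = asabstractarray(duck)
same_dtype = asabstractarray(duck, dtype=duck.dtype)
converted = asabstractarray(duck, dtype=np.float64)
from_list = asabstractarray([1, 2, 3])
print(passed_through is duck, same_dtype is duck, type(converted), type(from_list))
```

Duck arrays come back as the same object unless a different dtype is requested, and everything else falls through to ``np.asarray``.
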
|
||
What exactly are you promising if you inherit from ``AbstractArray``? | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
This will presumably be refined over time. The ideal of course is that | ||
your class should be indistinguishable from a real ``ndarray``, but | ||
nothing enforces that except the expectations of users. In practice, | ||
declaring that your class implements the ``AbstractArray`` interface | ||
simply means that it will start passing through ``asabstractarray``, | ||
and so by subclassing it you're saying that if some code works for | ||
``ndarray``\s but breaks for your class, then you're willing to accept | ||
bug reports on that. | ||
|
||
To start with, we should declare ``__array_ufunc__`` to be an abstract | ||
method, and add the ``NDArrayOperatorsMixin`` methods as mixin | ||
methods. | ||
|
||
Declaring ``astype`` as an ``@abstractmethod`` probably makes sense as | ||
well, since it's used by ``asabstractarray``. We might also want to go | ||
ahead and add some basic attributes like ``ndim``, ``shape``, | ||
``dtype``. | ||
|
||
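A minimal sketch of what such an ABC might look like, assuming it simply combines the existing ``numpy.lib.mixins.NDArrayOperatorsMixin`` with two abstract methods. The class bodies and the ``Wrapped`` and ``Incomplete`` examples are illustrative, not a proposed implementation:

```python
import abc

import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin


class AbstractArray(NDArrayOperatorsMixin, abc.ABC):
    """Hypothetical sketch of the proposed ABC."""

    @abc.abstractmethod
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        ...

    @abc.abstractmethod
    def astype(self, dtype):
        ...


class Wrapped(AbstractArray):
    """Hypothetical duck array that implements both abstract methods."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap our own instances, apply the ufunc, and re-wrap the result.
        raw = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*raw, **kwargs))

    def astype(self, dtype):
        return Wrapped(self.data.astype(dtype))


class Incomplete(AbstractArray):
    """Implements neither abstract method."""


total = Wrapped([1, 2, 3]) + 1  # ``+`` comes from the mixin
try:
    Incomplete()
    rejected = False
except TypeError:
    rejected = True
print(total.data.tolist(), rejected)
```

The arithmetic operator is supplied by the mixin and routed through ``__array_ufunc__``, while the incomplete class is rejected when it is instantiated.
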
Adding new abstract methods will be a bit tricky, because ABCs enforce | ||
these when a subclass is instantiated; therefore, simply adding a new | ||
``@abstractmethod`` will be a backwards compatibility break. If this | ||
becomes a problem then we can use some hacks to implement an | ||
``@upcoming_abstractmethod`` decorator that only issues a warning if the | ||
method is missing, and treat it like a regular deprecation cycle. (In | ||
this case, the thing we'd be deprecating is "support for abstract | ||
arrays that are missing feature X".) | ||
|
||
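One possible shape for such a decorator, sketched with ``__init_subclass__``. Everything here is hypothetical, including the marker attribute name and the warning text, and classes added via ``register()`` would not be checked by this approach:

```python
import abc
import warnings


def upcoming_abstractmethod(func):
    # Hypothetical decorator: marks a method that will become abstract later.
    func.__upcoming_abstract__ = True
    return func


class AbstractArray(abc.ABC):
    """Local stand-in for the proposed ``np.AbstractArray``."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for name, value in vars(AbstractArray).items():
            if not getattr(value, "__upcoming_abstract__", False):
                continue
            # The subclass still inherits the placeholder: warn, don't fail.
            if getattr(cls, name, None) is value:
                warnings.warn(
                    f"{cls.__name__} does not implement {name}(); "
                    "this will become an error in a future release",
                    DeprecationWarning,
                    stacklevel=2,
                )

    @upcoming_abstractmethod
    def astype(self, dtype):
        raise NotImplementedError


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class MissingAstype(AbstractArray):
        pass

    class HasAstype(AbstractArray):
        def astype(self, dtype):
            return self

messages = [str(w.message) for w in caught]
print(messages)
```

Only the subclass that lacks ``astype`` triggers a warning, and it can still be instantiated during the deprecation period.
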
|
||
Naming | ||
~~~~~~ | ||
|
||
The name of the ABC doesn't matter too much, because it will only be | ||
referenced rarely and in relatively specialized situations. The name | ||
of the function matters a lot, because most existing instances of | ||
``asarray`` should be replaced by this, and in the future it's what | ||
everyone should be reaching for by default unless they have a specific | ||
reason to use ``asarray`` instead. This suggests that its name really | ||
should be *shorter* and *more memorable* than ``asarray``... which | ||
is difficult. I've used ``asabstractarray`` in this draft, but I'm not | ||
really happy with it, because it's too long and people are unlikely to | ||
start using it by habit without endless exhortations. | ||
|
||
One option would be to actually change ``asarray``\'s semantics so | ||
that *it* passes through ``AbstractArray`` objects unchanged. But I'm | ||
worried that there may be a lot of code out there that calls | ||
``asarray`` and then passes the result into some C function that | ||
doesn't do any further type checking (because it knows that its caller | ||
has already used ``asarray``). If we allow ``asarray`` to return | ||
``AbstractArray`` objects, and then someone calls one of these C | ||
wrappers and passes it an ``AbstractArray`` object like a sparse | ||
array, then they'll get a segfault. Right now, in the same situation, | ||
``asarray`` will instead invoke the object's ``__array__`` method, or | ||
use the buffer interface to make a view, or pass through an array with | ||
object dtype, or raise an error, or similar. Probably none of these | ||
outcomes are actually desirable in most cases, so maybe making it a | ||
segfault instead would be OK? But it's dangerous given that we don't | ||
know how common such code is. OTOH, if we were starting from scratch | ||
then this would probably be the ideal solution. | ||
|
||
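Two of the current ``asarray`` behaviours listed above are easy to observe directly. The classes below are hypothetical; the ``copy`` parameter on ``__array__`` is included so the sketch also works with newer NumPy releases that pass it:

```python
import numpy as np


class HasArrayMethod:
    """Hypothetical object that converts itself via ``__array__``."""

    def __array__(self, dtype=None, copy=None):
        return np.arange(3, dtype=dtype)


class Opaque:
    """Hypothetical object with no array protocol at all."""


converted = np.asarray(HasArrayMethod())  # uses __array__
wrapped = np.asarray(Opaque())            # falls back to an object array
print(converted.tolist(), wrapped.dtype, wrapped.shape)
```

The second case shows how an unrecognised object ends up inside a zero-dimensional array with object dtype instead of raising an error.
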
We can't use ``asanyarray`` or ``array``, since those are already | ||
taken. | ||
|
||
Any other ideas? ``np.cast``, ``np.coerce``? | ||
|
||
|
||
Implementation | ||
-------------- | ||
|
||
1. Rename ``NDArrayOperatorsMixin`` to ``AbstractArray`` (leaving | ||
behind an alias for backwards compatibility) and make it an ABC. | ||
|
||
2. Add ``asabstractarray`` (or whatever we end up calling it), and | ||
probably a C API equivalent. | ||
|
||
3. Begin migrating NumPy internal functions to using | ||
``asabstractarray`` where appropriate. | ||
|
||
|
||
Backward compatibility | ||
---------------------- | ||
|
||
This is purely a new feature, so there are no compatibility issues. | ||
(Unless we decide to change the semantics of ``asarray`` itself.) | ||
|
||
|
||
Rejected alternatives | ||
--------------------- | ||
|
||
One suggestion that has come up is to define multiple abstract classes | ||
for different subsets of the array interface. Nothing in this proposal | ||
stops either NumPy or third parties from doing this in the future, but | ||
it's very difficult to guess ahead of time which subsets would be | ||
useful. Also, "the full ndarray interface" is something that existing | ||
libraries are written to expect (because they work with actual | ||
ndarrays) and test (because they test with actual ndarrays), so it's | ||
by far the easiest place to start. | ||
|
||
|
||
Links to discussion | ||
------------------- | ||
|
||
* https://mail.python.org/pipermail/numpy-discussion/2018-March/077767.html | ||
|
||
|
||
Appendix: Benchmark script | ||
-------------------------- | ||
|
||
.. literalinclude:: nep-0016-benchmark.py | ||
|
||
|
||
Copyright | ||
--------- | ||
|
||
This document has been placed in the public domain. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
import pyperf as perf  # the perf package is now published as pyperf | ||
import abc | ||
import numpy as np | ||
|
||
class NotArray: | ||
    pass | ||
|
||
class AttrArray: | ||
    __array_implementer__ = True | ||
|
||
class ArrayBase(abc.ABC): | ||
    pass | ||
|
||
class ABCArray1(ArrayBase): | ||
    pass | ||
|
||
class ABCArray2: | ||
    pass | ||
|
||
ArrayBase.register(ABCArray2) | ||
|
||
not_array = NotArray() | ||
attr_array = AttrArray() | ||
abc_array_1 = ABCArray1() | ||
abc_array_2 = ABCArray2() | ||
|
||
# Make sure ABC cache is primed | ||
isinstance(not_array, ArrayBase) | ||
isinstance(abc_array_1, ArrayBase) | ||
isinstance(abc_array_2, ArrayBase) | ||
|
||
runner = perf.Runner() | ||
def t(name, statement): | ||
    runner.timeit(name, statement, globals=globals()) | ||
|
||
t("np.asarray([])", "np.asarray([])") | ||
arrobj = np.array([]) | ||
t("np.asarray(arrobj)", "np.asarray(arrobj)") | ||
|
||
t("attr, False", | ||
  "getattr(not_array, '__array_implementer__', False)") | ||
t("attr, True", | ||
  "getattr(attr_array, '__array_implementer__', False)") | ||
|
||
t("ABC, False", "isinstance(not_array, ArrayBase)") | ||
t("ABC, True, via inheritance", "isinstance(abc_array_1, ArrayBase)") | ||
t("ABC, True, via register", "isinstance(abc_array_2, ArrayBase)") | ||
|