Merge pull request numpy#25347 from ngoldbaum/stringdtype
ngoldbaum authored Feb 7, 2024
2 parents 64775d6 + ced4336 commit cd0e99a
Showing 49 changed files with 10,423 additions and 569 deletions.
10 changes: 5 additions & 5 deletions doc/neps/nep-0055-string_dtype.rst
@@ -624,13 +624,13 @@ following struct public:
// The object representing a null value
PyObject *na_object;
// Flag indicating whether or not to coerce arbitrary objects to strings
-int coerce;
+char coerce;
// Flag indicating the na object is NaN-like
-int has_nan_na;
+char has_nan_na;
// Flag indicating the na object is a string
-int has_string_na;
+char has_string_na;
// If nonzero, indicates that this instance is owned by an array already
-int array_owned;
+char array_owned;
// The string data to use when a default string is needed
npy_static_string default_string;
// The name of the missing data object, if any
@@ -1003,7 +1003,7 @@ downstream uses of object string arrays that mutate array elements that we would
like to support.

Instead, we have opted to pair the ``npy_string_allocator`` instance attached to
-``StringDType`` instances with a ``PyThread_type_lock`` mutex. Any function in
+``PyArray_StringDType`` instances with a ``PyThread_type_lock`` mutex. Any function in
the static string C API that allows manipulating heap-allocated data accepts an
``allocator`` argument. To use the C API correctly, a thread must acquire the
allocator mutex before any usage of the ``allocator``.
9 changes: 9 additions & 0 deletions doc/release/upcoming_changes/25347.c_api.rst
@@ -0,0 +1,9 @@
* A C API for working with `numpy.dtypes.StringDType` arrays has been
exposed. This includes functions for acquiring and releasing mutexes locking
access to the string data as well as packing and unpacking UTF-8 bytestreams
from array entries.
* ``NPY_NTYPES`` is now version-dependent to accommodate the availability of a
new NumPy built-in DType. It is now defined in ``npy2_compat.h``.
* ``NPY_NTYPES_LEGACY`` is now defined in ``ndarraytypes.h``; use this value
if you do not wish to update code to support dtypes written using the new
DType API, which may not function in the same way as legacy dtypes.
9 changes: 9 additions & 0 deletions doc/release/upcoming_changes/25347.new_feature.rst
@@ -0,0 +1,9 @@
StringDType has been added to NumPy
-----------------------------------

We have added a new variable-width UTF-8 encoded string data type, implementing
a "NumPy array of python strings", including support for a user-provided missing
data sentinel. It is intended as a drop-in replacement for arrays of python
strings and missing data sentinels using the object dtype. See `NEP 55
<https://numpy.org/neps/nep-0055-string_dtype.html>`_ and :ref:`the
documentation <stringdtype>` for more details.
26 changes: 21 additions & 5 deletions doc/source/reference/c-api/dtype.rst
@@ -3,7 +3,7 @@ Data type API

.. sectionauthor:: Travis E. Oliphant

-The standard array can have 24 different data types (and has some
+The standard array can have 25 different data types (and has some
support for adding your own types). These data types all have an
enumerated type, an enumerated type-character, and a corresponding
array scalar Python type object (placed in a hierarchy). There are
@@ -27,7 +27,7 @@ Enumerated types

.. c:enum:: NPY_TYPES
-There is a list of enumerated types defined providing the basic 24
+There is a list of enumerated types defined providing the basic 25
data types plus some useful generic names. Whenever the code requires
a type number, one of these enumerated types is requested. The types
are all called ``NPY_{NAME}``:
@@ -139,14 +139,21 @@ Enumerated types

.. c:enumerator:: NPY_STRING
-The enumeration value for ASCII strings of a selectable size. The
-strings have a fixed maximum size within a given array.
+The enumeration value for null-padded byte strings of a selectable
+size. The strings have a fixed maximum size within a given array.

.. c:enumerator:: NPY_UNICODE
The enumeration value for UCS4 strings of a selectable size. The
strings have a fixed maximum size within a given array.

.. c:enumerator:: NPY_VSTRING
The enumeration value for UTF-8 variable-width strings. Note that this
dtype holds an array of references, with string data stored outside of
the array buffer. Use the C API for working with numpy variable-width
static strings to access the string data in each array entry.

.. c:enumerator:: NPY_OBJECT
The enumeration value for references to arbitrary Python objects.
@@ -188,6 +195,15 @@ Other useful related constants are
The total number of built-in NumPy types. The enumeration covers
the range from 0 to NPY_NTYPES-1.

.. c:macro:: NPY_NTYPES_LEGACY
The number of built-in NumPy types written using the legacy DType
system. New NumPy dtypes will be written using the new DType API and may not
function in the same manner as legacy DTypes. Use this macro if you want to
handle legacy DTypes using different code paths or if you do not want to
update code that uses ``NPY_NTYPES`` and does not work correctly with new
DTypes.

.. c:macro:: NPY_NOTYPE
A signal value guaranteed not to be a valid type enumeration number.
@@ -205,7 +221,7 @@ is ``NPY_{NAME}LTR`` where ``{NAME}`` can be
**UINT**, **LONG**, **ULONG**, **LONGLONG**, **ULONGLONG**,
**HALF**, **FLOAT**, **DOUBLE**, **LONGDOUBLE**, **CFLOAT**,
**CDOUBLE**, **CLONGDOUBLE**, **DATETIME**, **TIMEDELTA**,
-**OBJECT**, **STRING**, **VOID**
+**OBJECT**, **STRING**, **UNICODE**, **VSTRING**, **VOID**

**INTP**, **UINTP**

1 change: 1 addition & 0 deletions doc/source/user/basics.rst
@@ -15,5 +15,6 @@ fundamental NumPy ideas and philosophy.
basics.types
basics.broadcasting
basics.copies
+basics.strings
basics.rec
basics.ufuncs
241 changes: 241 additions & 0 deletions doc/source/user/basics.strings.rst
@@ -0,0 +1,241 @@
.. _basics.strings:

****************************************
Working with Arrays of Strings And Bytes
****************************************

While NumPy is primarily a numerical library, it is often convenient
to work with NumPy arrays of strings or bytes. The two most common
use cases are:

* Working with data loaded or memory-mapped from a data file,
where one or more of the fields in the data is a string or
bytestring, and the maximum length of the field is known
ahead of time. This is often used for a name or label field.
* Using NumPy indexing and broadcasting with arrays of Python
strings of unknown length, which may or may not have data
defined for every value.

For the first use case, NumPy provides the fixed-width `numpy.void`,
`numpy.str_` and `numpy.bytes_` data types. For the second use case,
NumPy provides `numpy.dtypes.StringDType`. Below we describe how to
work with both fixed-width and variable-width string arrays, how to
convert between the two representations, and offer some advice for
working most efficiently with string data in NumPy.

Fixed-width data types
======================

Before NumPy 2.0, the fixed-width `numpy.str_`, `numpy.bytes_`, and
`numpy.void` data types were the only types available for working
with strings and bytestrings in NumPy. For this reason, `numpy.str_`
and `numpy.bytes_` are used as the default dtypes for strings and
bytestrings, respectively:

>>> np.array(["hello", "world"])
array(['hello', 'world'], dtype='<U5')

Here the detected data type is ``'<U5'``, or little-endian unicode
string data, with a maximum length of 5 unicode code points.

Similarly for bytestrings:

>>> np.array([b"hello", b"world"])
array([b'hello', b'world'], dtype='|S5')

Since this is a one-byte encoding, the byteorder is ``'|'`` (not
applicable), and the detected data type is a bytestring with a maximum
length of 5 characters.

You can also use `numpy.void` to represent bytestrings:

>>> np.array([b"hello", b"world"]).astype(np.void)
array([b'\x68\x65\x6C\x6C\x6F', b'\x77\x6F\x72\x6C\x64'], dtype='|V5')

This is most useful when working with byte streams that are not well
represented as bytestrings, and instead are better thought of as
collections of 8-bit integers.

.. _stringdtype:

Variable-width strings
======================

.. versionadded:: 2.0

.. note::

`numpy.dtypes.StringDType` is a new addition to NumPy, implemented
using the new support in NumPy for flexible user-defined data
types and is not as extensively tested in production workflows as
the older NumPy data types.

Often, real-world string data does not have a predictable length. In
these cases it is awkward to use fixed-width strings, since storing
all the data without truncation requires knowing the length of the
longest string one would like to store in the array before the array
is created.
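
For example, with a fixed-width dtype, entries longer than the declared
width are truncated to fit (a small illustration using an explicitly
chosen width of 3 characters):

>>> np.array(["numpy", "arrays"], dtype="U3")
array(['num', 'arr'], dtype='<U3')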

To support situations like this, NumPy provides
`numpy.dtypes.StringDType`, which stores variable-width string data
in a UTF-8 encoding in a NumPy array:

>>> from numpy.dtypes import StringDType
>>> data = ["this is a longer string", "short string"]
>>> arr = np.array(data, dtype=StringDType())
>>> arr
array(['this is a longer string', 'short string'], dtype=StringDType())

Note that unlike fixed-width strings, ``StringDType`` is not parameterized by
the maximum length of an array element; arbitrarily long or short strings can
live in the same array without needing to reserve storage for padding bytes in
the short strings.

Also note that unlike fixed-width strings and most other NumPy data
types, ``StringDType`` does not store the string data in the "main"
``ndarray`` data buffer. Instead, the array buffer is used to store
metadata about where the string data are stored in memory. This
difference means that code expecting the array buffer to contain
string data will not function correctly, and will need to be updated
to support ``StringDType``.
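
One way to see this (a minimal sketch; the exact per-entry size is an
implementation detail and may differ between NumPy builds) is that the
element size reported for the array does not depend on how long the
stored strings are:

>>> short = np.array(["a"], dtype=StringDType())
>>> long = np.array(["a" * 100], dtype=StringDType())
>>> short.itemsize == long.itemsize
True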

Missing data support
--------------------

Often string datasets are not complete, and a special label is needed
to indicate that a value is missing. By default ``StringDType`` does
not have any special support for missing values, besides the fact
that empty strings are used to populate empty arrays:

>>> np.empty(3, dtype=StringDType())
array(['', '', ''], dtype=StringDType())

Optionally, you can create an instance of ``StringDType`` with support
for missing values by passing ``na_object`` as a keyword argument to
the initializer:

>>> dt = StringDType(na_object=None)
>>> arr = np.array(["this array has", None, "as an entry"], dtype=dt)
>>> arr
array(['this array has', None, 'as an entry'],
dtype=StringDType(na_object=None))
>>> arr[1] is None
True

The ``na_object`` can be any arbitrary Python object.
Common choices are `numpy.nan`, ``float('nan')``, ``None``, an object
specifically intended to represent missing data like ``pandas.NA``,
or a (hopefully) unique string like ``"__placeholder__"``.

NumPy has special handling for NaN-like sentinels and string
sentinels.

NaN-like Missing Data Sentinels
+++++++++++++++++++++++++++++++

A NaN-like sentinel returns itself as the result of arithmetic
operations. This includes the Python ``nan`` float and the Pandas
missing data sentinel ``pd.NA``. NaN-like sentinels inherit these
behaviors in string operations. This means that, for example, the
result of addition with any other string is the sentinel:

>>> dt = StringDType(na_object=np.nan)
>>> arr = np.array(["hello", np.nan, "world"], dtype=dt)
>>> arr + arr
array(['hellohello', nan, 'worldworld'], dtype=StringDType(na_object=nan))

Following the behavior of ``nan`` in float arrays, NaN-like sentinels
sort to the end of the array:

>>> np.sort(arr)
array(['hello', 'world', nan], dtype=StringDType(na_object=nan))

String Missing Data Sentinels
+++++++++++++++++++++++++++++

A string missing data value is an instance of ``str`` or a subtype of ``str``. If
such an array is passed to a string operation or a cast, "missing" entries are
treated as if they have a value given by the string sentinel. Comparison
operations similarly use the sentinel value directly for missing entries.
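
For example (a sketch; the sentinel string ``"N/A"`` is an arbitrary
choice and the displayed dtype repr may differ slightly between NumPy
versions):

>>> dt = StringDType(na_object="N/A")
>>> arr = np.array(["hello", "N/A", "world"], dtype=dt)
>>> np.sort(arr)  # the sentinel is compared as an ordinary string
array(['N/A', 'hello', 'world'], dtype=StringDType(na_object='N/A'))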

Other Sentinels
+++++++++++++++

Other objects, such as ``None``, are also supported as missing data
sentinels. If any missing data are present in an array using such a
sentinel, then string operations will raise an error:

>>> dt = StringDType(na_object=None)
>>> arr = np.array(["this array has", None, "as an entry"], dtype=dt)
>>> np.sort(arr)
Traceback (most recent call last):
...
TypeError: '<' not supported between instances of 'NoneType' and 'str'

Coercing Non-strings
--------------------

By default, non-string data are coerced to strings:

>>> np.array([1, object(), 3.4], dtype=StringDType())
array(['1', '<object object at 0x7faa2497dde0>', '3.4'], dtype=StringDType())

If this behavior is not desired, an instance of the DType can be created that
disables string coercion by setting ``coerce=False`` in the initializer:

>>> np.array([1, object(), 3.4], dtype=StringDType(coerce=False))
Traceback (most recent call last):
...
ValueError: StringDType only allows string data when string coercion is disabled.

This allows strict data validation in the same pass over the data NumPy uses to
create the array. Setting ``coerce=True`` recovers the default behavior allowing
coercion to strings.

Casting To and From Fixed-Width Strings
---------------------------------------

``StringDType`` supports round-trip casts between `numpy.str_`,
`numpy.bytes_`, and `numpy.void`. Casting to a fixed-width string is
most useful when strings need to be memory-mapped in an ndarray or
when a fixed-width string is needed for reading and writing to a
columnar data format with a known maximum string length.

In all cases, casting to a fixed-width string requires specifying the
maximum allowed string length:

>>> arr = np.array(["hello", "world"], dtype=StringDType())
>>> arr.astype(np.str_) # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
TypeError: Casting from StringDType to a fixed-width dtype with an
unspecified size is not currently supported, specify an explicit
size for the output dtype instead.

The above exception was the direct cause of the following
exception:

TypeError: cannot cast dtype StringDType() to <class 'numpy.dtypes.StrDType'>.
>>> arr.astype("U5")
array(['hello', 'world'], dtype='<U5')

The `numpy.bytes_` cast is most useful for string data that is known
to contain only ASCII characters, as characters outside this range
cannot be represented in a single byte in the UTF-8 encoding and are
rejected.
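
For example, ASCII-only data can be converted to a fixed-width
bytestring dtype (a sketch; the size ``"S5"`` is chosen here to fit the
longest entry):

>>> arr = np.array(["hello", "world"], dtype=StringDType())
>>> arr.astype("S5")
array([b'hello', b'world'], dtype='|S5')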

Any valid unicode string can be cast to `numpy.str_`, although
since `numpy.str_` uses a 32-bit UCS4 encoding for all characters,
this will often waste memory for real-world textual data that can be
well-represented by a more memory-efficient encoding.
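
As a rough illustration of the memory cost (a sketch; the fixed-width
result always reserves four bytes per character, regardless of the text):

>>> arr = np.array(["hello world"], dtype=StringDType())
>>> arr.astype("U11").itemsize  # 11 characters * 4 bytes per UCS4 code point
44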

Additionally, any valid unicode string can be cast to `numpy.void`,
storing the UTF-8 bytes directly in the output array:

>>> arr = np.array(["hello", "world"], dtype=StringDType())
>>> arr.astype("V5")
array([b'\x68\x65\x6C\x6C\x6F', b'\x77\x6F\x72\x6C\x64'], dtype='|V5')

Care must be taken to ensure that the output array has enough space
for the UTF-8 bytes in the string, since the size of a UTF-8
bytestream in bytes is not necessarily the same as the number of
characters in the string.
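
A quick way to see the difference using only standard Python (the string
"café" is four characters long but occupies five bytes in UTF-8):

>>> s = "café"
>>> len(s), len(s.encode("utf-8"))
(4, 5)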
