ARROW-10203: [Doc] Give guidance on big-endian support in the contributors docs

@kiszk @jacques-n @wesm @pitrou @BryanCutler @nealrichardson this captures my understanding of the mailing list conversation on endianness. Please let me know if I've mischaracterized anything (I'll do a proofreading/compiling round once the general points are agreed upon).

Closes apache#8374 from emkornfield/update_contributor_guidelines

Lead-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
3 people authored and kszucs committed Oct 19, 2020
1 parent 457935e commit 7944265
Showing 3 changed files with 50 additions and 18 deletions.
48 changes: 48 additions & 0 deletions docs/source/developers/contributing.rst
@@ -304,3 +304,51 @@ to your branch, which they sometimes do to help move a pull request along.
In addition, the GitHub PR "suggestion" feature can also add commits to
your branch, so it is possible that your local copy of your branch is missing
some additions.

Guidance for specific features
==============================

From time to time the community has discussions on specific types of features
and improvements that they expect to support. This section outlines decisions
that have been made in this regard.

Endianness
++++++++++

The Arrow format allows setting endianness. Due to the popularity of
little-endian architectures, most implementations assume little endian by
default. There has been some effort to support big-endian platforms as well.
Based on a `mailing-list discussion
<https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3cCAK7Z5T--HHhr9Dy43PYhD6m-XoU4qoGwQVLwZsG-kOxXjPTyZA@mail.gmail.com%3e>`__,
the requirements for a new platform are:

1. A robust (non-flaky, returning results in a reasonable time) Continuous
   Integration setup.
2. Benchmarks for performance critical parts of the code to demonstrate
   no regression.

Furthermore, for big-endian support, there are two levels that an
implementation can support:

1. Native endianness (all Arrow communication happens with processes of the
   same endianness). This includes ancillary functionality such as reading
   and writing various file formats, such as Parquet.
2. Cross endian support (implementations will do byte reordering when
   appropriate for :ref:`IPC <format-ipc>` and :ref:`Flight <flight-rpc>`
   messages).
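
As a rough illustration of the difference between the two levels, the
sketch below uses only the Python standard library. It is not how any Arrow
implementation does this; the function names ``needs_byte_swap`` and
``swap_int32_buffer`` are hypothetical, and a flat buffer of 32-bit integers
is assumed. A native-endianness implementation never needs the swap, while a
cross endian implementation reorders bytes when the sender's byte order
differs from the receiver's.

```python
import struct
import sys


def needs_byte_swap(sender_is_little_endian: bool) -> bool:
    """Return True when the sender's byte order differs from this machine's."""
    receiver_is_little_endian = sys.byteorder == "little"
    return sender_is_little_endian != receiver_is_little_endian


def swap_int32_buffer(buf: bytes) -> bytes:
    """Reverse the byte order of every 4-byte integer in ``buf``."""
    if len(buf) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    count = len(buf) // 4
    # Decode as little endian, then re-encode as big endian. Because the
    # operation is a pure byte reversal per value, it also converts
    # big-endian input back to little endian.
    values = struct.unpack(f"<{count}i", buf)
    return struct.pack(f">{count}i", *values)


# Hypothetical usage: a buffer written by a little-endian sender.
little_endian_buf = struct.pack("<3i", 1, 2, 3)
if needs_byte_swap(sender_is_little_endian=True):
    native_buf = swap_int32_buffer(little_endian_buf)
else:
    native_buf = little_endian_buf
# "=" asks struct for the machine's native byte order.
decoded = list(struct.unpack("=3i", native_buf))
```

Under these assumptions, ``decoded`` holds the original integers regardless
of the receiving machine's byte order.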

The decision on what level to support is based on maintainers' preferences for
complexity and technical risk. In general, all implementations should be open
to native endianness support (provided the CI and performance requirements
are met). Cross endianness support is a question for individual maintainers.

The current implementations aiming for cross endian support are:

1. C++

Implementations that do not intend to implement cross endian support:

1. Java

For other libraries, gather consensus on the mailing list before
submitting PRs.
2 changes: 2 additions & 0 deletions docs/source/format/Columnar.rst
@@ -787,6 +787,8 @@ layouts depending on the particular realization of the type.
We do not go into detail about the logical types definitions in this
document as we consider `Schema.fbs`_ to be authoritative.

.. _format-ipc:

Serialization and Interprocess Communication (IPC)
==================================================

18 changes: 0 additions & 18 deletions docs/source/python/ipc.rst
@@ -330,21 +330,3 @@ An object can be reconstructed from its component-based representation using
``deserialize_components`` is also available as a method on
``SerializationContext`` objects.

Serializing pandas Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The default serialization context has optimized handling of pandas
objects like ``DataFrame`` and ``Series``. Combined with component-based
serialization above, this enables zero-copy transport of pandas DataFrame
objects not containing any Python objects:

.. ipython:: python

   import pandas as pd
   df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
   context = pa.default_serialization_context()
   serialized_df = context.serialize(df)
   df_components = serialized_df.to_components()
   original_df = context.deserialize_components(df_components)
   original_df
