The following guides explain the fundamental data structures used in the Java implementation of Apache Arrow.
- ValueVector is an abstraction that is used to store a sequence of values having the same type in an individual column.
- VectorSchemaRoot is a container that can hold multiple vectors based on a schema.
- The Reading/Writing IPC formats guide explains how to stream record batches and how to serialize record batches to files.
Generated javadoc documentation is available here.
Refer to Building Apache Arrow for documentation of environment setup and build instructions.
Arrow uses Google's Flatbuffers to transport metadata. The Java version of the library
requires that the generated Flatbuffers classes be used only with the same version that
generated them. Arrow packages a version of the arrow-vector module that shades flatbuffers
and arrow-format into a single JAR. Using the classifier "shade-format-flatbuffers" in your
pom.xml will make use of this JAR; you can then exclude/resolve the original dependency to
a version of your choosing.
- Verify that your version of flatc matches the declared dependency:
$ flatc --version
flatc version 23.5.26
$ grep "dep.fbs.version" java/pom.xml
<dep.fbs.version>23.5.26</dep.fbs.version>
- Generate the flatbuffer java files by performing the following:
cd $ARROW_HOME
# remove the existing files
rm -rf java/format/src
# regenerate from the .fbs files
flatc --java -o java/format/src/main/java format/*.fbs
# prepend license header
find java/format/src -type f | while read -r file; do
  (cat header | while read -r line; do echo "// $line"; done; cat "$file") > "$file.tmp"
  mv "$file.tmp" "$file"
done
There are several system properties and environment variables that users can configure. These trade off safety (they turn off checking) for speed. Typically they are only used in production settings after the code has been thoroughly tested without using them.
- Bounds checking for memory accesses: Bounds checking is on by default. You can disable it by setting either the system property (arrow.enable_unsafe_memory_access) or the environment variable (ARROW_ENABLE_UNSAFE_MEMORY_ACCESS) to true. When both the system property and the environment variable are set, the system property takes precedence.
- Null checking for gets: ValueVector get methods (not getObject) by default verify that the slot is not null. You can disable it by setting either the system property (arrow.enable_null_check_for_get) or the environment variable (ARROW_ENABLE_NULL_CHECK_FOR_GET) to false. When both the system property and the environment variable are set, the system property takes precedence.
- For Java 9 or later, set -Dio.netty.tryReflectionSetAccessible=true. This fixes the java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available error thrown by Netty.
- To support duplicate fields in a StructVector, enable -Darrow.struct.conflict.policy=CONFLICT_APPEND. By default (CONFLICT_REPLACE), duplicate fields are not kept: a conflicting field is overwritten. To support different policies for conflicting or duplicate fields, set this JVM flag or use the corresponding static constructor methods for StructVectors.
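The precedence rule described for the two checking flags can be sketched in plain Java. This is only an illustration of the documented behavior, not Arrow's own code: the class `FlagResolutionSketch` and its methods are hypothetical names, and the exact string comparison (`"true"` / `"false"`, case-sensitive) is an assumption.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper illustrating the documented precedence; not part of Arrow's API. */
final class FlagResolutionSketch {
  private FlagResolutionSketch() {}

  /** Returns the system property value if it is set, otherwise the environment variable value (may be null). */
  static String lookup(Properties props, Map<String, String> env, String propName, String envName) {
    String fromProperty = props.getProperty(propName);
    return fromProperty != null ? fromProperty : env.get(envName);
  }

  /** Bounds checking is on by default and is turned off only when the flag resolves to "true" (assumed match). */
  static boolean boundsCheckingEnabled(Properties props, Map<String, String> env) {
    String value = lookup(props, env,
        "arrow.enable_unsafe_memory_access", "ARROW_ENABLE_UNSAFE_MEMORY_ACCESS");
    return !"true".equals(value);
  }

  /** Null checking for gets is on by default and is turned off only when the flag resolves to "false". */
  static boolean nullCheckForGetEnabled(Properties props, Map<String, String> env) {
    String value = lookup(props, env,
        "arrow.enable_null_check_for_get", "ARROW_ENABLE_NULL_CHECK_FOR_GET");
    return !"false".equals(value);
  }

  public static void main(String[] args) {
    // Inspect the settings of the current JVM and process environment.
    Properties props = System.getProperties();
    Map<String, String> env = System.getenv();
    System.out.println("bounds checking enabled: " + boundsCheckingEnabled(props, env));
    System.out.println("null checking for gets enabled: " + nullCheckForGetEnabled(props, env));
  }
}
```

Running the sketch with, for example, `java -Darrow.enable_unsafe_memory_access=true FlagResolutionSketch` reports bounds checking as disabled even if the environment variable says otherwise, because the property is consulted first.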
Arrow Java follows the Google style guide here with the following differences:
- Imports are grouped, from top to bottom, in this order: static imports, standard Java, org.*, com.*
- Line length can be up to 120 characters
- Operators for line wrapping are at end-of-line
- Naming rules for methods, parameters, etc. have been relaxed
- Disabled NoFinalizer, OverloadMethodsDeclarationOrder, and VariableDeclarationUsageDistance due to the existing code base. These rules should be followed when possible.
Refer to checkstyle.xml for rule specifics.
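As a rough illustration of two of the differences listed above (import grouping and end-of-line operators when wrapping), a small class might look like the following. The class `StyleSketch` is a hypothetical name, the blank lines between import groups are an assumption, and checkstyle.xml remains the authority on what is actually enforced.

```java
// Group 1: static imports.
import static java.util.Objects.requireNonNull;

// Group 2: standard Java imports.
import java.util.List;

// Groups 3 and 4 would follow here in this order: org.* imports, then com.* imports.

/** Hypothetical example class; not part of Arrow. */
final class StyleSketch {
  private StyleSketch() {}

  /** Builds a short summary string for a list of names. */
  static String describe(List<String> names) {
    requireNonNull(names, "names");
    // When an expression is wrapped, the operator stays at the end of the line.
    return "count=" + names.size() + ", " +
        "first=" + (names.isEmpty() ? "<none>" : names.get(0));
  }
}
```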
When running tests, Arrow Java uses the Logback logger with SLF4J. By default,
it uses the logback.xml present in the corresponding module's src/test/resources
directory, which has the default log level set to INFO.
Arrow Java can be built with an alternate logback configuration file using the
following command run in the project root directory:
mvn -Dlogback.configurationFile=file:<path-of-logback-file>
See Logback Configuration for more details.
Integration tests which require more time or more memory can be run by activating
the integration-tests profile. This activates the Maven Failsafe plugin, and any class
prefixed with IT will be run during the testing phase. The integration tests currently
require a larger amount of memory (>4GB) and time to complete. To activate the profile:
mvn -Pintegration-tests <rest of mvn arguments>