ARROW-6637: [C++] Further streamline default build, add ARROW_CSV CMake option

* Default for ARROW_COMPUTE/DATASET/FILESYSTEM/JSON set to OFF
* Add ARROW_CSV option, set to OFF by default (I could be swayed about this, but I think it is good to not create the perception that users are forced to build a module they don't need)
* Disable unit tests that don't build if ARROW_COMPUTE=OFF
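
With these defaults, a build that needs the optional modules has to request them explicitly. A minimal sketch of that, assuming an out-of-source build directory next to `cpp/` (the `../cpp` path and the `ARROW_MODULES` variable are illustrative and not part of this patch); it only prints the `cmake` command rather than running it:

```shell
#!/bin/sh
# Hypothetical helper: assemble -D flags for modules that are now OFF by default.
# ARROW_MODULES is an illustrative name, not an option defined by Arrow.
ARROW_MODULES="COMPUTE CSV DATASET FILESYSTEM JSON"

flags=""
for module in $ARROW_MODULES; do
  flags="$flags -DARROW_${module}=ON"
done

# Print the configure command instead of executing it (../cpp is an assumed path).
cmd="cmake ../cpp$flags"
echo "$cmd"
```

Dropping a name from `ARROW_MODULES` leaves that module at its new default of OFF.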

The minimal Docker build now outputs:

https://gist.github.com/wesm/ff70f46f5bc256d6b1d7b979aab8e5d0

The minimal build time on my 8-core laptop is reduced from 72 seconds to 25 seconds. Here is the log for the default build (with jemalloc turned off) without this patch:

https://gist.github.com/wesm/fb5300a1989420158b0148e1c49ab6d6

Note that the default for ARROW_JEMALLOC is left ON. Based on other discussion, I agree that turning it off by default might lead developers doing performance testing to draw the wrong conclusions, because the system allocator performs worse.

TODO

- [ ] Check conda
- [ ] Check Homebrew
- [ ] Check wheels
- [ ] Check Linux packages

Closes apache#5890 from wesm/more-minimal-default-build and squashes the following commits:

3fde8c4 <Wes McKinney> Do not pass ARROW_HDFS twice in ci/PKGBUILD
b0de549 <Wes McKinney> Fix PKGBUILD, add documentation to developers/cpp.rst, fix manylinux wheel READMEs
99640e1 <Wes McKinney> Many modules are implied by ARROW_PYTHON=ON. Requires HDFS when building Python
93b8bca <Wes McKinney> Explicitly build optional modules in MSVC Appveyor script
439af40 <Wes McKinney> Enable various modules when ARROW_PYTHON=ON
e3058a8 <Wes McKinney> ARROW_COMPUTE required for arrow_python and parquet
48c16f0 <Wes McKinney> Build uriparser unconditionally as not worth the trouble to conditionally disable
cd381aa <Wes McKinney> Provide for blanket env variable for minimal build script
9cbeabf <Wes McKinney> More minimal default options, add ARROW_CSV CMake option
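
Several of the commits above make one option imply others. The following is a rough shell restatement of the implication rules this patch puts in `cpp/CMakeLists.txt`, for illustration only: `resolve_options` is a hypothetical helper, CMake itself applies these rules at configure time, and the `${VAR:-OFF}` defaults assume nothing else has set these variables.

```shell
#!/bin/sh
# Hypothetical sketch: report which optional modules end up enabled.
# The body runs in a subshell, so the assignments do not leak to the caller.
resolve_options() (
  compute=${ARROW_COMPUTE:-OFF}
  csv=${ARROW_CSV:-OFF}
  dataset=${ARROW_DATASET:-OFF}
  filesystem=${ARROW_FILESYSTEM:-OFF}
  hdfs=${ARROW_HDFS:-OFF}
  json=${ARROW_JSON:-OFF}

  # Benchmarks, tests and integration builds turn on the JSON module.
  if [ "${ARROW_BUILD_BENCHMARKS:-OFF}" = ON ] ||
     [ "${ARROW_BUILD_TESTS:-OFF}" = ON ] ||
     [ "${ARROW_BUILD_INTEGRATION:-OFF}" = ON ]; then
    json=ON
  fi

  # The Dataset API implies the Filesystem API.
  if [ "$dataset" = ON ]; then filesystem=ON; fi

  # Parquet requires the compute module.
  if [ "${ARROW_PARQUET:-OFF}" = ON ]; then compute=ON; fi

  # ARROW_PYTHON=ON enables every module listed here.
  if [ "${ARROW_PYTHON:-OFF}" = ON ]; then
    compute=ON; csv=ON; dataset=ON; filesystem=ON; hdfs=ON; json=ON
  fi

  echo "COMPUTE=$compute CSV=$csv DATASET=$dataset FILESYSTEM=$filesystem HDFS=$hdfs JSON=$json"
)

# Usage: set the options in a subshell, then resolve them.
(ARROW_PARQUET=ON; resolve_options)
# prints: COMPUTE=ON CSV=OFF DATASET=OFF FILESYSTEM=OFF HDFS=OFF JSON=OFF
```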

Authored-by: Wes McKinney <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
wesm authored and kou committed Dec 5, 2019
1 parent 10ba282 commit d6caca3
Showing 11 changed files with 64 additions and 30 deletions.
7 changes: 6 additions & 1 deletion ci/PKGBUILD
@@ -28,7 +28,7 @@ depends=("${MINGW_PACKAGE_PREFIX}-boost"
"${MINGW_PACKAGE_PREFIX}-thrift"
"${MINGW_PACKAGE_PREFIX}-snappy"
"${MINGW_PACKAGE_PREFIX}-zlib"
"${MINGW_PACKAGE_PREFIX}-lz4"
"${MINGW_PACKAGE_PREFIX}-lz4"
"${MINGW_PACKAGE_PREFIX}-zstd")
makedepends=("${MINGW_PACKAGE_PREFIX}-cmake"
"${MINGW_PACKAGE_PREFIX}-gcc")
@@ -80,6 +80,11 @@ build() {
-DCMAKE_BUILD_TYPE=${cmake_build_type} \
-DARROW_BUILD_STATIC=ON \
-DARROW_BUILD_SHARED=OFF \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_HDFS=OFF \
-DARROW_BOOST_USE_SHARED=OFF \
2 changes: 2 additions & 0 deletions ci/scripts/cpp_build.sh
@@ -50,6 +50,7 @@ cmake -G "${CMAKE_GENERATOR:-Ninja}" \
-DARROW_BUILD_TESTS=${ARROW_BUILD_TESTS:-OFF} \
-DARROW_BUILD_UTILITIES=${ARROW_BUILD_UTILITIES:-ON} \
-DARROW_COMPUTE=${ARROW_COMPUTE:-ON} \
-DARROW_CSV=${ARROW_CSV:-ON} \
-DARROW_CUDA=${ARROW_CUDA:-OFF} \
-DARROW_CXXFLAGS=${ARROW_CXXFLAGS:-} \
-DARROW_DATASET=${ARROW_DATASET:-ON} \
@@ -60,6 +61,7 @@ cmake -G "${CMAKE_GENERATOR:-Ninja}" \
-DARROW_FUZZING=${ARROW_FUZZING:-OFF} \
-DARROW_FUZZING=${ARROW_FUZZING:-OFF} \
-DARROW_JNI=${ARROW_JNI:-OFF} \
-DARROW_JSON=${ARROW_JSON:-ON} \
-DARROW_GANDIVA_JAVA=${ARROW_GANDIVA_JAVA:-OFF} \
-DARROW_GANDIVA_PC_CXX_FLAGS=${ARROW_GANDIVA_PC_CXX_FLAGS:-} \
-DARROW_GANDIVA=${ARROW_GANDIVA:-OFF} \
15 changes: 14 additions & 1 deletion cpp/CMakeLists.txt
@@ -266,7 +266,7 @@ endif(UNIX)
# Set up various options
#

if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
if(ARROW_BUILD_BENCHMARKS OR ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
set(ARROW_JSON ON)
endif()

@@ -278,6 +278,19 @@ if(ARROW_DATASET)
set(ARROW_FILESYSTEM ON)
endif()

if(ARROW_PARQUET)
set(ARROW_COMPUTE ON)
endif()

if(ARROW_PYTHON)
set(ARROW_COMPUTE ON)
set(ARROW_CSV ON)
set(ARROW_DATASET ON)
set(ARROW_FILESYSTEM ON)
set(ARROW_HDFS ON)
set(ARROW_JSON ON)
endif()

if(MSVC)
# ORC doesn't build on windows
set(ARROW_ORC OFF)
12 changes: 7 additions & 5 deletions cpp/cmake_modules/DefineOptions.cmake
@@ -141,20 +141,22 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")

define_option(ARROW_BUILD_UTILITIES "Build Arrow commandline utilities" OFF)

define_option(ARROW_COMPUTE "Build the Arrow Compute Modules" ON)
define_option(ARROW_COMPUTE "Build the Arrow Compute Modules" OFF)

define_option(ARROW_CSV "Build the Arrow CSV Parser Module" OFF)

define_option(ARROW_CUDA "Build the Arrow CUDA extensions (requires CUDA toolkit)" OFF)

define_option(ARROW_DATASET "Build the Arrow Dataset Modules" ON)
define_option(ARROW_DATASET "Build the Arrow Dataset Modules" OFF)

define_option(ARROW_FILESYSTEM "Build the Arrow Filesystem Layer" ON)
define_option(ARROW_FILESYSTEM "Build the Arrow Filesystem Layer" OFF)

define_option(ARROW_FLIGHT
"Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)" OFF)

define_option(ARROW_GANDIVA "Build the Gandiva libraries" OFF)

define_option(ARROW_HDFS "Build the Arrow HDFS bridge" ON)
define_option(ARROW_HDFS "Build the Arrow HDFS bridge" OFF)

define_option(ARROW_HIVESERVER2 "Build the HiveServer2 client and Arrow adapter" OFF)

@@ -164,7 +166,7 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")

define_option(ARROW_JNI "Build the Arrow JNI lib" OFF)

define_option(ARROW_JSON "Build Arrow with JSON support (requires RapidJSON)" ON)
define_option(ARROW_JSON "Build Arrow with JSON support (requires RapidJSON)" OFF)

define_option(ARROW_MIMALLOC "Build the Arrow mimalloc-based allocator" OFF)

10 changes: 1 addition & 9 deletions cpp/examples/minimal_build/build.sh
@@ -27,15 +27,7 @@ NPROC=$(nproc)
mkdir $BUILD_DIR
pushd $BUILD_DIR

cmake /arrow/cpp \
-DARROW_COMPUTE=OFF \
-DARROW_DATASET=OFF \
-DARROW_FILESYSTEM=OFF \
-DARROW_HDFS=OFF \
-DARROW_JEMALLOC=OFF \
-DARROW_JSON=OFF \
-DARROW_USE_GLOG=OFF \
-DARROW_BUILD_UTILITIES=OFF
cmake /arrow/cpp -DARROW_JEMALLOC=OFF $ARROW_CMAKE_OPTIONS

make -j$NPROC
make install
24 changes: 16 additions & 8 deletions cpp/src/arrow/CMakeLists.txt
@@ -109,12 +109,6 @@ set(ARROW_SRCS
tensor.cc
type.cc
visitor.cc
csv/converter.cc
csv/chunker.cc
csv/column_builder.cc
csv/options.cc
csv/parser.cc
csv/reader.cc
io/buffered.cc
io/compressed.cc
io/file.cc
@@ -253,11 +247,21 @@ add_subdirectory(testing)
#

add_subdirectory(array)
add_subdirectory(csv)
add_subdirectory(io)
add_subdirectory(util)
add_subdirectory(vendored)

if(ARROW_CSV)
list(APPEND ARROW_SRCS
csv/converter.cc
csv/chunker.cc
csv/column_builder.cc
csv/options.cc
csv/parser.cc
csv/reader.cc)
add_subdirectory(csv)
endif()

if(ARROW_COMPUTE)
add_subdirectory(compute)
list(APPEND ARROW_SRCS
@@ -489,12 +493,16 @@ add_arrow_test(public_api_test)
add_arrow_test(result_test)
add_arrow_test(scalar_test)
add_arrow_test(status_test)
add_arrow_test(stl_test)
add_arrow_test(type_test)
add_arrow_test(table_test)
add_arrow_test(table_builder_test)
add_arrow_test(tensor_test)
add_arrow_test(sparse_tensor_test)

if(ARROW_COMPUTE)
# This unit test uses compute code
add_arrow_test(stl_test)
endif()

add_arrow_benchmark(builder_benchmark)
add_arrow_benchmark(type_benchmark)
6 changes: 5 additions & 1 deletion cpp/src/arrow/array/CMakeLists.txt
@@ -16,7 +16,11 @@
# under the License.

add_arrow_test(concatenate_test)
add_arrow_test(diff_test)

if(ARROW_COMPUTE)
# This unit test uses compute code
add_arrow_test(diff_test)
endif()

# Headers: top level
arrow_install_all_headers("arrow/array")
1 change: 1 addition & 0 deletions dev/tasks/python-wheels/osx-build.sh
@@ -125,6 +125,7 @@ function build_wheel {
-DARROW_DEPENDENCY_SOURCE=BUNDLED \
-DARROW_FLIGHT=ON \
-DARROW_GANDIVA=${BUILD_ARROW_GANDIVA} \
-DARROW_BOOST_USE_SHARED=ON \
-DARROW_JEMALLOC=ON \
-DARROW_ORC=OFF \
-DARROW_PARQUET=ON \
13 changes: 10 additions & 3 deletions docs/source/developers/cpp.rst
@@ -152,9 +152,14 @@ By default, the C++ build system creates a fairly minimal build. We have
several optional system components which you can opt into building by passing
boolean flags to ``cmake``.

* ``-DARROW_COMPUTE=ON``: Computational kernel functions and other support
* ``-DARROW_CSV=ON``: CSV reader module
* ``-DARROW_CUDA=ON``: CUDA integration for GPU development. Depends on NVIDIA
CUDA toolkit. The CUDA toolchain used to build the library can be customized
by using the ``$CUDA_HOME`` environment variable.
* ``-DARROW_DATASET=ON``: Dataset API, implies the Filesystem API
* ``-DARROW_FILESYSTEM=ON``: Filesystem API for accessing local and remote
filesystems
* ``-DARROW_FLIGHT=ON``: Arrow Flight RPC system, which depends at least on
gRPC
* ``-DARROW_GANDIVA=ON``: Gandiva expression compiler, depends on LLVM,
@@ -163,14 +168,17 @@ boolean flags to ``cmake``.
* ``-DARROW_HDFS=ON``: Arrow integration with libhdfs for accessing the Hadoop
Filesystem
* ``-DARROW_HIVESERVER2=ON``: Client library for HiveServer2 database protocol
* ``-DARROW_JSON=ON``: JSON reader module
* ``-DARROW_ORC=ON``: Arrow integration with Apache ORC
* ``-DARROW_PARQUET=ON``: Apache Parquet libraries and Arrow integration
* ``-DARROW_PLASMA=ON``: Plasma Shared Memory Object Store
* ``-DARROW_PLASMA_JAVA_CLIENT=ON``: Build Java client for Plasma
* ``-DARROW_PYTHON=ON``: Arrow Python C++ integration library (required for
building pyarrow). This library must be built against the same Python version
for which you are building pyarrow, e.g. Python 2.7 or Python 3.6. NumPy must
also be installed.
for which you are building pyarrow. NumPy must also be installed. Enabling
this option also enables ``ARROW_COMPUTE``, ``ARROW_CSV``, ``ARROW_DATASET``,
``ARROW_FILESYSTEM``, ``ARROW_HDFS``, and ``ARROW_JSON``.
* ``-DARROW_S3=ON``: Support for Amazon S3-compatible filesystems
* ``-DARROW_WITH_BZ2=ON``: Build support for BZ2 compression
* ``-DARROW_WITH_ZLIB=ON``: Build support for zlib (gzip) compression
* ``-DARROW_WITH_LZ4=ON``: Build support for lz4 compression
@@ -181,7 +189,6 @@ boolean flags to ``cmake``.
Some features of the core Arrow shared library can be switched off for improved
build times if they are not required for your application:

* ``-DARROW_COMPUTE=ON``: build the in-memory analytics module
* ``-DARROW_IPC=ON``: build the IPC extensions

CMake version requirements
2 changes: 1 addition & 1 deletion python/manylinux1/README.md
@@ -38,7 +38,7 @@ use `PYTHON_VERSION="2.7"` with `UNICODE_WIDTH=32`):

```bash
# Build the python packages
docker-compose run -e PYTHON_VERSION="2.7" -e UNICODE_WIDTH=16 python-manylinux1
docker-compose run -e PYTHON_VERSION="2.7" -e UNICODE_WIDTH=16 centos-python-manylinux1
# Now the new packages are located in the dist/ folder
ls -l dist/
```
2 changes: 1 addition & 1 deletion python/manylinux2010/README.md
@@ -42,7 +42,7 @@ use `PYTHON_VERSION="2.7"` with `UNICODE_WIDTH=32`):

```bash
# Build the python packages
docker-compose run -e PYTHON_VERSION="2.7" -e UNICODE_WIDTH=16 python-manylinux2010
docker-compose run -e PYTHON_VERSION="2.7" -e UNICODE_WIDTH=16 centos-python-manylinux2010
# Now the new packages are located in the dist/ folder
ls -l dist/
```
