ARROW-8222: [C++] Use bcp to make a slim boost for bundled build
This patch switches our bundled boost ep to a slimmer version of Boost, built with the script added at `cpp/build-support/trim-boost.sh`. The official boostorg tarball serves as a fallback URL in case ours is unavailable. As we've seen with other ep's, download hosts are sometimes subject to rate limiting or other downtime, so having redundancy would be good more generally.
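As a rough illustration of the fallback idea only (the patch itself does this in CMake by appending a second entry to `BOOST_SOURCE_URL`), a shell sketch could try each URL in order and stop at the first success. The names `fetch_with_fallback`, `curl_fetch`, and the `FETCH` override are hypothetical and do not appear in the patch.

```shell
#!/bin/sh
# Sketch only: download from the first URL that works.
# fetch_with_fallback, curl_fetch and FETCH are hypothetical names.

# Default fetcher: curl, failing on HTTP errors (-f), quiet but showing
# errors (-sS), following redirects (-L), writing to the given path (-o).
curl_fetch() {
  curl -fsSL "$1" -o "$2"
}

# Usage: fetch_with_fallback DEST URL [URL...]
fetch_with_fallback() {
  dest=$1
  shift
  for url in "$@"; do
    # FETCH may name another command or function taking (url, dest).
    if "${FETCH:-curl_fetch}" "$url" "$dest"; then
      echo "downloaded $url"
      return 0
    fi
    echo "failed: $url" >&2
  done
  return 1
}
```

With the URLs from this patch and the default version, a call might look like `fetch_with_fallback boost_1_71_0.tar.gz https://dl.bintray.com/ursalabs/arrow-boost/boost_1_71_0.tar.gz https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.gz`.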

The resulting tarball is 10 MB, much smaller than the 113 MB of the full boost but larger than the 800 KB I suggested in the ticket. This is because building regex and filesystem requires `config build boost_install headers`, which add some weight. Boost.Build also seems to require `log`, and `log` in turn apparently needs `predef`.

In addition to slimming the boost tarball, this patch also refines the cmake logic that determines when boost is required. Thanks to recent-ish efforts to reduce usage of boost, we need it only for tests, for Gandiva, and for Parquet when compiling with gcc < 4.9. We also need boost when building thrift_ep. By narrowing the Parquet condition to just gcc < 4.9, we are able to remove boost entirely from the R macOS and Windows packages.
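The condition described above can be sketched as a small shell predicate. This is not the CMake code from the patch; `boost_required` and `version_less` are hypothetical helpers that only mirror the rule, with integration tests folded into the "tests" argument and the version comparison limited to major.minor.

```shell
#!/bin/sh
# Sketch only: mirrors the rule in the commit message, not the CMake logic.

# True if version $1 is lower than version $2 (major.minor only).
version_less() {
  a_major=${1%%.*}
  b_major=${2%%.*}
  case $1 in *.*) a_rest=${1#*.} ;; *) a_rest=0 ;; esac
  case $2 in *.*) b_rest=${2#*.} ;; *) b_rest=0 ;; esac
  a_minor=${a_rest%%.*}
  b_minor=${b_rest%%.*}
  [ "$a_major" -lt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -lt "$b_minor" ]; }
}

# Usage: boost_required TESTS GANDIVA BUNDLED_THRIFT PARQUET COMPILER_ID COMPILER_VERSION
# The first four arguments are ON or OFF.
boost_required() {
  tests=$1
  gandiva=$2
  bundled_thrift=$3
  parquet=$4
  compiler_id=$5
  compiler_version=$6
  [ "$tests" = ON ] && return 0
  [ "$gandiva" = ON ] && return 0
  [ "$bundled_thrift" = ON ] && return 0
  # Parquet alone needs boost only with gcc older than 4.9.
  if [ "$parquet" = ON ] && [ "$compiler_id" = GNU ] &&
    version_less "$compiler_version" 4.9; then
    return 0
  fi
  return 1
}
```

For example, `boost_required OFF OFF OFF ON GNU 4.8.5` succeeds, while the same call with a newer gcc or with Clang does not, which is the case that lets Parquet-only packages drop boost.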

Outstanding questions/to-dos that I'm aware of:

* [x] Put the boost bundle somewhere official. I plan to put it at https://dl.bintray.com/ursalabs/arrow-boost/ but am open to suggestion if someone has a better idea.
* [x] ~~Add a crossbow job to build the boost bundle: to update the boost version or change what's included in the bundle, edit cpp/build-support/trim-boost.sh and run an on-demand build via PR comment.~~ On further reflection, crossbow isn't appropriate because it would mean that anyone could edit the script and overwrite the bundle that gets pulled into all source builds. Instead, I added a script that can be run locally and requires only a bintray user and token in the ursalabs organization (available to any committer on request).
* [ ] Something with [namespacing](https://issues.apache.org/jira/browse/ARROW-4286)? I'm not sure how much of a concern this is since Arrow itself doesn't really require boost anymore.

Closes apache#6734 from nealrichardson/bcp

Lead-authored-by: Neal Richardson <[email protected]>
Co-authored-by: François Saint-Jacques <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
3 people authored and wesm committed Mar 30, 2020
1 parent 2a33338 commit 9621719
Showing 21 changed files with 197 additions and 63 deletions.
4 changes: 1 addition & 3 deletions ci/scripts/PKGBUILD
@@ -24,8 +24,7 @@ pkgdesc="Apache Arrow is a cross-language development platform for in-memory dat
arch=("any")
url="https://arrow.apache.org/"
license=("Apache-2.0")
- depends=("${MINGW_PACKAGE_PREFIX}-boost"
- "${MINGW_PACKAGE_PREFIX}-thrift"
+ depends=("${MINGW_PACKAGE_PREFIX}-thrift"
"${MINGW_PACKAGE_PREFIX}-snappy"
"${MINGW_PACKAGE_PREFIX}-zlib"
"${MINGW_PACKAGE_PREFIX}-lz4"
@@ -75,7 +74,6 @@ build() {
${MINGW_PREFIX}/bin/cmake.exe \
${ARROW_CPP_DIR} \
-G "MSYS Makefiles" \
- -DARROW_BOOST_USE_SHARED=OFF \
-DARROW_BUILD_SHARED=OFF \
-DARROW_BUILD_STATIC=ON \
-DARROW_BUILD_UTILITIES=OFF \
1 change: 0 additions & 1 deletion ci/scripts/r_windows_build.sh
@@ -43,7 +43,6 @@ export PKG_CONFIG="/${MINGW_PREFIX}/bin/pkg-config --static"

cp $ARROW_HOME/ci/scripts/PKGBUILD .
export PKGEXT='.pkg.tar.xz' # pacman default changed to .zst in 2020, but keep the old ext for compat
- unset BOOST_ROOT
printenv
makepkg-mingw --noconfirm --noprogressbar --skippgpcheck --nocheck --syncdeps --cleanbuild

62 changes: 62 additions & 0 deletions cpp/build-support/trim-boost.sh
@@ -0,0 +1,62 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# This script is used to make the subset of boost that we actually use,
# so that we don't have to download the whole big boost project when we build
# boost from source.
#
# After running this script, run upload-boost.sh to put the bundle on bintray

set -eu

# if version is not defined by the caller, set a default.
: ${BOOST_VERSION:=1.71.0}
: ${BOOST_FILE:=boost_${BOOST_VERSION//./_}}
: ${BOOST_URL:=https://dl.bintray.com/boostorg/release/${BOOST_VERSION}/source/${BOOST_FILE}.tar.gz}

# Arrow tests require these
BOOST_LIBS="system.hpp filesystem.hpp"
# Add these to be able to build those
BOOST_LIBS="$BOOST_LIBS config build boost_install headers log predef"
# Parquet needs this (if using gcc < 4.9)
BOOST_LIBS="$BOOST_LIBS regex.hpp"
# Gandiva needs these
BOOST_LIBS="$BOOST_LIBS functional/hash.hpp multiprecision/cpp_int.hpp"
# These are for Thrift when Thrift_SOURCE=BUNDLED
BOOST_LIBS="$BOOST_LIBS algorithm/string.hpp locale.hpp noncopyable.hpp numeric/conversion/cast.hpp scope_exit.hpp scoped_array.hpp shared_array.hpp tokenizer.hpp version.hpp"

if [ ! -d ${BOOST_FILE} ]; then
curl -L "${BOOST_URL}" > ${BOOST_FILE}.tar.gz
tar -xzf ${BOOST_FILE}.tar.gz
fi

pushd ${BOOST_FILE}

if [ ! -f "dist/bin/bcp" ]; then
./bootstrap.sh
./b2 tools/bcp
fi
mkdir -p ${BOOST_FILE}
./dist/bin/bcp ${BOOST_LIBS} ${BOOST_FILE}

tar -czf ${BOOST_FILE}.tar.gz ${BOOST_FILE}/
# Resulting tarball is in ${BOOST_FILE}/${BOOST_FILE}.tar.gz

popd
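The script derives the tarball base name with the bash-only pattern substitution `${BOOST_VERSION//./_}`. A portable sketch of the same naming, assuming only `tr` and `printf`, might look like this; `boost_file_name` and `boost_source_url` are hypothetical helpers, not part of the patch.

```shell
#!/bin/sh
# Sketch only: reproduce the naming used by trim-boost.sh without bash-isms.

# 1.71.0 -> boost_1_71_0
boost_file_name() {
  printf 'boost_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

# Upstream source URL that trim-boost.sh uses by default for a version.
boost_source_url() {
  printf 'https://dl.bintray.com/boostorg/release/%s/source/%s.tar.gz\n' \
    "$1" "$(boost_file_name "$1")"
}
```

Calling `boost_file_name 1.71.0` yields the same `boost_1_71_0` base name that the script computes for its default version.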
54 changes: 54 additions & 0 deletions cpp/build-support/upload-boost.sh
@@ -0,0 +1,54 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# This assumes you've just run cpp/build-support/trim-boost.sh, so the file
# to upload is at ${BOOST_FILE}/${BOOST_FILE}.tar.gz
#
# Also, you must have a bintray account on the "ursalabs" organization and
# set the BINTRAY_USER and BINTRAY_APIKEY env vars.

set -eu

# if version is not defined by the caller, set a default.
: ${BOOST_VERSION:=1.71.0}
: ${BOOST_FILE:=boost_${BOOST_VERSION//./_}}
: ${DST_URL:=https://api.bintray.com/content/ursalabs/arrow-boost/arrow-boost/latest}

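# Use ${VAR:-} so that an unset variable reaches the message below
# instead of aborting under `set -u`.
if [ "${BINTRAY_USER:-}" = "" ]; then
echo "Must set BINTRAY_USER"
exit 1
fi
if [ "${BINTRAY_APIKEY:-}" = "" ]; then
echo "Must set BINTRAY_APIKEY"
exit 1
fi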

upload_file() {
if [ -f "$1" ]; then
echo "PUT ${DST_URL}/$1?override=1&publish=1"
curl -sS -u "${BINTRAY_USER}:${BINTRAY_APIKEY}" -X PUT "${DST_URL}/$1?override=1&publish=1" --data-binary "@$1"
else
echo "$1 not found"
fi
}

pushd ${BOOST_FILE}
upload_file ${BOOST_FILE}.tar.gz
popd
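One detail worth keeping in mind with `set -u`: expanding an unset variable aborts the script before any friendly message is printed, so credential checks are usually written with a default expansion such as `${VAR:-}`. A minimal sketch, using a hypothetical `require_value` helper that is not part of the patch:

```shell
#!/bin/sh
# Sketch only: fail with a message when a required value is empty.
# The caller passes "${VAR:-}" so that an unset VAR becomes an empty string
# rather than triggering an "unbound variable" error under `set -u`.
require_value() {
  if [ -z "$2" ]; then
    echo "Must set $1" >&2
    return 1
  fi
}
```

In a script like the one above this could be used as `require_value BINTRAY_USER "${BINTRAY_USER:-}" || exit 1`.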
39 changes: 35 additions & 4 deletions cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -279,10 +279,22 @@ if(DEFINED ENV{ARROW_BOOST_URL})
else()
string(REPLACE "." "_" ARROW_BOOST_BUILD_VERSION_UNDERSCORES
${ARROW_BOOST_BUILD_VERSION})
+ # This is the trimmed boost bundle we maintain.
+ # See cpp/build-support/trim-boost.sh
set(
BOOST_SOURCE_URL
- "https://dl.bintray.com/boostorg/release/${ARROW_BOOST_BUILD_VERSION}/source/boost_${ARROW_BOOST_BUILD_VERSION_UNDERSCORES}.tar.gz"
+ "https://dl.bintray.com/ursalabs/arrow-boost/boost_${ARROW_BOOST_BUILD_VERSION_UNDERSCORES}.tar.gz"
)
+ if(NOT CMAKE_VERSION VERSION_LESS 3.7)
+ # Append as a backup URL the full source from boostorg
+ # Feature only available starting in 3.7
+ # (and VERSION_GREATER_EQUAL also only available starting in 3.7)
+ list(
+ APPEND
+ BOOST_SOURCE_URL
+ "https://dl.bintray.com/boostorg/release/${ARROW_BOOST_BUILD_VERSION}/source/boost_${ARROW_BOOST_BUILD_VERSION_UNDERSCORES}.tar.gz"
+ )
+ endif()
endif()

if(DEFINED ENV{ARROW_BROTLI_URL})
@@ -642,10 +654,25 @@ set(Boost_ADDITIONAL_VERSIONS
"1.60.0"
"1.60")

+ # Thrift needs Boost if we're building the bundled version,
+ # so we first need to determine whether we're building it
+ if(ARROW_WITH_THRIFT AND Thrift_SOURCE STREQUAL "AUTO")
+ find_package(Thrift 0.11.0 MODULE)
+ if(NOT Thrift_FOUND AND NOT THRIFT_FOUND)
+ set(Thrift_SOURCE "BUNDLED")
+ endif()
+ endif()
+
# - Parquet requires boost only with gcc 4.8 (because of missing std::regex).
# - Gandiva has a compile-time (header-only) dependency on Boost, not runtime.
# - Tests needs Boost at runtime.
- if(ARROW_BUILD_INTEGRATION OR ARROW_BUILD_TESTS OR ARROW_GANDIVA OR ARROW_PARQUET)
+ if(ARROW_BUILD_INTEGRATION
+ OR ARROW_BUILD_TESTS
+ OR ARROW_GANDIVA
+ OR (ARROW_WITH_THRIFT AND Thrift_SOURCE STREQUAL "BUNDLED")
+ OR (ARROW_PARQUET
+ AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
+ AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "4.9"))
set(ARROW_BOOST_REQUIRED TRUE)
else()
set(ARROW_BOOST_REQUIRED FALSE)
@@ -1097,8 +1124,12 @@ macro(build_thrift)
endmacro()

if(ARROW_WITH_THRIFT)
- # Thrift c++ code generated by 0.13 requires 0.11 or greater
- resolve_dependency_with_version(Thrift 0.11.0)
+ # We already may have looked for Thrift earlier, when considering whether
+ # to build Boost, so don't look again if already found.
+ if(NOT Thrift_FOUND AND NOT THRIFT_FOUND)
+ # Thrift c++ code generated by 0.13 requires 0.11 or greater
+ resolve_dependency_with_version(Thrift 0.11.0)
+ endif()
# TODO: Don't use global includes but rather target_include_directories
include_directories(SYSTEM ${THRIFT_INCLUDE_DIR})
endif()
31 changes: 15 additions & 16 deletions cpp/src/arrow/stl_test.cc
@@ -24,8 +24,6 @@
#include <vector>

#include <gtest/gtest.h>
- #include <boost/optional.hpp>
- #include <boost/range/adaptor/transformed.hpp>

#include "arrow/memory_pool.h"
#include "arrow/stl.h"
@@ -34,17 +32,11 @@
#include "arrow/testing/gtest_util.h"
#include "arrow/type.h"
#include "arrow/type_fwd.h"
+ #include "arrow/util/optional.h"

using primitive_types_tuple = std::tuple<int8_t, int16_t, int32_t, int64_t, uint8_t,
uint16_t, uint32_t, uint64_t, bool, std::string>;

- using boost_optional_types_tuple =
- std::tuple<boost::optional<int8_t>, boost::optional<int16_t>,
- boost::optional<int32_t>, boost::optional<int64_t>,
- boost::optional<uint8_t>, boost::optional<uint16_t>,
- boost::optional<uint32_t>, boost::optional<uint64_t>,
- boost::optional<bool>, boost::optional<std::string>>;
-
using raw_pointer_optional_types_tuple =
std::tuple<int8_t*, int16_t*, int32_t*, int64_t*, uint8_t*, uint16_t*, uint32_t*,
uint64_t*, bool*, std::string*>;
@@ -108,6 +100,12 @@ struct TestInt32Type {

namespace arrow {

+ using optional_types_tuple =
+ std::tuple<util::optional<int8_t>, util::optional<int16_t>, util::optional<int32_t>,
+ util::optional<int64_t>, util::optional<uint8_t>, util::optional<uint16_t>,
+ util::optional<uint32_t>, util::optional<uint64_t>, util::optional<bool>,
+ util::optional<std::string>>;
+
template <>
struct CTypeTraits<CustomOptionalTypeMock> {
using ArrowType = ::arrow::StringType;
@@ -248,15 +246,15 @@ TEST(TestTableFromTupleVector, ListType) {
}

TEST(TestTableFromTupleVector, ReferenceTuple) {
- using boost::adaptors::transform;
-
std::vector<std::string> names{"column1", "column2", "column3", "column4", "column5",
"column6", "column7", "column8", "column9", "column10"};
std::vector<CustomType> rows{
{-1, -2, -3, -4, 1, 2, 3, 4, true, std::string("Tests")},
{-10, -20, -30, -40, 10, 20, 30, 40, false, std::string("Other")}};
- auto rng_rows =
- transform(rows, [](const CustomType& c) -> decltype(c.tie()) { return c.tie(); });
+ std::vector<decltype(rows[0].tie())> rng_rows{
+ rows[0].tie(),
+ rows[1].tie(),
+ };
std::shared_ptr<Table> table;
ASSERT_OK(TableFromTupleRange(default_memory_pool(), rng_rows, names, &table));

@@ -289,12 +287,13 @@ TEST(TestTableFromTupleVector, ReferenceTuple) {
TEST(TestTableFromTupleVector, NullableTypesWithBoostOptional) {
std::vector<std::string> names{"column1", "column2", "column3", "column4", "column5",
"column6", "column7", "column8", "column9", "column10"};
- using types_tuple = boost_optional_types_tuple;
+ using types_tuple = optional_types_tuple;
std::vector<types_tuple> rows{
types_tuple(-1, -2, -3, -4, 1, 2, 3, 4, true, std::string("Tests")),
types_tuple(-10, -20, -30, -40, 10, 20, 30, 40, false, std::string("Other")),
- types_tuple(boost::none, boost::none, boost::none, boost::none, boost::none,
- boost::none, boost::none, boost::none, boost::none, boost::none),
+ types_tuple(util::nullopt, util::nullopt, util::nullopt, util::nullopt,
+ util::nullopt, util::nullopt, util::nullopt, util::nullopt,
+ util::nullopt, util::nullopt),
};
std::shared_ptr<Table> table;
ASSERT_OK(TableFromTupleRange(default_memory_pool(), rows, names, &table));
12 changes: 4 additions & 8 deletions cpp/src/arrow/util/bit_util_test.cc
@@ -28,8 +28,6 @@

#include <gtest/gtest.h>

- #include <boost/utility.hpp> // IWYU pragma: export
-
#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/testing/gtest_common.h"
@@ -920,12 +918,10 @@ TEST(BitUtil, CoveringBytes) {
}

TEST(BitUtil, TrailingBits) {
- EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 0), 0);
- EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 1), 1);
- EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 64),
- BOOST_BINARY(1 1 1 1 1 1 1 1));
- EXPECT_EQ(BitUtil::TrailingBits(BOOST_BINARY(1 1 1 1 1 1 1 1), 100),
- BOOST_BINARY(1 1 1 1 1 1 1 1));
+ EXPECT_EQ(BitUtil::TrailingBits(0xFF, 0), 0);
+ EXPECT_EQ(BitUtil::TrailingBits(0xFF, 1), 1);
+ EXPECT_EQ(BitUtil::TrailingBits(0xFF, 64), 0xFF);
+ EXPECT_EQ(BitUtil::TrailingBits(0xFF, 100), 0xFF);
EXPECT_EQ(BitUtil::TrailingBits(0, 1), 0);
EXPECT_EQ(BitUtil::TrailingBits(0, 64), 0);
EXPECT_EQ(BitUtil::TrailingBits(1LL << 63, 0), 0);
20 changes: 9 additions & 11 deletions cpp/src/arrow/util/rle_encoding_test.cc
@@ -24,8 +24,6 @@

#include <gtest/gtest.h>

- #include <boost/utility.hpp> // IWYU pragma: export
-
#include "arrow/array.h"
#include "arrow/buffer.h"
#include "arrow/testing/random.h"
@@ -47,11 +45,11 @@ TEST(BitArray, TestBool) {

// Write alternating 0's and 1's
for (int i = 0; i < 8; ++i) {
- bool result = writer.PutValue(i % 2, 1);
- EXPECT_TRUE(result);
+ EXPECT_TRUE(writer.PutValue(i % 2, 1));
}
writer.Flush();
- EXPECT_EQ((int)buffer[0], BOOST_BINARY(1 0 1 0 1 0 1 0));
+
+ EXPECT_EQ(buffer[0], 0xAA /* 0b10101010 */);

// Write 00110011
for (int i = 0; i < 8; ++i) {
@@ -72,8 +70,8 @@ TEST(BitArray, TestBool) {
writer.Flush();

// Validate the exact bit value
- EXPECT_EQ((int)buffer[0], BOOST_BINARY(1 0 1 0 1 0 1 0));
- EXPECT_EQ((int)buffer[1], BOOST_BINARY(1 1 0 0 1 1 0 0));
+ EXPECT_EQ(buffer[0], 0xAA /* 0b10101010 */);
+ EXPECT_EQ(buffer[1], 0xCC /* 0b11001100 */);

// Use the reader and validate
BitUtil::BitReader reader(buffer, len);
@@ -285,7 +283,7 @@ TEST(Rle, SpecificSequences) {
}

for (int width = 9; width <= MAX_WIDTH; ++width) {
- ValidateRle(values, width, NULL,
+ ValidateRle(values, width, nullptr,
2 * (1 + static_cast<int>(BitUtil::CeilDiv(width, 8))));
}

@@ -296,16 +294,16 @@ TEST(Rle, SpecificSequences) {
int num_groups = static_cast<int>(BitUtil::CeilDiv(100, 8));
expected_buffer[0] = static_cast<uint8_t>((num_groups << 1) | 1);
for (int i = 1; i <= 100 / 8; ++i) {
- expected_buffer[i] = BOOST_BINARY(1 0 1 0 1 0 1 0);
+ expected_buffer[i] = 0xAA /* 0b10101010 */;
}
// Values for the last 4 0 and 1's. The upper 4 bits should be padded to 0.
- expected_buffer[100 / 8 + 1] = BOOST_BINARY(0 0 0 0 1 0 1 0);
+ expected_buffer[100 / 8 + 1] = 0x0A /* 0b00001010 */;

// num_groups and expected_buffer only valid for bit width = 1
ValidateRle(values, 1, expected_buffer, 1 + num_groups);
for (int width = 2; width <= MAX_WIDTH; ++width) {
int num_values = static_cast<int>(BitUtil::CeilDiv(100, 8)) * 8;
- ValidateRle(values, width, NULL,
+ ValidateRle(values, width, nullptr,
1 + static_cast<int>(BitUtil::CeilDiv(width * num_values, 8)));
}
}
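The hex literals that replace `BOOST_BINARY` in these tests can be checked independently of the C++ code. The following sketch, using a hypothetical `bits_to_int` helper, converts a space-separated bit pattern (written the way `BOOST_BINARY` takes it) to an integer so it can be compared with the hex value.

```shell
#!/bin/sh
# Sketch only: fold a space-separated bit pattern into its integer value,
# most significant bit first.
bits_to_int() {
  value=0
  for bit in $1; do
    value=$((value * 2 + bit))
  done
  echo "$value"
}
```

For instance, `[ "$(bits_to_int "1 0 1 0 1 0 1 0")" -eq $((0xAA)) ]` succeeds, matching the `0xAA /* 0b10101010 */` literal used in the updated assertions.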
4 changes: 2 additions & 2 deletions cpp/src/gandiva/cache.h
@@ -28,11 +28,11 @@ class Cache {
public:
explicit Cache(size_t capacity = CACHE_SIZE) : cache_(capacity) {}
ValueType GetModule(KeyType cache_key) {
- boost::optional<ValueType> result;
+ arrow::util::optional<ValueType> result;
mtx_.lock();
result = cache_.get(cache_key);
mtx_.unlock();
- return result != boost::none ? *result : nullptr;
+ return result != arrow::util::nullopt ? *result : nullptr;
}

void PutModule(KeyType cache_key, ValueType module) {