nanobind is a small binding library that exposes C++ types in Python and vice versa. It is reminiscent of Boost.Python and pybind11 and uses near-identical syntax. In contrast to these existing tools, nanobind is more efficient: bindings compile in a shorter amount of time, producing smaller binaries with better runtime performance.
I started the pybind11 project back in 2015 to generate better and more efficient C++/Python bindings. Thanks to many amazing contributions by others, pybind11 has become a core dependency of software across the world including flagship projects like PyTorch and TensorFlow. Every day, the repository is cloned more than 100,000 times. Hundreds of contributed extensions and generalizations address use cases of this diverse audience. However, all of this success also came with costs: the complexity of the library grew tremendously, which had a negative impact on efficiency.
Ironically, the situation today feels like 2015 all over again: binding generation with existing tools (Boost.Python, pybind11) is slow and produces enormous binaries with overheads on runtime performance. At the same time, key improvements in C++17 and Python 3.8 provide opportunities for drastic simplifications. Therefore, I am starting another binding project. This time, the scope is intentionally limited so that this doesn't turn into an endless cycle.
TLDR: nanobind bindings compile ~2-3× faster, producing ~3× smaller binaries, with up to ~8× lower overheads on runtime performance (when comparing to pybind11 with -Os size optimizations).
The following experiments analyze the performance of a very large function-heavy (func) and class-heavy (class) binding microbenchmark compiled using Boost.Python, pybind11, and nanobind in both debug and size-optimized (opt) modes. A comparison with cppyy (which uses dynamic compilation) is also shown later. Details on the experimental setup can be found here.
The first plot contrasts the compilation time, where "number ×" annotations denote the amount of time spent relative to nanobind. As shown below, nanobind achieves a consistent ~2-3× improvement compared to pybind11.
nanobind also greatly reduces the binary size of the compiled bindings. There is a roughly 3× improvement compared to pybind11 and an 8-9× improvement compared to Boost.Python (both with size optimizations).
The last experiment compares the runtime performance overheads by calling one of the bound functions many times in a loop. Here, it is also interesting to compare against cppyy (gray bar) and a pure Python implementation that runs bytecode without binding overheads (hatched red bar).
This data shows that the overhead of calling a nanobind function is lower than that of an equivalent function call done within CPython. The functions benchmarked here don't perform CPU-intensive work, so this mainly measures the overheads of performing a function call, boxing/unboxing arguments and return values, etc.
The difference to pybind11 is significant: a ~2× improvement for simple functions, and an ~8× improvement when classes are being passed around. Complexities in pybind11 related to overload resolution, multiple inheritance, and holders are the main reasons for this difference. Those features were either simplified or completely removed in nanobind.
Finally, there is a ~1.4× improvement in both experiments compared to cppyy (please ignore the two [debug] columns: I did not feel comfortable adjusting the JIT compilation flags; all cppyy bindings are therefore optimized).
cppyy is based on dynamic parsing of C++ code and just-in-time (JIT) compilation of bindings via the LLVM compiler infrastructure. The authors of cppyy report that their tool produces bindings with much lower overheads compared to pybind11, and the above plots show that this is indeed true. However, nanobind retakes the performance lead in these experiments.
With speed gone as the main differentiating factor, other qualitative differences make these two tools appropriate to different audiences: cppyy has its origin in CERN's ROOT mega-project and must be highly dynamic to work with that codebase: it can parse header files to generate bindings as needed. cppyy works particularly well together with PyPy and can avoid boxing/unboxing overheads with this combination. The main downside of cppyy is that it depends on big and complex machinery (Cling/Clang/LLVM) that must be deployed on the user's side and then run there. There isn't a way of pre-generating bindings and then shipping just the output of this process.
nanobind is relatively static in comparison: you must tell it which functions to expose via binding declarations. These declarations offer a high degree of flexibility that users will typically use to create bindings that feel pythonic. At compile-time, those declarations turn into a sequence of CPython API calls, which produces self-contained bindings that are easy to redistribute via PyPI or elsewhere. Tools like cibuildwheel and scikit-build can fully automate the process of generating Python wheels for each target platform. A minimal example project shows how to do this automatically via GitHub Actions.
nanobind and pybind11 are the most similar of all of the binding tools compared above.
The main difference between them is a change in philosophy: pybind11 must deal with all of C++ to bind complex legacy codebases, while nanobind targets a smaller C++ subset. The codebase has to adapt to the binding tool and not the other way around! This restriction allows nanobind to be simpler and faster. Pull requests with extensions and generalizations were welcomed in pybind11, but they will likely be rejected in this project.
An overview of removed features is provided in a separate document. Besides feature removal, the rewrite was also an opportunity to address long-standing performance issues in pybind11:
- C++ objects are now co-located with the Python object whenever possible (less pointer chasing compared to pybind11). The per-instance overhead for wrapping a C++ type into a Python object shrinks by 2.3× (pybind11: 56 bytes, nanobind: 24 bytes).
- C++ function binding information is now co-located with the Python function object (less pointer chasing).
- C++ type binding information is now co-located with the Python type object (less pointer chasing, fewer hashtable lookups).
- nanobind internally replaces std::unordered_map with a more efficient hash table (tsl::robin_map, which is included as a git submodule).
- Function calls from/to Python are realized using PEP 590 vector calls, which gives a nice speed boost. The main function dispatch loop no longer allocates heap memory.
- pybind11 was designed as a header-only library, which is generally a good thing because it simplifies the compilation workflow. However, one major downside of this is that a large amount of redundant code has to be compiled in each binding file (e.g., the function dispatch loop and all of the related internal data structures). nanobind compiles a separate shared or static support library (libnanobind) and links it against the binding code to avoid redundant compilation. When using the CMake nanobind_add_module() function, this all happens transparently.
- #include <pybind11/pybind11.h> pulls in a large portion of the STL (about 2.1 MiB of headers with Clang and libc++). nanobind minimizes STL usage to avoid this problem. Type casters even for basic types like std::string require an explicit opt-in by including an extra header file (e.g. #include <nanobind/stl/string.h>).
- pybind11 is dependent on link time optimization (LTO) to produce reasonably-sized bindings, which makes linking a build time bottleneck. With nanobind's split into a precompiled core library and minimal metatemplating, LTO is no longer important.
- nanobind maintains efficient internal data structures for lifetime management (needed for nb::keep_alive, nb::rv_policy::reference_internal, the std::shared_ptr interface, etc.). With these changes, it is no longer necessary that bound types are weak-referenceable, which saves a pointer per instance.
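The co-location idea that recurs in the list above can be illustrated with a small layout sketch in plain C++. This is not nanobind's actual instance layout: ObjectHeader, IndirectInstance, InlineInstance, Point, and lives_inside are hypothetical names, and the two-word header only stands in for a Python object header. Under those assumptions, the sketch suggests why storing the C++ value inside the wrapper saves both a pointer field and a separate allocation.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for a Python object header (refcount + type pointer).
struct ObjectHeader {
    std::size_t refcount;
    const void *type;
};

// Example C++ type to be wrapped.
struct Point {
    float x, y;
};

// Indirect layout: the wrapper stores only a pointer, and the value lives in
// a separate allocation (one extra pointer hop on every access).
struct IndirectInstance {
    ObjectHeader head;
    Point *value;
};

// Co-located layout: the value is stored directly behind the header.
struct InlineInstance {
    ObjectHeader head;
    alignas(Point) unsigned char storage[sizeof(Point)];

    // Valid only after a Point has been constructed in 'storage'.
    Point *value() { return std::launder(reinterpret_cast<Point *>(storage)); }
};

// Total bytes needed per wrapped Point in each layout.
constexpr std::size_t indirect_bytes = sizeof(IndirectInstance) + sizeof(Point);
constexpr std::size_t inline_bytes = sizeof(InlineInstance);

// Returns true if 'p' points into the memory occupied by 'inst'.
inline bool lives_inside(const InlineInstance &inst, const Point *p) {
    auto *begin = reinterpret_cast<const unsigned char *>(&inst);
    auto *q = reinterpret_cast<const unsigned char *>(p);
    return q >= begin && q + sizeof(Point) <= begin + sizeof(InlineInstance);
}
```

Constructing a Point into the storage member with placement new and then calling value() reaches the object without leaving the wrapper's own memory, whereas the indirect layout always dereferences a second pointer.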
Besides performance improvements, nanobind includes several quality-of-life improvements for developers:
- When the Python interpreter shuts down, nanobind reports instance, type, and function leaks related to bindings, which is useful for tracking down reference counting issues.
- nanobind deletes its internal data structures when the Python interpreter terminates, which avoids memory leak reports in tools like valgrind.
- In pybind11, function docstrings are pre-rendered while the binding code runs (.def(...)). This can create confusing signatures containing C++ types when the binding code of those C++ types hasn't yet run. nanobind does not pre-render function docstrings: they are created on the fly when queried.
- nanobind docstrings have improved out-of-the-box compatibility with tools like Sphinx.
- nanobind has greatly improved support for exchanging tensor data structures with modern array programming frameworks.
nanobind depends on recent versions of everything:
- C++17: The if constexpr feature was crucial to simplify the internal meta-templating of this library.
- Python 3.8+: nanobind heavily relies on PEP 590 vector calls that were introduced in version 3.8.
- CMake 3.15+: Recent CMake versions include important improvements to FindPython that this project depends on.
- Supported compilers: Clang 7, GCC 8, MSVC2019 (or newer) are officially supported.
Other compilers like MinGW, Intel (icpc, oneAPI), NVIDIA (PGI, nvcc) may or may not work but aren't officially supported. Pull requests to work around bugs in these compilers will not be accepted, as similar changes introduced significant complexity in pybind11. Instead, please file bugs with the vendors so that they will fix their compilers.
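As a rough illustration of why the C++17 requirement matters, the sketch below shows how if constexpr lets a single function template select a conversion path per type, where pre-C++17 code would typically need tag dispatch or SFINAE overloads. The describe() function is a hypothetical example written for this purpose and is not part of nanobind.

```cpp
#include <string>
#include <type_traits>

// Hypothetical helper: picks a conversion strategy at compile time.
// Only the branch matching T is instantiated, so each branch may use
// operations that would not compile for the other types.
template <typename T>
std::string describe(const T &value) {
    if constexpr (std::is_same_v<T, bool>) {
        return value ? "bool:true" : "bool:false";
    } else if constexpr (std::is_integral_v<T>) {
        return "int:" + std::to_string(value);
    } else if constexpr (std::is_floating_point_v<T>) {
        return "float:" + std::to_string(value);
    } else {
        // Fallback for types that can be converted to std::string
        return "str:" + std::string(value);
    }
}
```

For instance, describe(42) takes the integral branch and describe(std::string("hi")) takes the fallback branch, all within one function body and without helper overloads.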
nanobind integrates with CMake to simplify binding compilation. Please see the separate writeup for details.
The easiest way to get started is by cloning nanobind_example, which is a minimal project with nanobind-based bindings compiled via CMake and scikit-build. It also shows how to use GitHub Actions to deploy binary wheels for a variety of platforms.
nanobind mostly follows the pybind11 API, hence the pybind11 documentation is the main source of documentation for this project. A number of simplifications and noteworthy changes are detailed below.
- Namespace. nanobind types and functions are located in the nanobind namespace. The namespace nb = nanobind; shorthand alias is recommended.
- Macros. The PYBIND11_* macros (e.g., PYBIND11_OVERRIDE(..)) were renamed to NB_* (e.g., NB_OVERRIDE(..)).
- Shared pointers and holders. nanobind removes the concept of a holder type, which caused inefficiencies and introduced complexity in pybind11. This has implications on object ownership, shared ownership, and interactions with C++ shared/unique pointers. Please see the following separate document for the nitty-gritty details.
The gist is that use of shared/unique pointers requires one or both of the following optional header files:
Binding functions that take std::unique_ptr<T> arguments involves some limitations that can be avoided by changing their signatures to use std::unique_ptr<T, nb::deleter<T>> instead. Usage of std::enable_shared_from_this<T> is prohibited and will raise a compile-time assertion. This is consistent with the philosophy of this library: the codebase has to adapt to the binding tool and not the other way around.
It is no longer necessary to specify holder types in the type declaration:
pybind11:
py::class_<MyType, std::shared_ptr<MyType>>(m, "MyType") ...
nanobind:
nb::class_<MyType>(m, "MyType") ...
- Null pointers. In contrast to pybind11, nanobind by default does not permit None-valued arguments during overload resolution. They need to be enabled explicitly using the .none() member of an argument annotation.
.def("func", &func, "arg"_a.none());
It is also possible to set a None default value as follows:
.def("func", &func, "arg"_a.none() = nb::none());
- Implicit type conversions. In pybind11, implicit conversions were specified using a follow-up function call. In nanobind, they are specified within the constructor declarations:
pybind11:

py::class_<MyType>(m, "MyType")
    .def(py::init<MyOtherType>());
py::implicitly_convertible<MyOtherType, MyType>();

nanobind:

nb::class_<MyType>(m, "MyType")
    .def(nb::init_implicit<MyOtherType>());
- Custom constructors: In pybind11, custom constructors (i.e. ones that do not already exist in the C++ class) could be specified as a lambda function returning an instance of the desired type.

py::class_<MyType>(m, "MyType")
    .def(py::init([](int) { return MyType(...); }));

Unfortunately, the implementation of this feature was quite complex and often required further internal calls to the move or copy constructor. nanobind instead reverts to how pybind11 originally implemented this feature using in-place construction ("placement new"):

nb::class_<MyType>(m, "MyType")
    .def("__init__", [](MyType *t) { new (t) MyType(...); });

The provided lambda function will be called with a pointer to uninitialized memory that has already been allocated (this memory region is co-located with the Python object for reasons of efficiency). The lambda function can then either run an in-place constructor and return normally (in which case the instance is assumed to be correctly constructed) or fail by raising an exception.
- Trampoline classes. Trampolines, i.e., polymorphic class implementations that forward virtual function calls to Python, now require an extra NB_TRAMPOLINE(parent, size) declaration, where parent refers to the parent class and size is at least as big as the number of NB_OVERRIDE_*() calls. nanobind caches information to enable efficient function dispatch, for which it must know the number of trampoline "slots". Example:

struct PyAnimal : Animal {
    NB_TRAMPOLINE(Animal, 1);

    std::string name() const override {
        NB_OVERRIDE(std::string, Animal, name);
    }
};

Trampoline declarations with an insufficient size may eventually trigger a Python RuntimeError exception with a descriptive label, e.g.
nanobind::detail::get_trampoline('PyAnimal::what()'): the trampoline ran out of slots (you will need to increase the value provided to the NB_TRAMPOLINE() macro)!
- Type casters. The API of custom type casters has changed significantly. In a nutshell, the following changes are needed:
  - load() was renamed to from_python(). The function now takes an extra uint8_t flags argument (instead of bool convert, which is now represented by the flag nanobind::detail::cast_flags::convert). A cleanup_list * pointer keeps track of Python temporaries that are created by the conversion and that need to be deallocated after a function call has taken place. flags and cleanup should be passed to any recursive usage of type_caster::from_python().
  - cast() was renamed to from_cpp(). The function takes a return value policy (as before) and a cleanup_list * pointer.
Both functions must be marked as noexcept. In contrast to pybind11, errors during type casting are only propagated using status codes. If a severe error condition arises that should be reported, use Python warning API calls for this, e.g. PyErr_WarnFormat().
Note that the cleanup list is only available when from_python() or from_cpp() are called as part of function dispatch, while usage by nanobind::cast() sets cleanup to nullptr. This case should be handled gracefully by refusing the conversion if the cleanup list is absolutely required.
The std::pair type caster may be useful as a reference for these changes.
- The following types and functions were renamed:

| pybind11 | nanobind |
| --- | --- |
| error_already_set | python_error |
| type::of<T> | type<T> |
| type | type_object |
| reinterpret_borrow | borrow |
| reinterpret_steal | steal |
| custom_type_setup | type_callback |

- New features.
  - Unified DLPack/Buffer protocol integration: nanobind can retrieve and return tensors using two standard protocols: DLPack and the buffer protocol. This enables zero-copy data exchange of CPU and GPU tensors with array programming frameworks such as NumPy, PyTorch, TensorFlow, and JAX.
Details on using this feature can be found here.
  - Supplemental type data: nanobind can store supplemental data along with registered types. This information is co-located with the Python type object. One use case for this fairly advanced feature is libraries that register large numbers of different types (e.g. flavors of tensors). A single generically implemented function can then query this supplemental information to handle each type slightly differently.

struct Supplement {
    ... // should be a POD (plain old data) type
};

// Register a new type Test, and reserve space for sizeof(Supplement)
nb::class_<Test> cls(m, "Test", nb::supplement<Supplement>());

// Mutable reference to 'Supplement' portion in Python type object
Supplement &supplement = nb::type_supplement<Supplement>(cls);

  - Low-level interface: nanobind exposes a low-level interface to provide fine-grained control over the sequence of steps that instantiates a Python object wrapping a C++ instance. Like the above point, this is useful when writing generic binding code that manipulates nanobind-based objects of various types.
Details on using this feature can be found here.
  - Python type wrappers: The nb::handle_t<T> type behaves just like the nb::handle class and wraps a PyObject * pointer. However, when binding a function that takes such an argument, nanobind will only call the associated function overload when the underlying Python object wraps a C++ instance of type T.
Similarly, the nb::type_object_t<T> type behaves just like the nb::type_object class and wraps a PyTypeObject * pointer. However, when binding a function that takes such an argument, nanobind will only call the associated function overload when the underlying Python type object is a subtype of the C++ type T.
  - Raw docstrings: In cases where absolute control over docstrings is required (for example, so that complex cases can be parsed by a tool like Sphinx), the nb::raw_doc attribute can be specified for functions. In this case, nanobind will skip generation of a combined docstring that enumerates overloads along with type information.
Example:

m.def("identity", [](float arg) { return arg; });
m.def("identity", [](int arg) { return arg; },
      nb::raw_doc(
          "identity(arg)\n"
          "An identity function for integers and floats\n"
          "\n"
          "Args:\n"
          "    arg (float | int): Input value\n"
          "\n"
          "Returns:\n"
          "    float | int: Result of the identity operation"));

Writing detailed docstrings in this way is rather tedious. In practice, they would usually be extracted from C++ headers using a tool like pybind11_mkdoc.
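The in-place construction described under "Custom constructors" in the list above can be tried out without any binding library. The following is a minimal sketch in plain C++17: the raw buffer stands in for the memory that nanobind allocates alongside the Python object, and Counter, init_counter, and construct_and_read are hypothetical names introduced only for illustration.

```cpp
#include <new>
#include <stdexcept>

struct Counter {
    int value;
    explicit Counter(int start) : value(start) {
        if (start < 0)
            throw std::invalid_argument("start must be non-negative");
    }
};

// Mirrors the shape of an "__init__" lambda: receives a pointer to
// uninitialized, suitably aligned memory and constructs the object there.
// It either returns normally (object fully constructed) or throws.
inline void init_counter(Counter *self, int start) {
    new (self) Counter(start);
}

// Constructs a Counter inside caller-provided storage, reads it, destroys it.
inline int construct_and_read(int start) {
    alignas(Counter) unsigned char storage[sizeof(Counter)];
    init_counter(reinterpret_cast<Counter *>(storage), start); // no heap allocation
    Counter *obj = std::launder(reinterpret_cast<Counter *>(storage));
    int result = obj->value;
    obj->~Counter(); // objects created via placement new need a manual destructor call
    return result;
}
```

If the constructor throws, no object exists in the buffer and no destructor must run, which matches the "return normally or fail by raising an exception" contract described for the nanobind lambda.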
Please use the following BibTeX template to cite nanobind in scientific discourse:
@misc{nanobind,
author = {Wenzel Jakob},
year = {2022},
note = {https://github.com/wjakob/nanobind},
title = {nanobind -- Seamless operability between C++17 and Python}
}