Skip to content

Latest commit

 

History

History
260 lines (201 loc) · 11.8 KB

RATIONALE.adoc

File metadata and controls

260 lines (201 loc) · 11.8 KB

Rational: Or why am I bothering to rewrite nanomsg?

Note
You might want to review Martin Sustrik’s rationale for nanomsg vs. ZeroMQ.

Background

I became involved in the nanomsg community back in 2014, when I wrote mangos as a pure Golang implementation of the wire protocols behind nanomsg. I did that work because I was dissatisfied with ZeroMQ’s licensing and C++ baggage, and I needed something that would work with Golang on Solaris, which at the time lacked support for cgo (so I could not just use an FFI binding.) This gave me a lot of detail about the internals of nanomsg and the SP protocols. At the time, it was the only alternate implementation those protocols.

It would not be wrong to say that one of the goals of mangos was to teach me about golang. It was my first non-trivial golang project.

While working with mangos, I wound up implementing a number of additional features, such as a TLS transport, the ability to bind to wild card ports, and the ability to determine more information about the sender of a message. This was incredibly useful in a number of projects.

I initially looked at nanomsg itself, as I wanted to add a TLS transport to it, and I needed to make some bug fixes (for protocol bugs for example), and so forth.

State Machine Madness

What I ran into was a challenging mess of state machines. nanomsg has dozens of state machines, many of which feed into others, such that tracking flow through the state machines is incredibly painful.

Worse, these state machines are designed to be run from a single worker thread. This means that a given socket is entirely single theaded; you could in theory have dozens, hundreds, or even thousands of connections open, but they would be serviced only by a single thread. (Admittedly non-blocking I/O is used to let the OS kernel calls run asynchronously perhaps on multiple cores, but nanomsg itself runs all socket code on a single worker thread.)

There is another problem too — the inproc code that moves messages between one socket and another was incredibly racy. This is because the two sockets have different locks, and so dealing with the different contexts was tricky (and consequently buggy). (I’ve since, I think, fixed the worst of the bugs here, but not after many hours of pulling out hair.)

The state machines also make fairly linear flow really difficult to follow. For example, there is a state machine to read the header information. This may come a byte a time, and the state machine has to add the bytes, check for completion, and possibly change state, even if it is just reading a single 32-bit word. This is a lot more complex than most programmers are used to, such as read(fd, &val, 4).

Now to be fair, Martin Sustrik had the best intentions when he created the state machine model around which nanomsg is built. I do think that from experience this is one of the most dense and unapproachable parts of nanomsg, in spite of the fact that Martin’s goal was precisely the opposite. I consider this a "failed experiment" — but hey failed experiments are the basis of all great science.

Thread Challenges

While nanomsg is mostly internally single threaded, I decided to try to emulate the simple architecture of mangos using system threads. (mangos has the benefit of goroutines.) Having been well and truly spoiled by Solaris/illumos threading (and especially Solaris/illumos kernel threads), I thought this would be a reasonable architecture.

Sadly, this initial effort, while it worked, scaled incredibly poorly — even so-called "modern" operating systems like macOS 10.12 and Windows 8.1 simply melted or failed entirely when creating any non-trivial number of threads. (To me, creating 100 threads should be a no-brainer, especially if one limits the stack size appropriately. I’m used to be able to create thousands of threads without concern. As I said, I’ve been spoiled. If your system falls over at a mere 200 threads I consider it a toy implementation of threading. Unfortunately most of the mainstream operating systems are therefore toy implementations.)

Chalk up another failed experiment.

I did find another approach which is discussed further.

File Descriptor Driven

Most of the underlying I/O in nanomsg is built around file descriptors, and it’s internal usock structure, which is also state machine driven. This means that implementing new transports which might need something other than a file descriptor, is really non-trivial. This stymied my first attempt to add OpenSSL support to get TLS added — OpenSSL has it’s own struct BIO for this stuff, and I could not see an easy way to convert nanomsg’s usock stuff to accomodate the struct BIO.

Poll

In order to support use in event driven programming, asynchronous situations, etc. nanomsg offers non-blocking I/O. In order to make this work for end-users, a notification mechanism is required, and nanomsg, in the spirit of following POSIX, offers a notification method based on poll(2) or select(2).

In order for this to work, it offers up a selectable file descriptor for send and another one for receive. When events occur, these are written to, and the user application "clears" these by reading from them. (This is done on behalf of the application by nanomsg’s API calls.)

This means that in addition to the context switch code, there are not fewer than 2 extra system calls executed per message sent or received, and on a mostly idle system as many as 3. This means that to send a message from one process to another you may have to execute up to 6 extra system calls, beyond the 2 required to actually send and receive the message.

There are cases where this file descriptor logic is easier for existing applications to integrate into event loops (e.g. they already have a thread blocked in poll().)

But for many cases this is not necessary. A simple callback mechanism would be far better, with the FDs available only as an option for code that needs them. This is the approach that we have taken with nng.

As another consequence of our approach, we do not require file descriptors for sockets at all, so it is possible to create applications containing many thousands of inproc sockets with no files open at all. (Obviously if you’re going to perform real I/O to other processes or other systems, you’re going to need to have the underlying transport file descriptors open.)

POSIX APIs

Another of Martin’s goals, which seems worthwhile at first, was the attempt to provide a familiar BSD POSIX API. As a C programmer, this really attracted me.

The problem is that the POSIX APIs are actually really horrible. In particular the semantics around cmsg are about as arcane and painful as one can imagine. Largely, this has meant that extensions to the cmsg API simply have not occurred in nanomsg.

The cmsg API specified by POSIX is as bad as it is because POSIX had requirements not to break APIs that already existed, and they needed to shim something that would work with existing implementations, including getting across a system call boundary. nanomsg has never had such constraints.

Attempting to retain low numbered "socket descriptors" had its own problems — a huge source of use-after-close bugs, which made the use of nn_close() incredibly dangerous for multithreaded sockets. (If one thread closes and opens a new socket, other threads still using the old socket might wind up accessing the "new" socket without realizing it.)

The other thing is that BSD socket APIs are super familiar to UNIX C programmers — but experience with nanomsg has taught us already that these are actually in the minority of nanomsg’s users. Most of our users are coming to us from C++ (object oriented), Java, and Python backgrounds. For them the BSD sockets API is frankly somewhat bizarre and alien.

With nng, we realized that constraining ourselves to the mistakes of the POSIX API was hurting rather than helping. So nng provides a much friendlier interface for getting properties associated with messages.

In nng we also generally try hard to avoid reusing an identifier until no other option exists. This generally means most applications won’t see socket reuse until billions of other sockets have been opened.

Compatibility

Of course, there are a number of existing nanomsg consumers "in the wild" already. It is important to continue to support them. So I decided from the get go to implement a "compatibility" layer, that provides the same API, and as much as possible the same ABI, as legacy nanomsg. However, new features and capabilities would not necessarily be exposed to the the legacy API.

Today nng offers this. You can relink an existing nanomsg binary against nng instead of nanomsg, and it usually Just Works. Source compatibility is almost as easy, although the application code needs to be modified to use different header files. (Note: I am considering changing this in the future.)

Asynchronous IO

As a consequence of our experience with threads being so unscalable, we decided to create a new underlying abstraction modeled largely on Windows IO completion ports. (As bad as so many of the Windows APIs are, the IO completion port stuff is actually pretty nice.) Under the hood in nng all I/O is asynchronous, and we have struct nni_aio objects for each pending I/O. These have an associated completion routine.

The completion routines are usually run on a separate worker thread (we have many such workers; in theory the number should be tuned to the available number of CPU cores to ensure that we never wait while a CPU core is available for work), but they can be run "synchronously" if the I/O provider knows it is safe to do so (for example the completion is occuring in a context where no locks are held.)

There is still a lot of performance tuning work to do, and there is a real need to expose the AIO structures to user code, which is something that I will be doing in the coming days. This should enable a much faster and lighter weight I/O model, especially for event driven programs.

New Transports (ZeroTier)

The other, most critical, motivation behind nng was to enable an easier creation of new transports. In particular, one client (Capitar IT Group BV) contracted the creation of a ZeroTier transport for nanomsg.

After beating my head against the state machines some more, I finally asked myself if it would not be easier just to rewrite nanomsg using the model I had created for mangos.

In retrospect, I’m not sure that the answer was a clear and definite yes in favor of nng, but for the other things I want to do, it has enabled a lot of new work. The ZeroTier transport was created with a relatively modest amount of effort, in spite of being based upon a connectionless transport. I do not believe I could have done this easily in the existing nanomsg.

In the coming weeks I expect to create transports for TLS, websocket, and websocket over TLS. I expect the websocket transport to support multiple sockets with different path components bound to the same TCP port. This might be possible with nanomsg, but it wouldn’t be easy. With nng, it looks fairly straight-forward.

Towards nanomsg 2.0

It is my intention that nng ultimately replace nanomsg. I do think of it as "nanomsg 2.0". In fact "nng" stands for "nanomsg next generation" in my mind. Some day before too long I’m hoping that the various website references to nanomsg my simply be updated to point at nng. It is not clear to me whether at that time I will simply rename the existing code to nanomsg, nanomsg2, or leave it as nng. (One possible issue is that Martin Sustrik retains the trademark for "nanomsg".)