The standard approach to parallelism is to use threads. This model, however, has many performance drawbacks, so Seastar uses a different model, which is explained below.
Threads, as well as processes, are an abstraction provided by the operating system. Instead of having one or a small fixed number of processors, the operating system allows the user to create as many virtual processors as they like, and multiplexes those virtual processors on top of the physical processors. These virtual processors are called threads (if they share memory with each other) or processes (if they do not).
The simplest approach to threading is thread-per-connection. For every connection that needs to be served, a thread is created, running a read-process-respond loop in that thread. This approach has a number of issues, limiting it to the simplest applications:
- With a large number of connections, many threads are created. Threads are heavyweight objects so allocating many of them consumes resources (mostly, memory).
- If many connections are active at the same time, many threads will run concurrently. The operating system will be forced to switch between the threads quickly; as this is an expensive operation, performance will drop.
- It is hard to guarantee timely processing of an event. If the system has too many threads, they may all be busy, and the system may not schedule the thread serving our event in time. If there are too few threads, they may all be blocked on I/O, so the system may not be able to serve our request even though processor power is available. Getting just the right number of threads is a hard problem.
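Concretely, the thread-per-connection pattern just described might look roughly like the sketch below. This is plain POSIX sockets and std::thread, not Seastar code; the port number and the echo-style handler are placeholders, and error handling is omitted:

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <thread>

// Illustrative read-process-respond loop: read a request, "process" it,
// and respond (here it simply echoes the bytes back).
void serve_connection(int fd) {
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        write(fd, buf, n);
    }
    close(fd);
}

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
    listen(listener, 128);
    for (;;) {
        int conn = accept(listener, nullptr, nullptr);
        std::thread(serve_connection, conn).detach();  // one thread per connection
    }
}
```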
As a result, most threaded applications now use thread pools. Here, a large number of connections are multiplexed on top of a smaller number of threads (which are themselves multiplexed on top of a number of processors). When a connection becomes active, it is assigned to an idle thread from the pool, which reads a request, processes it, and responds.
These threaded designs, however, still have performance issues:
- Data that is shared among connections needs to be locked. Locks, in the worst case, cause excessive context switches and stalls. In the best case, they are expensive operations, and do not scale well on large machines with a dozen or more cores.
- Shared writable data will be accessed from threads running on multiple cores. This requires the processor to move the data from one core's cache to another's, a slow operation.
- Data allocated on one processor may be accessed or freed on another. With today's NUMA architectures, this imposes a severe penalty on memory access time and slows down the memory allocator.
Seastar uses sharding, or partitioning, to manage multiple cores. Instead of each core sharing responsibility for all connections and all data with all other cores, each core is assigned a subset of the connections and data on the machine. If a computation on a core needs access to data residing on another core, it must explicitly send a message to the remote core, asking it to read or write the data, and wait for the result.
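One way to express such a message in Seastar is seastar::smp::submit_to, sketched minimally below; the target core number (2) and the returned value are placeholders, and real code would pick the core that owns the data in question:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// Ask core 2 to run a lambda on our behalf.  The lambda executes on core 2,
// where it may freely touch data owned by that core; its return value is
// delivered back to the calling core wrapped in a future.
seastar::future<int> ask_core_two() {
    return seastar::smp::submit_to(2, [] {
        return 42;  // stand-in for reading or writing core-local data
    });
}
```

The caller continues when the future resolves (via .then() or co_await), so waiting for the remote core's answer never blocks the local core.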
To partition connections, Seastar automatically divides them among the cores. It utilizes the ability of modern Network Interface Cards (NICs) to provide a packet queue for each core and to automatically divert a subset of packets to those queues. As a result, each Seastar core receives a share of all connections, and all packets belonging to those connections are processed by that core.
Seastar cannot partition data automatically. The user must choose a partitioning method and divert processing to the correct core. Some partitioning strategies include:
- Hashed key: a key-value store, for example, may use the low-order bits of the key to select a core. This is suitable when you have a large number of objects that are usually accessed using a primary key.
- Replication: data is replicated across all cores. Reads are served from the local core, while modifications are broadcast to all cores. This strategy is useful for frequently read, rarely written data of small to moderate total size (both strategies are sketched in code below).
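As a rough illustration of both strategies: shard_data, owner_of, put, and replicate below are made-up names for this sketch; the Seastar facilities used are smp::count, smp::submit_to, and smp::invoke_on_all:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// One map per core: thread_local gives every Seastar shard its own instance,
// which only that shard ever touches directly.
static thread_local std::unordered_map<std::string, std::string> shard_data;

// Hashed key: the low-order bits of the key's hash select the owning core.
unsigned owner_of(const std::string& key) {
    return std::hash<std::string>{}(key) % seastar::smp::count;
}

// Route a write to whichever core owns the key.
seastar::future<> put(std::string key, std::string value) {
    return seastar::smp::submit_to(owner_of(key),
        [key = std::move(key), value = std::move(value)] () mutable {
            shard_data[std::move(key)] = std::move(value);
        });
}

// Replication: broadcast a rarely-changing value to every core, so later
// reads can be served from the local replica without any messaging.
seastar::future<> replicate(std::string key, std::string value) {
    return seastar::smp::invoke_on_all([key, value] {
        shard_data[key] = value;
    });
}
```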
One class of applications is particularly suited for partitioning -- the class of scale-out servers. These are already partitioned across nodes, so partitioning among cores merely extends the model with node-internal shards.
There are many benefits to sharded data and networking:
- Locality: a core always accesses data that was allocated and manipulated on the same core. This is beneficial for memory allocators, CPU caches, and for NUMA architectures.
- Locking: the need for locking is much reduced -- often there is no need for locking at all, since all access to a data item is implicitly serialized. When locking is needed (for example, when I/O is needed while processing data), it is accomplished using normal processor instructions rather than serialized atomic read-modify-write instructions.
Unfortunately, sharding is not without disadvantages:
- Not all applications are amenable to sharding; those applications cannot benefit at all.
- Imbalance among the cores due to uneven partitioning can result in some cores being overloaded while others are relatively idle, wasting resources.