Skip to content

Commit

Permalink
Cancel-able reservations for ratelimiters (cadence-workflow#6030)
Browse files Browse the repository at this point in the history
Implements a fix for cadence-workflow#6029, allowing _normal_ ratelimiter usage to reserve tokens and be able to cancel them, assuming no racing claims on the limiter.

In the course of benchmarking this, I discovered what I believe to be some fundamental flaws with how `*rate.Limiter` works, which effectively _require_ a wrapper like I've built here.  So unless this has unsustainable latency or CPU costs, IMO it should be preferred over the underlying `*rate.Limiter` on correctness alone.

The good news is that latency seems entirely survivable, less than 2x slower in basically all cases (and usually much less).
Surprisingly, it's even slightly _faster_ for one of our common uses (reserve -> cancel, used in multi-stage ratelimits, which I suspect we will use more heavily in the future).

# What this wrapper does

A few core things:
1. re-implements `*rate.Limiter` in terms of an interface, to allow mocks and wrappers (`*rate.Limiter` has `Reserve() *rate.Reservation` which can't be mocked in any reasonable way)
2. implements ^ the interface while using a `TimeSource` so tests can resist CPU noise more easily (will also require a context-wrapper in some cases, which will be done later when needed.  there's a simple prototype in this commit.)
3. prevents `*rate.Limiter`-internal time from going backwards (leads to irrational behavior / violated limits)
4. synchronously cancels any reservations that are not immediately available (more efficient, matches our intended use, and has much more correct throughput when contended)

# The flaw

Time travel does weird things.

Essentially, it's a combination of:
- `*rate.Limiter`'s *use of* `time.Now()` is not monotonic, despite `time.Now()` *itself* being monotonic: `Now()` is collected _before_ the ratelimiter's mutex is held, so values may be arbitrarily delayed and out of order while waiting to acquire the mutex.
- `*rate.Limiter` allows its internal time to advance _and rewind_ by design, but it does not preserve enough information to e.g. rewind 1s and then advance 1s without affecting the next call.
- as a mitigating factor on all this, `*rate.Limiter` only updates its `last` time when an event is _allowed_.  this means ^ this essentially only applies if time skew is greater than your limiter's rate, especially if it's >=2x or other time-changing methods are called.

With fast benchmarks this is very easy to trigger, IRL... I'm unsure.  Some of our limits are set to >100,000 per second, or <10µs of mutex-imposed delay tolerated.  With a tight loop in a benchmark I regularly exceed 100µs of latency between locks even without parallel calls (zero contention), so I suspect it's not all that rare.

In any case, because it explicitly allows time to rewind and the simpler methods do not acquire the lock before gathering `time.Now()`, concurrent use can lead to the following behavior.  Assuming the limit refills a token every 5 time-periods, and the calls acquire the mutex in the following global order (as observed through the limiter's mutex) but with their original T=time.Now() values:
1. `Allow()` called at T0: successfully claims a token, emptying the token bucket.
    - The limiter's internal time, `last`, is now T0.
3. `Allow()` called at T1: fails, 0 tokens available.
    - The limiter's internal time does not advance, nothing changes.
5. `Allow()` called at T10: successfully claims a token.
    - The limiter's internal time advances from T0 to T10, and adds 2 tokens.
    - The 1 full token is claimed, the call is allowed, and 1 token remains after the call.
6. `Allow()` called at T0: successfully claims a token.
    - The limiter's internal time advances from T10 to T0.  No tokens have been added because the new time is lower.
    - There is 1 token in the bucket, so the call is allowed.  0 tokens remain.
7. `Allow()` called at T5: successfully claims a token.
    - The limiter's internal time advances from T0 to T5, and adds a token to the bucket.
    - The 1 full token is claimed, the call is allowed, and 0 tokens remain after the call.

From an outside observer who knows "real" time has only moved from T0 to T10, _four_ calls have been allowed when only three should have been (one each at T0, T5, and T10).  This can happen _any number of times_ between real-T0 and real-T10 if more old/new stale-time-alternating calls occur (especially with larger skew), and some benchmarks show this allowing _thousands_ of times more events than it should.

# The fix

Never allow the internal time to rewind.

It's pretty straightforward conceptually, and I am _absolutely thrilled_ that `*rate.Limiter` has these "pass in a time" APIs: they allow this kind of wrapper, both for mocking time and for implementing the fix.  Implementing it is a bit subtle and fiddly (as shown in this PR), but thankfully possible.

This eliminates rapid thrashing (all old-time events simply occur "at the same time"), and as long as "real" time is used, it eventually converges on real-time-like throughput because real time keeps advancing at a steady rate when viewed at larger time scales.  At small time scales there is no correct time available anyway, so only larger-scale behavior is important.

I've also chosen to "pin" reservations' time, because this allows canceling them well after they are claimed _if no other time-advancing calls have occurred_.

Conceptually I believe this is sound, and benchmarks show essentially perfect timing between allowed events:
- Events that occur in the real-past / at the "same" time as the reservation (or after) can be allowed to use the returned token because no possibly-over-limit calls have occurred after it (in fact, none at all).
    - If using `time.Now()`, this token likely could not be returned because time has usually advanced, and canceling does nothing when that occurs.
    - You can see an isolated version of this in the pinned-time benchmarks, notice how its sequential call throughput is essentially perfect.
- If other calls have occurred "after" the reservation, returning a consumed token cannot be guaranteed safe / may not correctly limit throughput, so it is not returned.
    - e.g. burst-1 limit-1/s and allowing late cancels would allow: reserve (0 delay) -> wait 1s -> Allow() (true) -> cancel() -> Allow() (true), which means 2 allowed events in the same second, violating the limit.

# A simpler fix we can contribute upstream

Lock before acquiring `time.Now()`, not after.

For "normal" use (`Wait(ctx)` and `Allow()`), this would ensure that time progresses monotonically, which seems like an obvious assumption for users.  Even high frequency events should behave correctly in this case.

Even if this is done, this wrapper still has some use to protect against rewinds when canceling while recovering more tokens.  In fact, the entire wrapper would likely be completely unchanged, because we have no way to know what the internal time is when canceling...

... but all non-wrapper and non-at-time-passing users Go-ecosystem-wide would be much more correct.  And adding some `AllowN(int)` (not `AllowN(time, int)`) time-free non-singular APIs would likely take care of the VAST majority of the remaining cases, leaving only Reserve-users to suffer through the details.

(I would need to think harder about it, but changing how CancelAt works to compare against `r.lim.last` might resolve it for cancellation too)
  • Loading branch information
Groxx authored May 28, 2024
1 parent 5f45da5 commit f8765f0
Show file tree
Hide file tree
Showing 6 changed files with 3,846 additions and 0 deletions.
Loading

0 comments on commit f8765f0

Please sign in to comment.