Currently we have a bit of a brittle setup with how we sync these, and recently we saw major issues when upstream rate limiting was introduced. We sync these every 6 hours, but we perform those requests as quickly as we can, which means we can easily (and unintentionally) overwhelm upstream servers. Similar to how you would load balance request load over a set of servers, we should load balance these operations over time.
Approach

It's probably not the most practical to litter `time.sleep()` everywhere, so a middle ground is to bucket sets of users into smaller groups synced at certain intervals. We'd run a task at intervals determined by the number of buckets, and that task would sync the users in the bucket for that time slot. We'd probably need to account for some kind of failure-healing too. There are a few options I can see for bucketing:
Modulo Bucketing
A simple implementation would probably be to assign users into buckets with something like `user.id % num_buckets`. That would work, but whenever we change the number of buckets (increasing it reduces request density), every user's time slot would get shuffled, which means some users would get synced less frequently for a while, and others more frequently.
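As a quick illustration of that reshuffle (a sketch with made-up bucket counts, not our real configuration):

```python
# Sketch: modulo bucketing, showing how many users change time slots
# when the bucket count changes (8 -> 12 here is purely illustrative).
def modulo_bucket(user_id: int, num_buckets: int) -> int:
    return user_id % num_buckets

user_ids = range(100)
before = [modulo_bucket(uid, 8) for uid in user_ids]
after = [modulo_bucket(uid, 12) for uid in user_ids]
moved = sum(1 for b, a in zip(before, after) if b != a)
print(f"{moved}/100 users changed time slots")  # -> 64/100 users changed time slots
```

Roughly two thirds of users land in a different slot from a modest capacity change.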
Hash Bucketing
A better option would be to make the timing of syncs consistent even across bucket sizing changes, so that each user's data syncs on a consistent interval. This is a method used in distributed systems to spread load, which is basically what we're doing, but over time. We'd do this by bucketing the users with consistent hashing, which gives us a fairly uniform distribution while giving each user a deterministic time slot for their sync. This works by hashing a value (`user.id` for us) and then assigning that hash to a bucket. In our case, we'd take the integer value of that hash and normalize it to a time of day. Here's some Jupyter notebook code that does this for a sequence of 100k integer ids:
```python
import hashlib

import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline

hashes = np.array([
    int.from_bytes(hashlib.md5(str(x).encode("utf-8")).digest(), 'big')
    for x in range(100000)
], dtype=np.float64)

seconds_in_day = 24 * 60 * 60
hashes /= float(2 ** 128)  # md5 is 128-bit, normalize the values to a 0..1 range
hashes *= seconds_in_day

bucket_size_seconds = 60 * 15
num_buckets = seconds_in_day // bucket_size_seconds
plt.hist(hashes, bins=num_buckets, range=(0, seconds_in_day))
```
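To turn that normalized hash into a schedule, we can map each user straight to one of the 15-minute slots. A sketch (`sync_slot` is a hypothetical helper, not existing code):

```python
import hashlib

SECONDS_IN_DAY = 24 * 60 * 60
BUCKET_SIZE_SECONDS = 60 * 15
NUM_BUCKETS = SECONDS_IN_DAY // BUCKET_SIZE_SECONDS  # 96 slots per day

def sync_slot(user_id: int) -> int:
    """Deterministic 15-minute slot (0..95) for this user's daily sync."""
    digest = int.from_bytes(hashlib.md5(str(user_id).encode("utf-8")).digest(), 'big')
    seconds = digest / float(2 ** 128) * SECONDS_IN_DAY  # normalize to 0..86400
    return int(seconds // BUCKET_SIZE_SECONDS)
```

Because the slot depends only on the hash of the id, resizing buckets or adding users never shuffles anyone else's slot.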
This plotted pretty uniformly (histogram not reproduced here; the x-axis is seconds in the day).
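Putting it together, the periodic task might look something like this sketch, assuming a task runner that fires every 15 minutes and a hypothetical `sync_user` callable:

```python
import hashlib
from datetime import datetime, timezone

SECONDS_IN_DAY = 24 * 60 * 60
BUCKET_SIZE_SECONDS = 60 * 15
NUM_BUCKETS = SECONDS_IN_DAY // BUCKET_SIZE_SECONDS

def bucket_for(user_id: int) -> int:
    """Consistent-hash a user id into one of NUM_BUCKETS time slots."""
    h = int.from_bytes(hashlib.md5(str(user_id).encode("utf-8")).digest(), 'big')
    return int(h / float(2 ** 128) * NUM_BUCKETS)

def run_sync_task(user_ids, sync_user, now=None):
    """Runs once per slot; syncs only the users whose bucket is the current slot."""
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    slot = int((now - midnight).total_seconds() // BUCKET_SIZE_SECONDS)
    for uid in user_ids:
        if bucket_for(uid) == slot:
            sync_user(uid)
```

Failure-healing could then be a matter of re-queuing a user into the next slot when their sync fails, rather than retrying immediately.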