Evenly distribute enrollment/certificate/grade refreshes over time #4641

rhysyngsun opened this issue Sep 3, 2020 · 0 comments
Currently we have a somewhat brittle setup for how we sync enrollments, certificates, and grades, and we recently saw major issues when upstream rate limiting was introduced. We sync every 6 hours, but we perform those requests as quickly as we can, which means we can easily (and unintentionally) overwhelm upstream servers. Similar to how you would load balance requests over a set of servers, we should load balance these operations over time.

  • Add redbeat, so celery cron jobs run reliably
  • Distribute syncing operations over time

Approach

It's probably not practical to litter time.sleep() calls everywhere, so a middle ground is to bucket users into smaller groups that are synced at set intervals. We'd run a task at intervals determined by the number of buckets, and that task would sync the users in the bucket for that time slot. We'd probably need to account for some kind of failure healing too. There are a few options I can see for bucketing:
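The scheduling side could be as simple as mapping the current time of day to a bucket index. A minimal sketch (the function name and 15-minute bucket size here are my assumptions, not settled decisions):

```python
from datetime import datetime, timezone

BUCKET_SIZE_SECONDS = 60 * 15  # assumed: one bucket per 15-minute slot

def current_bucket(now=None, bucket_size=BUCKET_SIZE_SECONDS):
    """Return the index of the time-of-day bucket for the current moment."""
    now = now or datetime.now(timezone.utc)
    seconds_into_day = now.hour * 3600 + now.minute * 60 + now.second
    return seconds_into_day // bucket_size
```

A periodic task (e.g. a celery beat entry firing every 15 minutes) would then sync only the users assigned to `current_bucket()`.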

Modulo Bucketing

A simple implementation would be to assign users to buckets with something like user.id % num_buckets. That would work, but whenever we change the number of buckets (increasing the count reduces request density), every user's time slot would get shuffled, which means some users would get synced less frequently for a while and others more frequently.
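The reshuffling is easy to demonstrate: even a small change to the bucket count moves almost every user (the bucket counts below are just illustrative):

```python
# How many users change time slots when we go from 96 buckets
# (15-minute slots) to 97 under modulo bucketing?
user_ids = range(100000)

before = {uid: uid % 96 for uid in user_ids}
after = {uid: uid % 97 for uid in user_ids}

moved = sum(1 for uid in user_ids if before[uid] != after[uid])
print(f"{moved / len(user_ids):.0%} of users changed time slots")
# → prints "99% of users changed time slots"
```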

Hash Bucketing

A better option would be to keep the timing of syncs consistent even across bucket sizing changes, so that each user's data syncs on a consistent interval. We'd do this by bucketing users with consistent hashing, a method used in distributed systems to spread load, which is essentially what we're doing, but over time. Hashing gives us a fairly uniform distribution while giving each user a deterministic time slot. It works by hashing a value (user.id for us) and assigning that hash to a bucket; in our case, we'd take the integer value of the hash and normalize it to a time of day. Here's some Jupyter notebook code that does this for a sequence of 100k integer ids:

import hashlib
from matplotlib import pyplot as plt
import numpy as np

%matplotlib inline

hashes = np.array([
    int.from_bytes(hashlib.md5(str(x).encode("utf-8")).digest(), 'big')
    for x in range(100000)
], dtype=np.float64)  # np.float is deprecated/removed; use np.float64

seconds_in_day = 24 * 60 * 60

hashes /= float(2**128) # md5 is 128-bit, normalize the values to a 0..1 range
hashes *= seconds_in_day

bucket_size_seconds = 60 * 15
num_buckets = seconds_in_day // bucket_size_seconds

plt.hist(hashes, bins=num_buckets, range=(0,seconds_in_day))

This plotted pretty uniformly:

[histogram of hash-assigned time slots; x-axis is seconds in day]
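Pulling the same math out of the notebook, a per-user slot function might look like this (the function name is mine, not an existing helper):

```python
import hashlib

SECONDS_IN_DAY = 24 * 60 * 60
BUCKET_SIZE_SECONDS = 60 * 15  # assumed 15-minute slots

def sync_slot(user_id, bucket_size=BUCKET_SIZE_SECONDS):
    """Deterministically map a user id to a time-of-day bucket index."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    fraction = int.from_bytes(digest, "big") / 2**128  # md5 is 128-bit; normalize to 0..1
    second_of_day = int(fraction * SECONDS_IN_DAY)
    return second_of_day // bucket_size
```

Note the property we want: the user's second-of-day is fixed by the hash, so resizing the buckets only changes the granularity of the slot, not when in the day the user syncs.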
