
TTL cache & the thundering herd


TTL-based caching is a policy in which a data item, once retrieved, remains valid for a period known as the “time-to-live”. Memcached and Redis are my favoured in-memory TTL caches: key-value stores with the ability to set key expiry times. Memcached is simple and dependable, with LRU (least recently used) key ejection on memory overflow. Redis shines if you want a fast datastore that is more than just a cache, although it can be run in a similar LRU-bounded memory setup to Memcached.

Such caches are typically used when a heavyweight process executed on-demand is causing problems under peak traffic. An effective way to improve performance is to cache the process result with a TTL. To be cacheable, the result of this process must not be time-critical: for example dynamic content generation from a CMS, where a small delay to a content update is acceptable.

Use the right caching tool: if a single heavyweight DB query is causing problems, you should be looking at utilising the query cache in your DB engine. That might mean rounding dynamically inserted parameters like datetimes to the nearest 30s, so that the query text is stable enough to be cacheable.
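As a sketch of that rounding trick in Python (round_to_30s is a hypothetical helper; the same principle applies to whatever builds your query text):

```python
import datetime

def round_to_30s(dt):
    # Round a datetime down to the nearest 30-second boundary, so that
    # repeated queries within the window share identical text and can
    # be served from the DB engine's query cache.
    return dt - datetime.timedelta(seconds=dt.second % 30,
                                   microseconds=dt.microsecond)

stamp = datetime.datetime(2024, 1, 1, 12, 0, 44, 123456)
print(round_to_30s(stamp))  # 2024-01-01 12:00:30
```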

Vanilla TTL cache

A naive TTL cache implementation is shown below for request 1 (R1>, cold cache) and request 2 (R2>, warm cache):

    APP                       MEMCACHE

R1> data = null ------------> GET data
R1> data = longtime()
R1> CACHE data -------------> SET data TTL 30s
R1> display(data)

R2> data = xxx -------------> GET data
R2> display(data)
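The flow above can be sketched in Python. A plain dict stands in for Memcached here, and cache_get, cache_set and longtime are illustrative names, not real client calls:

```python
import time

# Minimal in-process stand-in for Memcached: key -> (value, expiry time).
store = {}

def cache_get(key):
    entry = store.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.monotonic() >= expires_at:  # past TTL: behave like a miss
        del store[key]
        return None
    return value

def cache_set(key, value, ttl):
    store[key] = (value, time.monotonic() + ttl)

def longtime():
    # Stand-in for the heavyweight process, e.g. CMS page generation.
    return "rendered page"

def handle_request():
    data = cache_get("data")         # R1: miss, R2: hit
    if data is None:
        data = longtime()            # every concurrent miss runs this
        cache_set("data", data, 30)  # SET data TTL 30s
    return data
```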

This will provide some benefit in terms of an overall reduction in the number of executions of longtime(), but it won’t actually increase your peak capacity. To show why, we need numbers and a timeline:

Over 2 seconds immediately following a cold start or the cache key expiring, there are 4 evenly spaced requests R1, R2, R3 and R4:

TIME    APP                       MEMCACHE

0s   R1> data = null ------------> GET data
     R1> data = longtime()
0.5s R2> data = null ------------> GET data
     R2> data = longtime()
0.6s R1> CACHE data -------------> SET data TTL 30s
     R1> display(data)
1s   R3> data = xxx -------------> GET data
     R3> display(data)
1.1s R2> CACHE data -------------> SET data TTL 30s
     R2> display(data)
1.5s R4> data = xxx -------------> GET data
     R4> display(data)

While the longtime() process from R1 is still running, R2 arrives and spawns another longtime() process because the cache is empty. Without any cache, each of the 60 requests arriving over 30s would call longtime(), and the max concurrent longtime() processes would be 2. With this cache implementation, longtime() will be called only twice over 30s, but the max concurrent longtime() processes is still 2. This burst of concurrent uncached requests every 30s as the cache expires is referred to as a “thundering herd” of uncached requests.

Peak capacity is constrained by max concurrency, so in theory it is unchanged with or without the cache. In reality this implementation will provide benefit across a range of different longtime() processes and randomly distributed traffic: the expiry times of different cache keys rarely line up, although they do on cold-cache startup.

Avoiding Uncached Traffic Bursts

Uncached bursts of requests in a vanilla implementation occur:

- on cold start, when the cache is empty
- when a cache key reaches its TTL and expires
- when a cache key is ejected by the LRU algorithm under memory pressure

You either need to avoid these situations in high traffic areas, or introduce a shared mutex to ensure a single longtime() process with further concurrent requests waiting for it to complete. I prefer the avoidance approach.
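For completeness, the mutex approach can be sketched in-process with Python’s threading.Lock; in a multi-host setup you would need a distributed lock instead (e.g. an atomic add on a lock key). The names store, populate_lock and longtime are illustrative:

```python
import threading
import time

store = {}                        # stand-in for the shared cache
populate_lock = threading.Lock()  # the shared mutex
calls = []                        # records each longtime() run, to demonstrate

def longtime():
    calls.append(1)
    time.sleep(0.05)              # simulate the heavyweight process
    return "rendered page"

def handle_request():
    data = store.get("data")
    if data is not None:              # warm cache: no locking needed
        return data
    # One request computes; concurrent misses block here until it finishes.
    with populate_lock:
        data = store.get("data")      # re-check after acquiring the lock
        if data is None:
            data = longtime()
            store["data"] = data
    return data
```

The re-check inside the lock is what ensures waiting requests read the freshly cached value rather than each running longtime() in turn.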

Ejection shouldn’t be a problem for high-traffic cache keys, which will be protected by the LRU algorithm. Cold start occurs after the cache is reset or a new cache is deployed, and can be avoided by gradually ramping up traffic from your load balancers to the new or reset cache, or by having a warm-up script that triggers population before deployment. Such techniques, along with multi-host cache resilience, are probably a topic for a different article so I won’t dwell on them.

To avoid expiry, you simply need to re-update the cache before the expiry time is reached…

Fixed Probability Cache Update

To update the cache before expiry we can use a probability-based update mechanism. On a cache hit, there is a small random probability that longtime() is executed anyway and the cache updated.

APP                       MEMCACHE

data = xxx -------------> GET data
IF (RANDOM() < 1%)
    data = longtime()
    CACHE data ---------> SET data TTL 30s

The million dollar question: what should the probability be? It’s tricky, and depends on traffic levels, expiry time, cache size, and how badly you need to avoid cache expiry. With traffic of 100 requests/s and a cache TTL of 30s, we expect ~3000 cached requests before the cache expires. You might choose a probability of 0.1%, as you’d then expect one in every 1000 requests to update the cache, i.e. every 10s.
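A sketch of the fixed-probability scheme, again with a dict standing in for Memcached and illustrative names (UPDATE_PROBABILITY, refreshes; the rng parameter exists only so the behaviour can be demonstrated deterministically):

```python
import random

UPDATE_PROBABILITY = 0.001  # 0.1%: at 100 req/s, ~one update every 10s
store = {}
refreshes = []              # records each early refresh, to demonstrate

def longtime():
    return "rendered page"

def cache_set(key, value):
    store[key] = value      # a real cache would also SET ... TTL 30s

def handle_request(rng=random.random):
    data = store["data"]                # GET data: assume a warm hit
    if rng() < UPDATE_PROBABILITY:      # small chance to refresh early
        data = longtime()
        cache_set("data", data)
        refreshes.append(1)
    return data

cache_set("data", longtime())           # warm the cache
```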

Variable Probability Cache Update

This answers the question “what should the probability be” by setting the probability extremely small when the cache has just been updated, with the probability increasing the closer the cache key gets to expiry. It does require that the expiry time is available to the application, which may mean storing it within the cached data blob.

APP                              MEMCACHE

data, expiryTime = xxx, yy ----> GET data
chance = 1 - (expiryTime - Now()) / TTL
IF (RANDOM() < chance)
    data = longtime()
    CACHE data ---------> SET data TTL 30s

The pseudocode above illustrates a simple linear increase in probability over time: in reality an exponential relationship will provide more consistent cache refresh rates over a range of traffic levels, but this is something you should load test against the performance of your own application.
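The linear variant can be sketched like this, storing the expiry time alongside the value as described above (names are illustrative, and rng again exists only for deterministic demonstration):

```python
import random
import time

TTL = 30.0
store = {}   # key -> (value, expiry_time): expiry kept alongside the blob

def longtime():
    return "rendered page"

def cache_set(key, value):
    store[key] = (value, time.monotonic() + TTL)

def handle_request(rng=random.random):
    value, expiry_time = store["data"]  # GET data: assume a warm hit
    # chance rises linearly from ~0 just after a refresh to 1 at expiry
    chance = 1 - (expiry_time - time.monotonic()) / TTL
    if rng() < chance:
        value = longtime()
        cache_set("data", value)
    return value

cache_set("data", longtime())           # warm the cache
```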

Out of Band Cache Update

While you are thinking about caching data generated on-request to improve the capacity of your system, you should step back and consider more fundamental changes to your application architecture. Turning the database inside out on the Confluent blog is an excellent read around this area. Performing heavy processes on-request results in a bad user experience (long request times): these long requests don’t go away with an on-request generated cache, they just become less frequent.

You should consider a push rather than pull based architecture. A change in the master dataset results in a change message, to which a process can subscribe to execute longtime() and push the result into a fast-access read projection datastore (the “cache”). This pattern is known as Command Query Responsibility Segregation (CQRS). The read projection is essential to the operation of your application, so it must be resilient and sized correctly: if you can only cope with your traffic while the TTL cache is operating anyway, you’re already in this situation.

Or consider making the application where the change is made responsible for denormalising and populating the read cache, i.e. longtime() is executed by the application through which the update to the underlying data is made. Less flexible than a CQRS approach, but easier to implement: in particular there is less burden around handling the asynchronous nature of CQRS.
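A minimal sketch of that write-time approach, with dicts standing in for the master datastore and the read projection (update_content, database and read_cache are illustrative names):

```python
database = {}    # stand-in for the master datastore
read_cache = {}  # the denormalised read projection (the "cache")

def longtime(raw):
    # Heavy denormalisation/rendering, now run at write time, not on request.
    return "rendered: " + raw

def update_content(key, raw):
    # The writing application owns cache population: every write pushes a
    # fresh projection, so reads never miss and never trigger longtime().
    database[key] = raw
    read_cache[key] = longtime(raw)

def handle_request(key):
    return read_cache[key]  # always a fast read: no TTL, no thundering herd
```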

Such approaches will become essential as you scale further.