
On Cache Invalidation: Why Is It Hard? (2018)

110 points by steventey about 4 years ago

16 comments

kazinator about 4 years ago
Invalidating caches is super easy when you don't *have* to do it. You just pick some element of the cache and blow it away; nothing is affected other than performance.

What is hard is invalidating a cache when an item *must* be removed from it, because it no longer exists outside the cache. If that is not done, you get a stale cache entry, which is incorrect and possibly breaks the system.

So now suppose the cache is multi-threaded, distributed and whatever. Now you have edge cases and race conditions. Is it okay if the item is gone, but the cache entry remains for just a little window of time? If not, can we remove it from the cache first, then for real? What if in between those two, something accesses it so it reappears in the cache? Pardon me, did I say *the* cache? I meant seventeen different caches throughout the distributed system ...
breakingcups about 4 years ago
> There are only two hard things in Computer Science: cache invalidation and naming things.

I've always heard it as "There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors."
brundolf about 4 years ago
The way I see it there are only two ways to perfectly invalidate a cache.

The first is to only cache the results of pure functions whose arguments can be perfectly equality-checked. But if your arguments aren't trivial to compare, this can incur some overhead even on cache hits. It also doesn't work for a stateful system, because not all relevant information gets passed in as arguments.

So the second way is to track every possible mutation to any data in the system, associate each piece of data with all caches that depend on it, and "push" invalidations to those caches when something changes. In the context of a front-end app, this is what MobX does. Salsa is a Rust library that does something similar and is used by rust-analyzer. Broadly this falls under the term "reactivity", though I've seen the word used to describe several different related concepts.

Most of the invalidation strategies people use in practice are imperfect and ad-hoc and depend on domain knowledge. Often you get a version of #2, but done manually instead of automatically: "I know that when X happens, we need to bust cache Y, so I'll manually code that behavior". This is the "hard" version that the adage refers to.
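The second, "push"-based approach can be sketched in a few lines of Python. This is a toy illustration with made-up names, not how MobX or Salsa actually work; both do far more (automatic dependency discovery, incremental recomputation):

```python
class ReactiveCache:
    """Toy push-based invalidation: each piece of data remembers which
    cache entries depend on it, and a mutation busts exactly those."""

    def __init__(self):
        self._cache = {}        # cache key -> cached value
        self._dependents = {}   # data id -> set of cache keys to bust

    def get(self, key, compute, deps):
        """Return the cached value, computing it and registering its
        declared dependencies on a miss."""
        if key not in self._cache:
            self._cache[key] = compute()
            for d in deps:
                self._dependents.setdefault(d, set()).add(key)
        return self._cache[key]

    def mutate(self, data_id):
        """Call whenever a piece of data changes: push the invalidation
        to every cache entry that depends on it."""
        for key in self._dependents.pop(data_id, set()):
            self._cache.pop(key, None)


data = {"x": 2, "y": 3}
cache = ReactiveCache()
total = cache.get("sum", lambda: data["x"] + data["y"], deps=["x", "y"])

data["x"] = 10
cache.mutate("x")  # busts the "sum" entry; the next get() recomputes
total = cache.get("sum", lambda: data["x"] + data["y"], deps=["x", "y"])
```

The manual, ad-hoc version described in the last paragraph is exactly a hand-written `mutate("x")` call that someone has to remember to make at every mutation site.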
TeMPOraL about 4 years ago
Here's my take.

If you could cache *all* the inputs that go into a *pure* function, you'd never need to invalidate the cache. Your cache becomes just a lazily-created tabulation of your function - which is fixed, as a map is just another representation of a mathematical function.

So cache invalidation happens only where your cache key doesn't contain all the parameters of the cached computation - your cache isn't a mathematical function. Now the problem is, in the real world, it's almost impossible to fully capture *all* the parameters of a computation in a way that gives you a usable cache (one that trades time/CPU for memory). So we need to track the possible changes in the "hidden state", and invalidate a cache entry if we detect such a change. That's where all the complexity sits.

For instance, take the problem of a function countWords(file), that counts words in a file. How would we cache its results? If we key by file name, the cache will be wrong when the file gets changed. Keying by the hash of file contents is out of the question, because hashing is at least O(n) with respect to file size, just as countWords() is, so we gain nothing. Keying by file name + modification date will fail if the clock changes, or the file gets updated faster than the time resolution we're using. Not to mention, we'll have false positives if the date changes but the contents stay the same.

But assuming that the filesystem is sane and nobody messes with the clock, what else could happen? The definition of what a "word" is can change, making the entire cache bad. We may have found a bug in countWords(), updated the code, and now the cached results are also bad. The contents of the cache may have been changed externally (e.g. if shared by multiple functions, or multiple processes using countWords(), with different definitions of "word" or with a partially deployed bugfix). The code of the cache itself may have been updated to fix a storage bug there. Etc.

At some point you have to decide which of those issues you care about and which you ignore - and for those you care about, how to monitor for changes, propagate them to the cache, and do it fast enough that none of the cache users will see invalid data in between detection and propagation. That's how cache invalidation is hard.
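The name-plus-modification-time option can be made concrete in a few lines of Python (a sketch, not a library; function and variable names are made up). It avoids the O(n) hash on a hit but inherits exactly the caveats above - it trusts the filesystem clock, and stale (path, mtime) entries linger until evicted:

```python
import os

_cache = {}  # (path, mtime in nanoseconds) -> word count


def count_words(path):
    """Count words in a file, caching by (name, mtime).

    Cheap on a hit (one stat call, no O(n) hashing), but wrong if the
    clock moves or the file changes within the timestamp resolution.
    """
    stat = os.stat(path)
    key = (path, stat.st_mtime_ns)
    if key not in _cache:
        with open(path) as f:
            # One fixed definition of "word"; if it changes, the whole
            # cache is silently wrong and must be versioned or cleared.
            _cache[key] = len(f.read().split())
    return _cache[key]
```

Note that even this sketch bakes in two of the hidden parameters discussed above: the definition of a "word" (here, `str.split`) and the code of `count_words` itself.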
jaza about 4 years ago
Start by doing the easiest thing: nuke all caches whenever anything changes! It's guaranteed to avoid all invalidation edge cases, and it's "good enough" performance-wise for many applications. If and when that proves insufficient, start implementing more fine-grained invalidation. Doing otherwise is just premature optimisation.
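The nuke-everything strategy really is only a few lines; this Python sketch (all names illustrative) shows why it is so easy to reason about - there is no per-entry invalidation logic at all:

```python
class NukeItAllCache:
    """Read-through cache that clears itself wholesale on any write.
    Trivially correct; only the hit rate suffers."""

    def __init__(self, load, save):
        self._load = load    # key -> fresh value from the backing store
        self._save = save    # (key, value) -> persist to the backing store
        self._cache = {}

    def read(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]

    def write(self, key, value):
        self._save(key, value)  # persist the change...
        self._cache.clear()     # ...then nuke everything: no stale entries


store = {"a": 1}
c = NukeItAllCache(load=store.get, save=store.__setitem__)
```

Every fine-grained scheme in this thread is an optimisation of `self._cache.clear()`.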
infogulch about 4 years ago
Adding a cache to improve Availability and Partition tolerance while pretending that your (now) distributed system is also Consistent is not "hard", it's *impossible*. Sorry, CAP applies to caches too.

You inflict this hard problem on yourself when you try to throw away the *time context* of the data. Just... don't do that. Contextualize data with a commit log sequence id and all of these problems vanish. If the cache has data valid up to sequence id XYZ, *include that context in the response* -- that's all a cache *can* say anyways:

    client: hey can I have /foo/bar
    cache:  sure, /foo/bar was 'abc' as of #XYZ
    (or)
    client: hey can I have /foo/bar as of at least #XYZ+7
    cache:  sure let me look that up...

And thus the cache invalidation problem is reduced to the naming problem, and you have at least one fewer off-by-one errors. Change my mind. :)
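The dialogue above can be sketched as a small sequence-id-aware cache in Python (all names hypothetical; a real system would attach this to an actual commit log):

```python
class SequencedCache:
    """A cache that keeps the time context: every answer carries the
    commit-log sequence id it is valid as of, and readers may demand a
    minimum sequence id instead of pretending the cache is consistent."""

    def __init__(self):
        self._data = {}  # key -> (value, seq id it was written at)
        self._seq = 0    # highest commit-log sequence id applied so far

    def apply(self, seq_id, key, value):
        """Apply one commit-log entry to the cache."""
        self._seq = seq_id
        self._data[key] = (value, seq_id)

    def get(self, key, min_seq=0):
        """Return (value, as_of_seq), or None if the cache hasn't yet
        seen min_seq - the caller must then look upstream."""
        if self._seq < min_seq or key not in self._data:
            return None
        value, _ = self._data[key]
        return (value, self._seq)
```

The cache never lies: it either answers "as of #seq" or declines, which is exactly the reduction to naming that the comment claims.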
kevincox about 4 years ago
The main problem is that sticking a cache system in front of a database turns it into a homemade distributed database. Often your use case is simple enough, and your consistency requirements weak enough, that it doesn't cause issues. But these homebrew systems are not made with the rigor and focus of a dedicated distributed database solution, and this will eventually bite you - hopefully gently, in a way that may make sense to just live with, but sometimes very painfully, and that is when you remember that cache invalidation is indeed hard.

It seems that the "correct" solution is building the cache into the database. Allow tying the cached result to the snapshot and inputs it was generated against, and use the regular controls for stale read tolerance to ensure that the cache entry is up-to-date. This even lets you use cache entries within a strongly consistent transaction (although staleness will probably hurt your hit rate).
bkuehl about 4 years ago
Interestingly enough, in my entire career so far, for things that weren't cached but should have been, or were already cached and working, the one invalidation point was the save page of the entity. Sometimes it's not that hard for the vast majority of cases.
stcredzero about 4 years ago
On the multi-core systems of today, why are we using memory and the cache as a communication device between threads running on different cores? Don't we have the spare transistors and space on the die to implement purpose-built communication hardware between CPU cores which does not fall victim to false sharing, and the like? Isn't the current setup a bit hacky?
surfsvammel about 4 years ago
I've been implementing simple caching for an enterprise system at a bank over the last couple of years. For us it has been very, very simple. The system has the following characteristics:

- It's a big system with distributed parts. At any given time any piece of data might in fact be different in different parts of the system (before any change is properly distributed to the entire system).

- Often a single piece of data is accessed a lot in very short succession.

Since the system is already built to handle the fact that the data is only eventually consistent, caching for performance gains has been very simple for us. We just implement a simple cache with a short lifetime. That's it.

Those are two characteristics that I try to look out for. Also, when I build something new, I try to build the system to be able to live in a world where data is not necessarily always the latest (as long as there is some kind of mechanism for getting notified that the data is out of sync when trying to change it, it seems to work).
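A short-lifetime cache like the one described is tiny in Python. This is a sketch (names made up), viable exactly under the stated assumption that staleness up to the TTL is acceptable:

```python
import time


class TTLCache:
    """Entries simply expire after a short lifetime; no invalidation
    protocol at all. Readers may see data up to ttl_seconds stale."""

    def __init__(self, load, ttl_seconds=5.0):
        self._load = load          # key -> fresh value from the source
        self._ttl = ttl_seconds
        self._cache = {}           # key -> (value, expiry deadline)

    def get(self, key):
        value, expires = self._cache.get(key, (None, 0.0))
        if time.monotonic() >= expires:
            # Missing or expired: refetch and stamp a new deadline.
            value = self._load(key)
            self._cache[key] = (value, time.monotonic() + self._ttl)
        return value
```

The hot-read-in-short-succession pattern is what makes this pay off: many reads land inside one TTL window, so the expensive load runs once per window per key.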
baq about 4 years ago
Cache layers are just (‘just’) distributed databases with narrow requirements. CAP theorem applies, etc. It shouldn’t be a surprise that it’s hard, but due to the aforementioned narrow application space it might be easy to disassociate caches from distributed systems.
moomin about 4 years ago
Relevant, or at least entertaining: http://thecodelesscode.com/case/220?lang=pt
eyelidlessness about 4 years ago
Cache invalidation isn't hard, it's just where hard things show up for easy problems. Cache invalidation in a nutshell:

For some given memory constraint, give me the same result with lookup performance for known inputs.

That's not hard because of memory constraints (excepting very specialized use cases), that's hard because the things that benefit from caching usually have implicit inputs. You're trying to cache "database query" where "database" has other stuff mutating state. You're not caching the call, you're caching the side effects it depends on.

This use case is trivially solved if you plan for it. Use the algorithms from a history-based system. Caching the state of a given object in git is easy: its cache key is its hash. Either that part of the tree changed or it didn't.

The key is to remove implicit dependencies from your cache path. Everything that you cache states its dependencies, and anything that changes in them reports back.

You can get more specific than that for finer-grained control, but we're already well past invalidation being the hard problem. It's just dependency granularity.
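The git-style, content-addressed idea fits in a few lines of Python (an illustration, not git's actual object model): the cache key *is* a hash of the explicit inputs, so there is nothing to invalidate - changed inputs simply hash to a different key.

```python
import hashlib
import json

_cache = {}  # content hash -> result


def cached_call(fn, *args):
    """Cache fn(*args) under a hash of the function's name and its
    JSON-serializable arguments. Works precisely when every dependency
    is an explicit argument; implicit state silently breaks it."""
    key = hashlib.sha256(
        json.dumps([fn.__name__, args], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = fn(*args)
    return _cache[key]
```

The failure mode is exactly the comment's point: if `fn` secretly reads a database or a global, that input isn't in the hash, and you are back to the hard problem.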
bitforger about 4 years ago
If I had a dollar for every caching bug I've encountered, I'd be a rich man.
throw0101a about 4 years ago
Ah, the old saying:

* There are only two things that are hard in computer science: cache invalidation, naming things, and off-by-one errors.
slt2021 about 4 years ago
My approach to caching:

1. Cache expensive computation on read.

2. Invalidate the cache entry on update/delete.

For CRUD-like applications it is trivial, since you have separate interfaces for read (GET) and update (POST).
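Those two rules map one-to-one onto the two handlers of a CRUD endpoint; a Python sketch (handler and helper names are made up):

```python
def expensive_render(value):
    """Stand-in for the expensive computation being cached."""
    return value.upper()


db = {}     # pretend database
cache = {}  # key -> expensively computed view of db[key]


def handle_get(key):
    if key not in cache:
        cache[key] = expensive_render(db[key])  # rule 1: cache on read
    return cache[key]


def handle_post(key, value):
    db[key] = value
    cache.pop(key, None)                        # rule 2: invalidate on write
```

The separation of read and write interfaces is what makes this trivial: there is exactly one code path that can make a cache entry stale, and it does the invalidation itself.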