Caching and REST
(by Luke)
One of the things I was supposed to write about last week was how I'm rethinking some of Puppet's internal caching. This rethinking is a direct result of listening to ThoughtWorks' IT Matters Podcast on REST (I've only listened to part 1 so far). I actually listened to the episode three times, because it's only about 20 minutes and I listened to it on a 60-minute bike ride, which worked well because it was so windy that day that I didn't hear the whole thing on any of those listenings.
I'll hopefully write later about how this podcast made me rethink how environments are used in fileserving, but for now, I'm going to focus on caching.
Indirection
For a couple of months now, Puppet has had an Indirector module that is basically useful for connecting classes with collections of instances of those classes. The only reason you'd really even bother to use it is if you had multiple collections, and needed to interact with different collections at different times, but you wanted those differences to be transparent.
For instance, when retrieving node information, you just call this code:
Puppet::Node.find("mynode")
Somewhere else, you'll have configured which collection (the word I'm currently using is terminus) this uses, and the Indirector just delegates the find call to the right collection. For nodes, you might be using the exec collection, which calls an external script, turns the resulting YAML into a Node instance, and returns it (or returns nil if nothing was found).
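A minimal sketch of that delegation, assuming a much-simplified model. The Node class, its terminus accessor, and MemoryTerminus here are hypothetical stand-ins for illustration, not Puppet's actual Indirector code:

```ruby
# Hypothetical, simplified sketch of indirection; not Puppet's real classes.
class Node
  attr_reader :name, :classes

  def initialize(name, classes = [])
    @name = name
    @classes = classes
  end

  class << self
    # The configured terminus (collection) that find is delegated to.
    attr_accessor :terminus

    # Callers only ever use this method; they never see which terminus answers.
    def find(name)
      terminus.find(name)
    end
  end
end

# A terminus backed by an in-memory hash (a stand-in for exec, yaml, etc.).
class MemoryTerminus
  def initialize(data)
    @data = data
  end

  # Returns a Node, or nil if nothing was found.
  def find(name)
    classes = @data[name]
    classes && Node.new(name, classes)
  end
end

# Configuration happens in one place...
Node.terminus = MemoryTerminus.new({ "mynode" => ["webserver"] })

# ...and callers stay the same no matter which terminus is configured.
node = Node.find("mynode")
missing = Node.find("othernode") # nil, because nothing was found
```

Swapping in a different terminus only changes the configuration line; the find call is untouched.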
I think the Indirector is pretty cool, and it's certainly simplified a lot of my modeling of interacting with different sources of information. Those who are familiar with REST, at least how it's usually done in the Ruby world, will recognize the find as one of the methods usually used for REST interfaces -- it's mapped to the HTTP verb GET. One of the primary design goals of the Indirector was to facilitate REST interfaces, so the methods we're indirecting are, not coincidentally, exactly the methods you'd implement for REST support.
Caching
One of the later additions to the indirection code was support for cache collections. That is, you might have a canonical collection, and then a cache collection for speed or proximity purposes. Following our Node example above, if you were using the exec collection, you'd probably want to have the results cached in the yaml collection, so they were inexpensive to retrieve.
The critical question with any caching system is how to know when the cache is dirty. How do you know if you should use the cached node information or go back to the source?
I expect there are just about as many answers to this question as there are caching implementations. I had never implemented a caching solution before, and I probably misinterpreted my discussions with Rick Bradley, because I ended up choosing a not-very-good system. The current cache invalidation mechanism is based on relative versions: If the version of the cached object is older than the version of the object in the other collection, then your cache is dirty.
What is a version? Well, normally it's just the timestamp of when the instance was created. This might work okay for some systems, but in general, the timestamp ends up being pretty useless. Look at our Node example -- the timestamp of the exec collection is always later, because we retrieve the cache version, then generate a new node using the exec collection, and compare. Duh. The answer's always the same.
Even worse, in most situations the cache doesn't save you any work, because you're pulling fresh data from the original source. If we have to re-execute the external node script to get the latest node version, we haven't saved any effort at all, we've just added a bunch of useless work, which is stupid.
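The flaw can be reproduced in a few lines. This is a sketch under assumptions: Versioned, CountingSource, and VersionedCache are hypothetical names, and a simple counter stands in for the creation timestamp:

```ruby
# Hypothetical sketch of version-based invalidation; names are illustrative only.
Versioned = Struct.new(:name, :version)

# Stand-in for an expensive source such as an external node script.
class CountingSource
  attr_reader :calls

  def initialize
    @calls = 0
    @clock = 0 # a counter standing in for "timestamp at creation"
  end

  # Every find builds a new instance, so its version is always the newest.
  def find(name)
    @calls += 1
    @clock += 1
    Versioned.new(name, @clock)
  end
end

class VersionedCache
  def initialize(source)
    @source = source
    @store = {}
  end

  def find(name)
    cached = @store[name]
    # We have to hit the source just to learn what its version is.
    fresh = @source.find(name)
    # This comparison can never succeed: fresh was created after cached.
    return cached if cached && cached.version >= fresh.version
    @store[name] = fresh
  end
end

source = CountingSource.new
cache = VersionedCache.new(source)
3.times { cache.find("mynode") }
source.calls # one source call per lookup, so the cache saved nothing
```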
Puppet 0.24.4 "fixed" this problem by saying that the cached node's version was the timestamp of the node's Facts cache. If the facts are updated, then the cache needs to be updated. This seems to mostly work, but it feels like a hack for something that should be easy.
TTL
So, on to the podcast. It was a good podcast in general, and they focused a good bit on caching. At first I found this pretty strange -- why is caching an important design criterion? As they talked, though, I realized that a generalized, simple caching model is useful in a lot more places than I would expect, including in Puppet.
There didn't seem to be any disagreement over the best way to handle knowing when a cache is dirty -- they apparently just use time-to-live (TTL) or expiration headers. I think it was the second time listening through that I realized that the vast majority of my caching problems could be fixed with this.
Puppet has a natural TTL for most of its information -- every host runs every half an hour, so if you set a TTL of half an hour (or whatever your run interval is), then you'll get fresh data once a run, and cached data the rest of the time. In the above Node scenario, the exec collection would set the TTL of the node (so that your external node app could pick its own TTL), or Puppet would have a default TTL equal to the run interval. Then, when Puppet goes to check whether its cache is dirty, it could just compare the cached object's expiration time (when it was cached plus its TTL) against the current time -- no need to hit both collections, and no arbitrary definition of "version".
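A minimal sketch of TTL-based caching along those lines. TTLCache, its ttl: and clock: parameters, and the lambda source are all assumptions made for illustration; the injectable clock just makes the behavior easy to demonstrate:

```ruby
# Hypothetical TTL cache; not Puppet's implementation.
class TTLCache
  Entry = Struct.new(:value, :expires_at)

  # source: any callable that returns the canonical value for a name.
  # ttl:    lifetime in seconds (for example, the run interval).
  # clock:  callable returning the current time; injectable for testing.
  def initialize(source, ttl:, clock: -> { Time.now })
    @source = source
    @ttl = ttl
    @clock = clock
    @store = {}
  end

  def find(name)
    entry = @store[name]
    # Fresh entry: answer from the cache without touching the source.
    return entry.value if entry && @clock.call < entry.expires_at

    # Missing or expired: go back to the source and restart the TTL.
    value = @source.call(name)
    @store[name] = Entry.new(value, @clock.call + @ttl) unless value.nil?
    value
  end
end

# Usage with a fake clock (seconds) and a 30-minute TTL.
calls = 0
now = 0
source = ->(name) { calls += 1; "node data for #{name}" }
cache = TTLCache.new(source, ttl: 30 * 60, clock: -> { now })

cache.find("mynode") # miss: hits the source
now = 15 * 60
cache.find("mynode") # hit: still inside the TTL, source untouched
now = 30 * 60
cache.find("mynode") # expired: hits the source again
```

Deciding whether the cache is dirty needs only the entry and the clock, so the source is consulted once per TTL rather than once per lookup.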
This actually makes even more sense with the current problem I'm trying to solve. I'm trying to remodel the SSL certificate signing process, and it's gotten pretty messy. With this, though, you just set the TTL of the certificate to its own internal TTL, and you use the local system as the cache and the CA server as the ultimate source. If there is a local cert and it's still valid, use it; if there's a local cert but we're past its TTL, then discard it and get a fresh cert; if there's no cert, then get one from the server and cache it locally.
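Those three cases can be sketched as follows. Unlike a cache-wide default, the expiry here comes from the object itself. Cert is a hypothetical struct standing in for a real certificate (only its not_after expiry matters), and CertCache and the ca callable are likewise assumptions:

```ruby
# Hypothetical stand-in for a certificate; only the expiry matters here.
Cert = Struct.new(:name, :not_after)

class CertCache
  # ca:    any callable that returns a Cert (or nil) for a name.
  # clock: callable returning the current time; injectable for testing.
  def initialize(ca, clock: -> { Time.now })
    @ca = ca
    @clock = clock
    @local = {}
  end

  def find(name)
    cert = @local[name]
    # Case 1: there is a local cert and it is still valid, so use it.
    return cert if cert && @clock.call < cert.not_after

    # Case 2: the local cert is past its TTL, so discard it.
    @local.delete(name)

    # Cases 2 and 3: get a cert from the CA and cache it locally.
    fresh = @ca.call(name)
    @local[name] = fresh if fresh
    fresh
  end
end
```

The cache never needs to ask the CA whether a cert is current; the cert carries that answer with it.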
Next Steps
I don't have the whole thing figured out mentally yet, but I'm pretty close. At the least, the next step is to replace the current broken version-based cache with TTL-based caching. The two things I most need to resolve are:
- Who's responsible for the TTL? Is it the indirection (e.g., Node), or the collection (e.g., the external node script)?
- How does the user configure the TTL? Say I want the TTL for my node to be 30 seconds instead of 30 minutes, or I want to invalidate the cached values for all nodes; how would I do that?
Obviously, these two things are linked -- the user needs a complete configuration path from the command line or configuration file to the bit that actually sets the TTL.
For now, fortunately, I don't need to worry about it, because I can just stick with the run interval as the TTL for essentially everything I'm doing. As things get more interesting, though, we're going to want to configure these values, because....
TTL Can Help Provide Change Control
One of my primary goals in moving the catalog compiling process to REST is to enable a decoupling between compiling and applying. In other words, I want people to be able to apply a configuration without recompiling.
Imagine a configuration TTL of a week -- every host recompiles its configuration during some specific maintenance window, like Sunday morning between 2 and 6 am. They still apply their configurations every half an hour, but that's normally just validating that nothing has drifted.
Obviously, this wouldn't be used by most shops -- most people would still want all hosts to recompile every time. But for those shops that are highly worried about change control, or those who want to do rolling upgrades, where they upgrade 10% of a pool of servers at a time, this would help a lot. You take your pool of servers, trigger a recompile on 10%, and once you're confident they're working, you trigger a recompile on another 10%, and so on.
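The rolling-upgrade loop could look roughly like this. It is only a sketch: rolling_batches and rolling_recompile are hypothetical helpers, the block passed in stands for whatever actually triggers a recompile on a host, and healthy stands for however you decide a batch is working:

```ruby
# Split a pool of hosts into batches of roughly `percent` of the pool each.
def rolling_batches(hosts, percent = 10)
  size = [(hosts.size * percent / 100.0).ceil, 1].max
  hosts.each_slice(size).to_a
end

# Trigger a recompile batch by batch, stopping at the first unhealthy batch.
# The block is the (hypothetical) recompile trigger for a single host.
# Returns the hosts whose batches were recompiled and confirmed healthy.
def rolling_recompile(hosts, percent: 10, healthy: ->(_batch) { true })
  confirmed = []
  rolling_batches(hosts, percent).each do |batch|
    batch.each { |host| yield host }
    break unless healthy.call(batch)
    confirmed.concat(batch)
  end
  confirmed
end

# Usage: 20 hosts, 10% at a time, recording which hosts were triggered.
pool = (1..20).map { |i| "web#{i}" }
triggered = []
rolling_recompile(pool, percent: 10) { |host| triggered << host }
```

Hosts in batches that were never reached keep applying their existing cached configuration until their turn comes.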
Once you can do that with Puppet, it'll feel almost enterprisey. :)
Wed, 02 Apr 2008 | Tags: programming, thoughtworks, podcast, rest, caching, design, ruby, api, luke
Podcast with Hyperic
I know it's been a long time since I posted, and there's lots to post about, but it's been a very long month with little time.
Until I get my act together (which likely won't happen until I'm in Melbourne for LCA), here's at least a snippet.
I did a podcast with John Mark Walker of Hyperic a couple of weeks ago when I was in San Francisco for the Velocity summit.
I actually haven't had a chance to listen to it yet, but apparently I do some smack talk or something. Give it a listen.