Puppet: System Administration Automated

SplayTime and Scheduling


This is a post from Luke's old blog; it is saved here statically for historical purposes, as of October 2008

I've been thinking about something akin to cfengine's SplayTime for a long time, and for some reason I immediately combine it with something that resembles scheduling.

SplayTime

SplayTime provides a simplistic kind of load balancing: it makes sure that your clients don't all contact your server in the same second. This is a real concern, since quite often you'd be running your automation tool out of cron and all of your clients will have their time synced via NTP. If, say, you run the tool every half an hour and it generally has a 10 second conversation with the server, then for 10 seconds out of every half hour your server will be slammed, and the rest of the time it will just be sitting there all kinds of idle.

So, splay time causes an essentially random delay before server communication. This balances connections across some fixed interval, usually 50% of the period of execution -- e.g., if you run Puppet every half hour then you would splay across 15 minutes.
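To make the effect concrete, here is a rough Python sketch of the idea, not how cfengine or Puppet actually implement it. The function names, the 1000-client count, and the fixed seed are all made up for illustration; the 30-minute interval and 50% window come from the example above.

```python
import random
from collections import Counter

def splay_delay(run_interval_s, fraction=0.5, rng=random):
    """Pick a random delay, in seconds, inside the splay window.

    The window is `fraction` of the run interval (50% by default).
    """
    window = run_interval_s * fraction
    return rng.random() * window

def busiest_second(start_times):
    """Return how many clients start talking in the single busiest second."""
    counts = Counter(int(t) for t in start_times)
    return max(counts.values())

# 1000 hypothetical clients, all fired by cron at the same NTP-synced instant.
rng = random.Random(42)  # fixed seed only so the sketch is repeatable
unsplayed = [0.0 for _ in range(1000)]
splayed = [splay_delay(30 * 60, rng=rng) for _ in range(1000)]

print("busiest second without splay:", busiest_second(unsplayed))
print("busiest second with splay:   ", busiest_second(splayed))
```

Without splay every client lands in the same second; with it they are smeared across the 15-minute window, so the busiest second sees only a handful of them.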

There are some really annoying things about SplayTime. One is that you want the tools to use splay time when they run automatically but not when they are used interactively. More importantly, though, splay time introduces both a wait and an uncertainty before work completes. You update your configurations, and with a 15 minute splay it is generally a minimum of 15 minutes before they have all propagated.
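One way to handle the interactive case is to skip the delay whenever a person appears to be at the keyboard. The Python sketch below assumes that "stdin is a terminal" is a good enough test for "interactive", which is only a heuristic and my assumption, and `choose_delay` is a hypothetical name rather than anything from a real tool.

```python
import random
import sys

def choose_delay(splay_limit_s, interactive=None, rng=random):
    """Return how many seconds to wait before contacting the server.

    If `interactive` is not given, guess it from whether stdin is a terminal
    (an assumption: cron jobs normally have no terminal attached).
    """
    if interactive is None:
        interactive = sys.stdin is not None and sys.stdin.isatty()
    if interactive:
        return 0.0  # a person is waiting, so don't make them sit through the splay
    return rng.random() * splay_limit_s

# Run by hand: no wait. Run from cron: some delay inside the 15-minute window.
print(choose_delay(900, interactive=True))
print(choose_delay(900, interactive=False))
```

This removes the wait for the person at the keyboard, but it does nothing about the propagation delay for the automatic runs.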

I guess this is the best way to do things, but I just can't escape the feeling that there's a better way. Dern.

Scheduling

Scheduling is far more complicated than splay time, and the two are generally unrelated, except that both deal with timing and they could potentially conflict -- your splay time essentially sets your minimum period of work, in that if you splay up to five minutes then you can't really schedule things more often than every five minutes.

There are two main types of scheduling, period-based and time-based, and there are two ways to attach a schedule to an element, either by element type or by configuration set. For instance, you might want to say that packages can only update once a day between 2 am and 4 am, that user elements should be checked every 15 minutes all day, or that all elements involved in providing DNS should be checked every 5 minutes.
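The two types of schedule can be sketched in Python as below. This is only an illustration of the distinction, not a proposed Puppet design; the class names are hypothetical, and the window version assumes the window does not cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional

@dataclass
class PeriodSchedule:
    """Period-based: due whenever at least `period` has passed since the last run."""
    period: timedelta

    def is_due(self, now: datetime, last_run: Optional[datetime]) -> bool:
        return last_run is None or now - last_run >= self.period

@dataclass
class DailyWindowSchedule:
    """Time-based: due at most once per calendar day, and only inside [start, end)."""
    start: time
    end: time

    def is_due(self, now: datetime, last_run: Optional[datetime]) -> bool:
        in_window = self.start <= now.time() < self.end
        ran_today = last_run is not None and last_run.date() == now.date()
        return in_window and not ran_today

# The three examples from the paragraph above.
packages = DailyWindowSchedule(start=time(2, 0), end=time(4, 0))
users = PeriodSchedule(period=timedelta(minutes=15))
dns = PeriodSchedule(period=timedelta(minutes=5))

now = datetime(2006, 4, 4, 3, 0)
print(packages.is_due(now, last_run=None))
print(users.is_due(now, last_run=now - timedelta(minutes=10)))
print(dns.is_due(now, last_run=now - timedelta(minutes=10)))
```

Attaching `packages` and `users` by element type, and `dns` by configuration set, is a separate question from how each schedule decides whether it is due.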

There are tons of complications with scheduling, though. The first and probably most annoying complication is related to dependencies -- if element A gets applied every 5 minutes and element B gets applied every 30 minutes, and element A depends on element B, should that dependency cause element B to also get applied every 5 minutes instead of every 30?
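If the answer to that question were "yes", the rule could look something like the Python sketch below: anything that is depended upon gets pulled down to the shortest period of whatever depends on it, transitively. This is just one of the two possible answers, written out so the consequences are visible; the function name and data layout are made up for illustration.

```python
def effective_periods(periods, depends_on):
    """Tighten periods along dependencies.

    periods:    element name -> period in minutes
    depends_on: element name -> list of element names it depends on

    An element never runs less often than the things that depend on it.
    """
    effective = dict(periods)
    changed = True
    while changed:  # repeat until nothing moves, to cover chains of dependencies
        changed = False
        for name, requirements in depends_on.items():
            for req in requirements:
                if effective[name] < effective[req]:
                    effective[req] = effective[name]
                    changed = True
    return effective

# Element A (every 5 minutes) depends on element B (every 30 minutes).
print(effective_periods({"A": 5, "B": 30}, {"A": ["B"]}))
```

Under this policy B ends up at 5 minutes too, which shows the cost: one frequently checked element can drag everything beneath it onto its own schedule.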

Then there are the problems with overriding the schedule. No matter what, you always want to apply the full configuration when the machine is first installed, and you probably want to do the same thing at boot time.

When I wrote ISconf, I wrote it so that it had different contexts arranged hierarchically where 'boot' context ran the entire configuration, 'daily' ran once a day during the maintenance window, and 'hourly' ran every hour. It was a pretty simple system to operate and understand, and scheduling was easy because I just set my cron jobs up so that they specified the context to run in. While I think this would probably be about as good as any other system I'm aware of for Puppet, I don't think it would be quite ideal, either.
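Roughly, the context idea can be sketched in Python like this. It is not ISconf code: the item names are invented, and it assumes "hierarchically" means that each broader context also runs everything from the narrower ones.

```python
# Narrowest to broadest; a broader context includes everything narrower than it.
CONTEXTS = ["hourly", "daily", "boot"]

def items_for(context, items):
    """Return the items that run in `context`.

    items: item name -> the narrowest context in which that item should run.
    """
    level = CONTEXTS.index(context)
    return [name for name, ctx in items.items() if CONTEXTS.index(ctx) <= level]

# Hypothetical configuration items, for illustration only.
items = {
    "ntp_check": "hourly",
    "package_updates": "daily",
    "partition_disks": "boot",
}

for context in CONTEXTS:
    print(context, "->", items_for(context, items))
```

Each cron job then only needs to name its context, which is what made the system easy to operate.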

One of the things I've also been thinking about in terms of scheduling is how to decouple the scheduling information from the criteria used to schedule. If I write code that I want to be able to share, but I run the code on an hourly schedule and you want to run it daily, then it's pretty important that Puppet provide some way for me to send you the code without sending you the scheduling restrictions. I'm considering having something like a schedule object that sets restrictions and matches elements, along with providing another way to tag elements with additional information for exactly this matching:

service { sshd:
    running => true,
    tag => critical
}
...
schedule { often:
    repeat => 12,
    unit => hour, # run it 12 times an hour, i.e., every 5 min
    match => critical # against all items tagged 'critical'
}

This doesn't solve the problem, though, because sshd might not be critical to you.
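The matching itself is simple enough. Here is a Python sketch of what the schedule object above would do; it models the idea rather than anything Puppet actually does, and the `motd` element and the unit table are assumptions added for the example.

```python
from dataclasses import dataclass

UNIT_SECONDS = {"hour": 3600, "day": 86400}  # assumed set of units

@dataclass
class Schedule:
    name: str
    repeat: int  # how many times per unit
    unit: str
    match: str   # tag that selects elements

    def period_seconds(self):
        """Seconds between runs, e.g. 12 times an hour -> 300 seconds."""
        return UNIT_SECONDS[self.unit] / self.repeat

    def applies_to(self, tags):
        """True if an element carrying `tags` falls under this schedule."""
        return self.match in tags

often = Schedule(name="often", repeat=12, unit="hour", match="critical")

# Element name -> tags. sshd is tagged by the shared code; motd is made up.
elements = {"sshd": {"critical"}, "motd": set()}
matched = [name for name, tags in elements.items() if often.applies_to(tags)]

print(often.period_seconds(), matched)
```

The schedule lives apart from the elements, but the `critical` tag still travels with the shared code, and that is exactly the part that remains unsolved.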

So I don't really know what to do here, but it seems like having separate explicit schedule objects might be a better idea than allowing an association of a schedule object with each element in the configuration.

I dunno.

At this point, I'm just going to add splay time and call it quits, at least until I know what people really need here.

Tue, 04 Apr 2006