Puppet: System Administration Automated

Error Handling


I need to implement error handling within Puppet. At this point, errors are largely handled internally and always the same way, which is probably not what most people want. Even if it is okay to make everything consistent, it is probably a good idea to rethink and revamp error handling as it exists now.

In particular, I don't have a clear idea of how to deal with part of a system failing. If a single state fails, either at initialization or run time, should the whole object fail? If a single object in a dependency tree fails, what should the rest of the tree do?

I've been thinking about these questions for a while, and I don't really have a clear answer. It seems like errors are generally an indication of a misconfiguration, so it would make sense to fail the whole object and require a reconfiguration, but I'm not sure that's really the best option.


Fri, 21 Apr 2006


What Error Handling Means in Puppet


I was recently asked about my post on error handling in Puppet; someone responded that exceptions are a potential solution. Rather than responding directly to him, I want to lay out some of the issues I've wrestled with in terms of error handling and explain why I haven't yet built anything that looks like a solution in Puppet.

First, let me point out that Puppet extensively uses exceptions internally, and it already does a good job of not letting those exceptions kill Puppet processes (although I'm sure that some can still leak out). The question isn't how to raise an exception, or whether I should try to trap those exceptions; it's how to react upon catching one.

For instance, given the following code:

file { "/export/docroots/reductivelabs.com/htdocs":
    owner => luke, group => web, mode => 664, recurse => true
}

If the luke user does not exist, there will obviously be an error, but how should Puppet behave on such an error? Should the other work still be done, or should that file just be ignored entirely? Should it be configurable? Should it affect any other objects?

What about this code:

file { "/etc/apache":
    source => "puppet://server/apps/apache", recurse => true
}

service { apache:
    subscribe => file["/etc/apache"],
    running => true
}

This results in the apache service getting restarted if any of the config files get updated. What should happen if the service cannot successfully restart? Should the changes to the config files be rolled back? Should the system just give up and notify the user? Should it retry? Again, should it be configurable? Should it affect other parts of the system? If there are objects that further depend on the apache service, should those objects not be checked or managed until the service is functional again?

There are basically four stages in managing a Puppet object in which there can be a failure:

Note that the first stage is what I call config time, because it happens when Puppet receives its configuration from the central server. All of the other stages are at run time, in that they can only happen when Puppet is actively doing something with the object. These stages are quite different, and some reactions will only make sense at certain stages. For instance, if an object was not able to be instantiated, then all of its dependent objects will also fail because they will list it as a dependency and it will not exist in the configuration.

There are important questions that affect how Puppet should react to a failure:

There are really only a few reactions that Puppet can take:

For a long time, I've been planning on developing functionality around an onerror metaparameter. This would basically allow the user to specify how Puppet should behave in the case of an error. Any of the above reactions could be specified, so valid values would be, I guess, ignore, ignoreobj, ignoretree, ignoreall, rollback, rolldeps, rollall, and fail.
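
As a rough sketch only, here is how that might look on the apache example from earlier. The onerror parameter and its values are the hypothetical ones named above; none of this is implemented, and the meaning I attach to each value in the comments is an assumption rather than a settled definition.

```puppet
# Hypothetical sketch: onerror is a proposed metaparameter, not an implemented one.
file { "/etc/apache":
    source => "puppet://server/apps/apache", recurse => true,
    onerror => rollback     # assumed meaning: undo this object's changes on failure
}

service { apache:
    subscribe => file["/etc/apache"],
    running => true,
    onerror => ignoretree   # assumed meaning: skip everything that depends on this object
}
```

Setting the reaction per object, rather than globally, would let a critical service and a cosmetic file fail in different ways within the same configuration.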

This seems somewhat straightforward, in that each of these reactions is relatively clear and explainable. I just haven't had the time to go through and completely characterize each of them and build the infrastructure to support all of them. I have transactions and I can roll them back, so I'm prepared to do this eventually, but I'm not currently convinced that users will be doing much with this in the short-to-medium term, so it hasn't been a focus.

Fortunately it's probably not much work, so if someone decides it's a high priority it could be done fairly quickly. The only real problem is the variety of places in which the error can occur. If you were not able to even instantiate the object because of an error, you still need to know what that object's preference for error-handling is, and the reaction to that error has to happen in a completely different part of the process.

I think I'm largely waiting until I have customer feedback to do anything here. There's a certain level of usage maturity that's required before any action really makes sense, so I'm not in a hurry.


Fri, 04 Nov 2005