Error Handling
I need to implement error handling within Puppet. At this point, errors are largely handled internally and always the same way, which is probably not what most people want. Even if it is okay to make everything consistent, it is probably a good idea to rethink and revamp error handling as it exists now.
In particular, I don't have a clear idea of how to deal with part of a system failing. If a single state fails, either at initialization or run time, should the whole object fail? If a single object in a dependency tree fails, what should the rest of the tree do?
I've been thinking about these questions for a while, and I don't really have a clear answer. It seems like errors are generally an indication of a misconfiguration, so it would make sense to fail the whole object and require a reconfiguration, but I'm not sure that's really the best option.
Fri, 21 Apr 2006 | Tags: puppet, errors, design
What Error Handling Means in Puppet
I was recently questioned about my post about error handling in Puppet; someone responded that exceptions are a potential solution. Rather than responding directly to him, I want to lay out some of the issues I've wrestled with in terms of error handling and explain why I haven't yet created what looks like a solution in Puppet.
First, let me point out that Puppet extensively uses exceptions internally, and it already does a good job of not letting those exceptions kill Puppet processes (although I'm sure that some can still leak out). The question isn't how to raise an exception, or whether I should try to trap those exceptions; it's how to react upon catching one.
For instance, given the following code:
file { "/export/docroots/reductivelabs.com/htdocs":
    owner => luke, group => web, mode => 664, recurse => true
}
If the luke user does not exist, there will obviously be an error, but how should Puppet behave on such an error?
Should the other work still be done, or should that file just be ignored entirely? Should it be configurable? Should
it affect any other objects?
What about this code:
file { "/etc/apache":
    source => "puppet://server/apps/apache", recurse => true
}
service { apache:
    subscribe => file["/etc/apache"],
    running => true
}
This results in the apache service getting restarted if any of the config files get updated. What should happen if
the service cannot successfully restart? Should the changes to the config files be rolled back? Should the system just
give up and notify the user? Should it retry? Again, should it be configurable? Should it affect other parts of the
system? If there are objects that further depend on the apache service, should those objects not be checked or
managed until the service is functional again?
There are basically four stages in managing a Puppet object in which there can be a failure:
1. At object creation time
2. When the object's current state is checked
3. When the object's state is corrected
4. When the object is refreshed because of a change to a dependency
Note that the first stage is what I call config time, because it happens when Puppet receives its configuration from the central server. All of the other stages are at run time, in that they can only happen when Puppet is actively doing something with the object. These stages are quite different, and some reactions will only make sense at certain stages. For instance, if an object could not be instantiated, then all of its dependent objects will also fail, because they list it as a dependency and it does not exist in the configuration.
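To make the config-time case concrete, here is a minimal sketch in Python (not Puppet's actual code) of how one failed instantiation spreads to everything that requires the missing object. The function name `broken_at_config_time`, the `requires` mapping, and the `file[/etc/motd]` object are all hypothetical, invented for illustration.

```python
def broken_at_config_time(requires, instantiated):
    """Return every object that cannot be managed because it, or
    something it requires, was never instantiated."""
    # Start with the objects that failed at creation time.
    broken = {name for name in requires if name not in instantiated}
    changed = True
    while changed:  # repeat until no new failures propagate
        changed = False
        for name, deps in requires.items():
            if name not in broken and any(dep in broken for dep in deps):
                broken.add(name)
                changed = True
    return broken


# Hypothetical configuration: each object maps to the objects it requires.
requires = {
    "file[/etc/apache]": [],
    "service[apache]": ["file[/etc/apache]"],
    "file[/etc/motd]": [],
}
# Suppose the apache config directory failed at object creation time.
instantiated = {"service[apache]", "file[/etc/motd]"}
print(sorted(broken_at_config_time(requires, instantiated)))
```

The unrelated file is untouched, while the service is dragged down purely because its requirement does not exist.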
There are important questions that affect how Puppet should react to a failure:
Is the error a central configuration error?
Is the error transient, or is it unfixable until a new configuration is provided?
Does the failed object have any dependencies?
Does the failed object have requirements that could have caused the failure?
Is the error the result of something Puppet did in this transaction?
Would rolling the current transaction back be likely to fix the error?
What should happen if the transaction is rolled back and there are still errors?
There are really only a few reactions that Puppet can take:
Log the error but ignore it. This is what Puppet does now, with one questionable side effect: it removes all dependency relationships, so dependent objects can still be configured but will not be restarted or otherwise refreshed. This might be okay, and it might be really dumb.
Ignore the object until a new configuration is received
Ignore the object and all downstream dependent objects until a new configuration is received
Do nothing until a new configuration is received
Roll back any changes to the object itself
Roll back any changes to objects in the object's dependency tree, and do not manage these objects until a new configuration is received
Roll back all changes, and do not do anything until a new configuration is received
Fail immediately and wait for a human to fix everything
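The rollback reactions all rest on the same idea: remember how to undo each change, and undo them in reverse order when something fails. The following Python sketch shows that idea under simple assumptions. The `Transaction` class and its methods are hypothetical and are not Puppet's transaction implementation.

```python
class Transaction:
    """Records undo actions so applied changes can be reverted."""

    def __init__(self):
        self._undo = []

    def apply(self, change, undo):
        change()                 # may raise; nothing is recorded if it does
        self._undo.append(undo)

    def rollback(self):
        while self._undo:
            self._undo.pop()()   # undo the most recent change first


# Stand-in for the real system state of a single file object.
state = {"owner": "root", "mode": "644"}

def set_attr(key, value):
    def change():
        state[key] = value
    return change

def failing_restart():
    raise RuntimeError("service failed to restart")

txn = Transaction()
try:
    txn.apply(set_attr("owner", "luke"), set_attr("owner", state["owner"]))
    txn.apply(set_attr("mode", "664"), set_attr("mode", state["mode"]))
    txn.apply(failing_restart, lambda: None)
except RuntimeError:
    txn.rollback()

print(state)
```

Whether the rollback covers one object, its dependency tree, or the whole transaction is then only a question of which undo actions are kept together.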
For a long time, I've been planning on developing functionality around an onerror metaparameter. This would basically allow the user to specify how Puppet should behave in the case of an error. Any of the above reactions could be specified, so valid values would be, I guess, ignore, ignoreobj, ignoretree, ignoreall, rollback, rolldeps, rollall, and fail.
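One way to picture these values is as a table that pairs each with a scope and a flag for whether changes are rolled back. The Python sketch below is only an illustration: the pairing of each value with a reaction is my reading of the list order, "tree" is assumed to mean downstream dependents, and `service[monitor]` is a made-up object.

```python
# value: (scope, roll_back). The mapping is an assumption, not a spec.
ONERROR = {
    "ignore":     ("none", False),
    "ignoreobj":  ("object", False),
    "ignoretree": ("tree", False),
    "ignoreall":  ("all", False),
    "rollback":   ("object", True),
    "rolldeps":   ("tree", True),
    "rollall":    ("all", True),
    "fail":       ("halt", False),
}


def affected_objects(failed, dependents, onerror):
    """Return the set of objects that the chosen reaction touches."""
    scope, _roll_back = ONERROR[onerror]  # KeyError on an unknown value
    if scope == "none":
        return set()
    if scope == "object":
        return {failed}
    if scope in ("all", "halt"):
        return set(dependents)
    # scope == "tree": walk everything downstream of the failed object
    seen, stack = set(), [failed]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(dependents.get(name, []))
    return seen


# Each object maps to the objects that depend on it.
dependents = {
    "file[/etc/apache]": ["service[apache]"],
    "service[apache]": ["service[monitor]"],
    "service[monitor]": [],
    "file[/etc/motd]": [],
}
print(sorted(affected_objects("file[/etc/apache]", dependents, "ignoretree")))
```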
This seems somewhat straightforward, in that each of these reactions is relatively clear and explainable. I just haven't had the time to go through and completely characterize each of them and build the infrastructure to support all of them. I have transactions and I can roll them back, so I'm prepared to do this eventually, but I'm not currently convinced that users will be doing much with this in the short-to-medium term, so it hasn't been a focus.
Fortunately it's probably not much work, so if someone decides it's a high priority, it could be done fairly quickly. The only real problem is the variety of places in which the error can occur. If you were not able to even instantiate the object because of an error, you still need to know what that object's preference for error handling is, and the reaction to that error has to happen in a completely different part of the process.
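One possible way around the instantiation problem is to read the error preference from the raw specification before trying to build the object, so the preference survives even when construction fails. This is a Python sketch under assumed names (`FileObject`, `instantiate`, `DEFAULT_ONERROR`), not a description of how Puppet works. The default of ignore is assumed because logging and ignoring is the current behavior.

```python
DEFAULT_ONERROR = "ignore"  # assumption: mirrors the current log-and-ignore behavior


class FileObject:
    """Stand-in for a managed file; fails if the owner does not exist."""

    def __init__(self, path, owner, known_users):
        if owner not in known_users:
            raise ValueError(f"unknown user: {owner}")
        self.path = path
        self.owner = owner


def instantiate(spec, known_users):
    """Return (object_or_None, onerror_policy, error_or_None)."""
    params = dict(spec)
    # Pull the policy out first, so it is known even if construction fails.
    policy = params.pop("onerror", DEFAULT_ONERROR)
    try:
        return FileObject(known_users=known_users, **params), policy, None
    except ValueError as err:
        return None, policy, err


spec = {
    "path": "/export/docroots/reductivelabs.com/htdocs",
    "owner": "luke",
    "onerror": "ignoretree",
}
obj, policy, err = instantiate(spec, known_users={"root"})
print(obj, policy, err)
```

The caller gets the policy back alongside the failure, which lets the reaction be carried out later, in whatever part of the process handles run-time errors.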
I think I'm largely waiting until I have customer feedback to do anything here. There's a certain level of usage maturity that's required before any action really makes sense, so I'm not in a hurry.