One thing I've always wished for in the config management systems I use is the ability for modules to roll back changes in the event of failure. I can't count on both hands the number of times my coworkers and I have almost shot ourselves in the foot, either by doing something stupid that failed in a way the configuration management tool wasn't aware of (so it continued a run that should have stopped), or by running automation that locked us out of multiple boxes and needed hours of manual reversion.
You can kind of hack it into some of them with creative handler use, but there isn't anything out there that works well, and I think that some kind of run-time sanity check for resources would be a compelling and widely used feature. Here is a rough sketch of what I'm after (using an imaginary network provider):
```python
import os

# Role and Resources come from opsmop; NetInterface is the imaginary provider.
class NetExample(Role):

    def set_resources(self):

        def test_function():
            """
            Test the network changes by pinging the default gateway.
            Return True if the test passes, False if not.
            """
            # os.system returns the command's exit status; 0 means the ping succeeded.
            test_result = os.system("ping -c 1 10.1.2.1")
            return test_result == 0

        return Resources(
            NetInterface(
                name="eth0",
                boot_proto="static",
                ipv4_cidr='10.1.2.3/24',
                test_callback=test_function,
            )
        )
```
This probably isn't a good way to implement this, but hopefully it illustrates what I'm thinking of.
Understandably, this increases the complexity of any module that implements it, since the module needs to know how to put things back the way they were. Ideally you are already storing the values you compare against when you run plan, so you still have access to the previous state. For something like network interfaces that is probably sufficient to return to the old configuration, but I could see edge cases where unintended changes are made and opsmop may not have enough data to bring the old state back (for example, package rollbacks where dependencies were also updated).
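To make the "keep the previous state around and restore it" idea concrete, here is a minimal, framework-independent sketch. Nothing in it is opsmop API: `apply_with_rollback`, `write_live`, the `live` dict, and the old address `10.1.2.2/24` are all hypothetical names and values made up for illustration, and a real module would write to the system rather than to a dict.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(
    current: Config,
    desired: Config,
    write: Callable[[Config], None],
    test_callback: Callable[[], bool],
) -> bool:
    """Apply `desired`, run the test, and restore `current` if the test fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    previous = dict(current)  # snapshot of the state captured before the change
    write(desired)
    try:
        ok = bool(test_callback())
    except Exception:
        ok = False  # a test that crashes is treated as a failed test
    if not ok:
        write(previous)  # put things back the way they were
    return ok


# Usage with an in-memory stand-in for a real interface config (hypothetical).
live: Config = {"ipv4_cidr": "10.1.2.2/24"}


def write_live(values: Config) -> None:
    live.clear()
    live.update(values)


# The test callback fails here, so the old address is restored.
kept = apply_with_rollback(
    dict(live), {"ipv4_cidr": "10.1.2.3/24"}, write_live, lambda: False
)
```

The snapshot only covers what the module itself recorded, which is exactly the limitation described above: side effects outside `current` (such as upgraded dependencies) would not be restored by this approach.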
That said, I can think of a lot of use cases this would work for, such as:
- GETting a health check URL after changing an HTTP server config and expecting a certain response code and/or response body; put the old config back in place and reload if a failure occurs.
- (As above) Pinging a server to verify you haven't locked yourself out of the box; restore the old network config if a failure occurs.
- Testing port access following a local firewall change; restore the old firewall ruleset if a failure occurs.
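As a rough illustration of what test callbacks for the HTTP and firewall cases might look like, here is a sketch using only the Python standard library. The factory names `port_is_open` and `http_status_is` are hypothetical, as are the default timeouts; each factory returns a zero-argument function that could be handed to something like the `test_callback` parameter imagined above.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 3.0) -> Callable[[], bool]:
    """Build a callback that passes if a TCP connection to host:port succeeds."""
    def check() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Refused, unreachable, or timed out: treat as a failed test.
            return False
    return check


def http_status_is(url: str, expected: int = 200, timeout: float = 3.0) -> Callable[[], bool]:
    """Build a callback that passes if GETting `url` returns the expected status."""
    def check() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == expected
        except urllib.error.HTTPError as err:
            # urlopen raises for 4xx/5xx, so compare the error's status code too.
            return err.code == expected
        except (urllib.error.URLError, OSError):
            # Network-level failures count as a failed test.
            return False
    return check
```

Returning a closure keeps the callback signature argument-free, so the resource does not need to know anything about hosts, ports, or URLs; it only needs to call the function and act on the boolean.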
The alternative, I suppose, is to implement this just in the modules that need it. However, given the target audience of this project, I think that putting this power in the hands of users will result in far more comprehensive and stable automation infrastructure.