Test/rollback functions in modules?


#1

One thing I've always wished for in the config management systems I use is the ability for modules to rollback changes in the event of failure. I can't count on both hands the number of times that coworkers and myself have almost shot ourselves in the foot doing something stupid that either failed in a way the configuration management tool wasn't aware of (so it continued a run that should have stopped), or ran automation that locked us out of multiple boxes and needed hours of manual reversion.

You can kind of hack it into some of them with creative handler use, but there isn't anything out there that works well, and I think that having some kind of run-time sanity check for resources would be a compelling and widely used feature. Here is a rough sketch of what I'm after (using an imaginary network provider that doesn't exist yet):

class NetExample(Role):

def set_resources(self):

    def test_function():
        """
        Test the network changes by pinging the default gateway
        Return true if the test passes, return false if not
        """
        test_result = os.system("ping -c 1 10.1.2.1")
        if test_result:
            return True
        else:
            return False

    return Resources(
        NetInterface(
            name="eth0",
            boot_proto="static",
            ipv4_cidr='10.1.2.3/24',
            test_callback=test_function,
        )
    )

This probably isn't a good way to implement this, but hopefully it illustrates what I'm thinking of.

Understandably, this increases the complexity of any module that implements this as the module needs to know how to put things back the way they were before. Ideally you are storing all the values you are using to compare against when you run plan so you still have access to what the previous state was, and for the case of something like network interfaces this is probably sufficient to return to the old configuration, but I could see some edge cases where unintended changes are made and opsmop may not have enough data to bring the old state back (Thinking things like running package rollbacks where dependencies were also updated.)

With that being said, I could think of a lot of use cases this would work for such as:

  • GETing a health check URL after changing an HTTP server config and expecting a certain response code and/or response body, put the old config back in place and reload if a failure occurs.
  • (As above) Pinging a server to verify you haven't locked yourself out of the box, restore the old network config if a failure occurs.
  • Test port access following local firewall change, restore the old firewall ruleset if a failure occurs.

The alternative, I suppose, is to implement this just in the modules that need it. However given the target audience of this project, I think that putting this power in the hands of the users will result in far more comprehensive and stable automation infrastructure.


#2

Hi @graysonhead,

Thanks for posting!

Generally, I'm a strong advocate for immutable systems, and when building images most folks there will not be interested in rollback. Of course that doesn't help with your example at all.

I definitely see the value in what you describe for doing something that could potentially lock you out, for instance, I had the opportunity to see a big MSP's SSH rotation playbooks at one point and that was programmed VERY defensively, and the act of programming it was exceptionally difficult in a declarative sort of way with the tool we are using.

Right now, there's already something called a pre() and post() hook per resource, and Role is in fact a resource, which looks like this:

class FooRole(Role):

    def pre(self):
        # anything

    def set_resources(self):
        return Resources(...)

    def post(self):
       # do test here

Except the problem is that post() is just a dumb function (it would be great for doing anything else), and has no way to add resources at the moment. So this doesn't help you yet...

Hypothetically, maybe we could incorporate testing like this?

class FooRole(object):

     def test(self):
            # ...
            raise Rollback("this is a message that will show up")

     def set_rollback_resources(self):
           # this only runs if test is false
           return Resources(
           )

And we'd have to change the Executor code and Visitor code however needed to make that possible.

There's probably also the question of whether the handlers should run (I'd say yes?) and whether the other roles in the policy should continue (maybe no?)

IDK, thoughts on syntax and behaviors?


#3

This would work as long as there is a way to re-use discovered facts from before the run was made, that way you can be relatively assured you are putting the machine back the way it was before. Assuming those facts are accessible from the set_rollback_resources context, I actually like this solution better since there are a lot of use cases I can see where rolling back might require a different resource than the one that caused the problem (If there is a different provider for bonds vs interfaces, then you would need to set up an interface with a sane configuration in order to recover from a failed bond).

I assume something like this would work:

Once some network interface fact providers are built out it would allow users to put interfaces back to their old configs assuming there was some consistency in their environments to start with. This could also work for the SSH rotation example assuming there was a fact provider that covered everything that they changed during a rotation.

I think those are sane defaults, I can't think of any scenarios where I wouldn't want to re-run the handlers, but I wouldn't be surprised if someone came up with one. As for the policy continuing, I think that a "rolled back" failure should be treated like a failure in every other aspect, although messaging to the user should probably be different (either way it should show up as an error IMO, but if your recovery resource(s) also fail then it should probably be made crystal clear to the user that they need to manually intervene and figure out what state their system was left in.)

I like this better than what I came up with, its more flexible and just as straightforward and doesn't increase Provider complexity. It puts more onus on the user to figure out their own rollback strategy, but that also makes it more flexible.


#4

Cool!

I like this proposal and I'm glad you shared it, as honestly, having the config management model all solved makes life somewhat boring :)

I'll leave this thread here for a while to see if others have ideas on implementations... I was doing some thinking today about how I might architect push mode, so I think that's going to be my next big endeavor after some output cleanup.

While you're welcome to have a go at it, I'm suspecting I'll need to add this one as I suspect nobody quote groks the whole architecture just yet and this one may be a little brain-warping to pull off.

(If interested, what we are going to have to do is ALWAYS add every rollback resource to the graph while walking the resources, but somehow flag those resources as "for rollback" so that the executor code knows they are usually not executed. This would require a common internal field for each resource. Then, the executor will need to understand how to run all of those resources, catching exceptions along the way, and then deciding if rollbacks need to happen. If an error occured, it would need to mark that, then run any remaining flagged resources, and finally exiting with a fatal error before clearing the role. And of course it's going to take some docs and an example file somewhere.

I want to get either push or pull mode done this month - and honestly I probably want to do PUSH sooner just because it's easy, and many people doing pull can already script that up easily on their own for the short term, so this is probably looking to be a January kind of feature.

If anyone else has ideas on this, feel free to weigh in also!

Added a ticket to track this here - https://github.com/opsmop/opsmop/issues/16