Abstraction Cascade

Tue 14 November 2017 by Moshe Zadka

(This is an adaptation of part of the talk Kurt Rose and I gave at PyBay 2017)

An abstraction cascade is a common anti-pattern in legacy system. It is useful to understand how to recognize it, how it tends to come about, how to fix it -- and most importantly, what kind of things will not fix it. The last one is important, in general, for anti-patterns in legacy systems: if the obvious fix worked, it would have been already dealt with, and would not be a common anti-pattern in legacy systems.

Recognition

The usual pattern for a abstraction cascade looks like complicated, ad-hoc, if/else sequence to decide which path to take. Here is example for a abstraction cascade for finding out a network address corresponding to a name:

def get_address(name):
    if name in services:
        if services[name].ip:
            return service[name].ip, service[name].port
        elif services[name].address:
            # Added for issue #2321
            if ':' in services[name].address:
               return service[name].address.split(':')
            else:
               # Fixes issues #6985
               # TODO: Hotfix, clean-up later
               return service[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

History

At each step, it seems reasonable to make a specific change. Here is a typical way this kind of code comes about.

The initial version is reasonable: since DNS is a way to publish name to address mapping, why not use a standard?

def get_address(name):
    return dns_lookup(name), DEFAULT_PORT

Under load, an outage happened. There was no time to investigate how to configure DNS caching or TTL better -- so the "popular" services got added to a static list, with a "fast path" checking. This decision also makes sense: when an outage is ongoing, the top priority is to relieve the symptoms.

def get_address(name):
    if name in services:
        # Fixes issues #6985
        # TODO: Hotfix, clean-up later
        return service[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

However, now the door has opened to add another path in the function. When the need to support multiple services on one host happened, it was easier to just add another path: after all, this was only for new services.

def get_address(name):
    if name in services:
        # Added for issue #2321
        if ':' in services[name].address:
            return service[name].address.split(':')
        else:
            # Fixes issues #6985
            # TODO: Hotfix, clean-up later
            return service[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

When the change to IPv6 occured, splitting on : was not a safe operation -- so a separate field was added. Again, the existing "new" services (by now, many -- and not so new!) did not need to be touched:

def get_address(name):
    if name in services:
        if services[name].ip:
            return service[name].ip, service[name].port
        elif services[name].address:
            # Added for issue #2321
            if ':' in services[name].address:
               return service[name].address.split(':')
            else:
               # Fixes issues #6985
               # TODO: Hotfix, clean-up later
               return service[name].address, DEFAULT_PORT
    return dns_lookup(name), DEFAULT_PORT

Of course, this is typically just chapter one in the real story: having to adapt to multiple data centers, or multiple providers of services, will lead to more and more of these paths -- with nothing thrown away, because "some legacy service depends on it -- maybe".

Non-fixes

Fancier dispatch

Sometimes the ad-hoc if/else pattern is obscured by more abstract dispatch logic: for example, something that loops through classes and finds out which one is the right one:

class AbstractNameFinder(object):
    def matches(self, name):
        raise NotImplementedError()
    def get_address(self, name):
        raise NotImplementedError()
class DNS(AbstractNameFinder):
    def matches(self, name):
        return True
    def get_address(self, name):
        return dns_lookup(name), DEFAULT_PORT
class Local(AbstractNameFinder):
    def matches(self, name):
        return hasattr(services.get(name), 'ip')
    def get_address(self, name):
        return services[name].ip, services[name].port
finders = [Local(), DNS()]
def get_address(name):
    for finder in finders:
        if finder.match(name):
            return finder.get_address(name)

This is actually worse -- now the problem can be spread over multiple files, with no single place to fix it. While the code can be converted to this form, semi-mechanically, this does not fix the underlying issue -- and will actually make the problem continue on with force.

Pareto fix

The Pareto rule is that 80% of the problem is solved with 20% of the effort. It is often the case that a big percentage (in the stereotypical Pareto case, 80%) of the problem is not hard to fix.

For example, most services are actually listed in some file, and all we need to do is read this file in and look up based on that. The incentive to fix "80% of the problem" and leave the "20%" for later is strong.

However, usually the problem is that each of those "Pareto fixes" again makes the problem worse: since it is not a complete replacement, another dispatch layer needs to be built to support the "legacy solution". The new dispatch layer, the new solution, and the legacy solution all become part of the newest iteration of the legacy system, and cause the problem to be even worse.

Fixing 80% of the problem is useful for prototyping, since we are not sure we are solving the right problem and nothing better exists. However, in this case, the complete solution is necessary, so neither of these conditions hold.

Escape strategy

The reason this happens is because no single case can be removed. The way forward is not to add more cases, but to try and remove a single case. The first question to ask is: why was no case removed? Often, the reason is that there is no way to test whether removal is safe.

It might take some work to build infrastructure that will properly make removal safe. Unit tests are often not enough. Integration tests, as well, are sometimes not enough. Sometimes canary systems, sometimes feature flag systems, or, if worst comes to worst, a way to test and roll-back quickly if a problem is found.

Once it is possible to remove just one case (in our example above, maybe check what it would take to remove the case where we split on a colon, since this is clearly worse than just having separate attributes), thought needs to be given to which case is best.

Sometimes, there is more than one case that is really needed: some inherent, deep, trade-off. However, it is rare to need more than two, and almost unheard of to need more than three. Start removing unneeded cases one by one.

Conclusion

When seeing an abstraction cascade, there is a temptation to "clean it up": but most obvious clean-ups end up making it worse. However, by understanding how it came to be, and finding a way to remove cases, it is possible to do away with it.