How to Do SLAs Right

The way most people think about maintenance and support is wrong. Wrong in the same way that the healthcare system is broken. Here’s a brilliant way to fix it, so brilliant that I wish I’d thought it up myself.

Someone told me that in ancient China, the village doctor was paid a wage. Except when someone got sick. When someone got sick, it was the doctor’s fault, and the doctor should not be paid… right? It may sound strange, but the logic holds. In the modern Western medical world, doctors are paid more if they prescribe more drugs. This is a royally bad idea that benefits only drug companies. Then we pay insurance companies to cover our medical costs. The insurance companies cover more and more drugs, raising the fees, while the doctors prescribe more and more drugs, raising the costs. It’s a perfectly vicious cycle.
In software, we build a product and then it goes live. Outages cost money, so our clients seek to mitigate the risk. Enter the modern maintenance and on-call support contract. A typical on-call support deal consists of a fixed fee for the hours spent on call; the fee goes up if the required response time is shorter or the hours are less sociable. On top of that, the client is charged an hourly fee per incident, which depends on the severity of the incident and the time it takes to solve it.

The incentives here are all screwed up. A software company that gives in to temptation may put subtle bugs in the software, or simply avoid fixing them, so that it makes more money from incidents. A timely and adequate response to an incident builds trust with the patient (sorry, client), and they’ll just keep coming back for more.

What if we did things differently and fixed the incentives? They tried it in China, and the results were very positive. I’d like to do the same thing for software and apply it at my company, squads.com. The base cost of the contract is set up front. Every time there is an incident, the client gets a reduction in the monthly fee. The longer the incident stays open, the lower the monthly fee. The base price should be high enough to incentivize the doctors (sorry, engineers) to reduce the risk of issues.

How fault-tolerant would such a system get? The higher the monthly fee, the stronger the incentive to keep the money flowing. The client can play with the base fee and the penalty structure to ensure there is a good balance between sticks and carrots.

If there are no incidents, the client could lower the base fee… you might think. But that’s the wrong kind of incentive again. The only reason to lower the fee would be that the cost of the risk goes down (meaning the company is winding down its business). In a growing business, the cost of the risk goes up, so the budget for mitigation should go up as well.
Instead of lowering the base price, a client could hire hackers to try to cause an incident. The hackers could be incentivized to cause an outage during a low-load window, making things extra fun for the engineers. Nothing like a Saturday night outage to motivate the engineers to do better next time.

So the rules of the game should be:
  • the client+team set the base fee,
  • the client+team set the penalty structure for incidents,
  • the team is free to do as little or as much as they want in terms of prevention,
  • the client is free to intentionally cause incidents to reduce the fee via penalties.
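To make the rules above concrete, here is a minimal sketch of how such a fee could be calculated. Everything in it is my own assumption for illustration: the names (Contract, Incident, monthly_fee), the choice of a flat reduction per incident plus a reduction per hour the incident stays open, and the floor that stops the fee from going negative. The actual penalty structure is whatever the client and team negotiate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    """One incident in the billing month (hypothetical model)."""
    hours_open: float


@dataclass(frozen=True)
class Contract:
    """Hypothetical penalty-based support contract."""
    base_fee: float              # monthly fee agreed by client and team
    penalty_per_incident: float  # flat reduction for every incident
    penalty_per_hour: float      # extra reduction per hour an incident stays open
    floor: float = 0.0           # assumption: the fee never drops below this

    def monthly_fee(self, incidents: Iterable[Incident]) -> float:
        # Each incident lowers the fee; the longer it is open, the lower the fee.
        total_penalty = sum(
            self.penalty_per_incident + self.penalty_per_hour * incident.hours_open
            for incident in incidents
        )
        return max(self.floor, self.base_fee - total_penalty)


if __name__ == "__main__":
    contract = Contract(base_fee=10_000, penalty_per_incident=500, penalty_per_hour=100)
    quiet_month = []
    rough_month = [Incident(hours_open=2), Incident(hours_open=5)]
    print(contract.monthly_fee(quiet_month))
    print(contract.monthly_fee(rough_month))
```

With these made-up numbers, a quiet month pays the full base fee, and every incident (including one the client caused on purpose) eats into it. Raising base_fee or the penalties is how the client tunes the balance between carrots and sticks.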
It might be prudent to negotiate rules for changing the base fee and penalty structure over time as well. The more income security the doctors have when things go well, the more motivated they will be to invest in long-term prevention. This way, maintenance and on-call support turn into an interesting game where the incentives all point in the right direction: a well-hardened system.

I’d love to hear your thoughts on this!