I know I promised another "Anomaly Detection"-type of blog, but I lied. That’s because I’ve been thinking a lot about control systems lately. Could be because I spent the last three weeks immobilized due to an operation. Also, I guess, once an engineer always an engineer. But I digress. Before I dig deeper into what that has to do with anomaly detection and DevOps, though, let me walk quickly through a very basic review of Control Systems 101.
Think of what you need to do to keep your house at a comfortable temperature. You’re going to need a heater (the controlled system, with a heat output), a way to increase or lower the heat output of the heater (maybe a knob for the gas or electricity, aka a controller) and a way to measure the temperature of the house (a thermometer, aka a sensor). One way of getting to the ideal temperature is to keep increasing the input to the heater until the house is comfortable and then just lock it there. That’s what engineers like to call an open loop control system.
With an open loop heat control system, you can keep your house at your preferred temperature for a while. That is, until someone opens a door or a window, the weather outside warms up or cools down, or the gas valve starts acting up. At that point the house temperature starts drifting away from the ideal temperature, since the heater is locked to one input. If you’re lucky you can catch it before the heater burns down the house, or at least before the smoke detector wakes you up in the middle of the night!
What would be ideal is an automated way to keep fiddling with the input to the heater as the house temperature rises or drops, in order to keep the house at the optimal temperature. I’m of course talking about a thermostat. That system is what engineers like to call a closed loop control system. In order to have a closed loop system, control systems theory introduces the concepts of feedback and tracking error against a reference signal. The output from the sensor is fed back (feedback, I know, right?) into a differencing component that generates a correction based on the difference between this output and a reference (desired state) signal. That correction is then fed into the controller, which adjusts the system to get it back on track.
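To make the open loop versus closed loop difference concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the toy house model, the leak rate, the cold snap after 24 hours, and the simple proportional correction (a real thermostat may well use a different control law).

```python
REFERENCE = 20.0   # desired house temperature in deg C (the reference signal)
LOCKED_HEAT = 1.0  # heater input that holds 20 C while it is 10 C outside
LEAK = 0.1         # fraction of the indoor/outdoor gap lost per hour (toy value)


def outside_temp(hour):
    """Disturbance: mild for the first day, then a cold snap."""
    return 10.0 if hour < 24 else 0.0


def open_loop(_temp):
    """Open loop: ignore the thermometer and leave the knob where it was locked."""
    return LOCKED_HEAT


def closed_loop(temp, gain=0.9):
    """Closed loop: adjust the knob in proportion to the tracking error."""
    error = REFERENCE - temp                     # the differencing component
    return max(0.0, LOCKED_HEAT + gain * error)  # a heater cannot cool


def simulate(controller, hours=48, start_temp=REFERENCE):
    """Step the toy house model hour by hour; return the temperature history."""
    temp = start_temp
    history = []
    for hour in range(hours):
        heat = controller(temp)  # sensor reading fed back to the controller
        temp += LEAK * (outside_temp(hour) - temp) + heat
        history.append(temp)
    return history


if __name__ == "__main__":
    print(f"open loop after 48h:   {simulate(open_loop)[-1]:.1f} C")
    print(f"closed loop after 48h: {simulate(closed_loop)[-1]:.1f} C")
```

Under these assumptions the open loop house sags to about 11 °C a day into the cold snap, while the closed loop one settles at about 19 °C. The remaining one-degree gap is the usual steady-state offset of a purely proportional correction.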
You can see where I’m going with this. In essence, as a DevOps team, we’re trying to maintain our systems in an ideal state, using a whole bunch of tools that affect the state of our systems, sometimes in unpredictable ways. Let’s take a look at the closed loop diagram with a different set of labels.
We have a lot of the tools and components that make up a control system, but I would go out on a limb and assert that our current situation is more akin to an open loop control system than a closed loop one: we get our systems running just the way we like them, then we do something (provision new machines, update packages, deploy new versions, etc.) or some external event perturbs the system, and it starts drifting or misbehaving. If we’re lucky, we notice the results and take corrective action before the 2am PagerDuty alert comes in and the troubleshooting starts. What we’re missing is a systematic way to generate the feedback and tracking error differencing components that will give us the closed loop system, so we can get one step closer to full-on automation. Granted, some tools like Puppet/Chef/CFEngine do have features that approximate aspects of a closed loop system if they’re run regularly. But many people don’t run these every 30 minutes, and these tools don’t manage everything on a system, so there are still lots of areas that are “open loop”. Once we have that tracking error component, the next step of course is to automate what corrective action to take based on the difference between desired state and current state: presto, self-healing systems!
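As a rough sketch of what a tracking error component could look like for a server, the snippet below diffs a desired state against an observed one. The function name, the flat item-to-version dictionaries, and the package names and versions are all hypothetical; real system state is far richer than this.

```python
def tracking_error(desired, observed):
    """Return {item: (desired_value, observed_value)} for every item that differs.

    A value of None means the item is absent on that side.
    """
    drift = {}
    for item in sorted(desired.keys() | observed.keys()):
        want, have = desired.get(item), observed.get(item)
        if want != have:
            drift[item] = (want, have)
    return drift


# Hypothetical reference signal (desired) and sensor reading (observed).
desired_state = {"nginx": "1.24.0", "openssl": "3.0.13", "app": "v42"}
observed_state = {
    "nginx": "1.24.0",
    "openssl": "3.0.11",  # drifted from the desired version
    "app": "v42",
    "telnetd": "0.17",    # present on the box but not in the desired state
}

print(tracking_error(desired_state, observed_state))
```

An empty result means the system is on target. Anything else is the signal a controller would act on.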
This is not a completely unbiased post, but it says something about why we’re doing what we’re doing at Metafor. As an example of a first cut at a self-healing system, one of our beta users is doing something really cool. He’s scheduling a system-wide differencing on a regular basis, then feeding the resulting difference report into some logic he developed to make decisions about corrective actions. One of the easily automatable (is that a word?) actions of course is to terminate the misbehaving server and rebuild it. Another would be to use something like Puppet or Chef to reset it. There are other possible actions of course, but he’s getting impressive results. I’d be very interested in hearing from other people out there who are achieving higher levels of automation by closing that feedback loop!
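I can’t share his actual logic, so the following is only a hypothetical sketch of the kind of decision step described above. The report format, the policy thresholds, the critical item names and the action labels are all made up for illustration, and it returns a label rather than calling any real tool.

```python
def choose_action(drift, critical_items=("kernel", "openssl"), rebuild_threshold=5):
    """Map a difference report to a corrective action label.

    drift: dict of item -> (desired_value, observed_value), empty when on target.
    Returns "none", "reset" (e.g. re-apply config management) or "rebuild"
    (terminate the server and build a fresh one).
    """
    if not drift:
        return "none"
    touches_critical = any(item in critical_items for item in drift)
    if touches_critical or len(drift) >= rebuild_threshold:
        return "rebuild"  # too much or too risky to patch in place
    return "reset"        # small drift: push the box back to the desired state


# Hypothetical difference reports from three servers.
reports = {
    "web-1": {},
    "web-2": {"nginx.conf": ("sha:ab12", "sha:ff09")},
    "web-3": {"openssl": ("3.0.13", "3.0.11")},
}

for server, drift in reports.items():
    print(server, "->", choose_action(drift))
```

Keeping the decision separate from the execution makes it easy to log or dry-run the chosen actions before trusting the loop to act on its own.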
Next blog, more Anomaly Detection! I promise!