By: Dina Levitan (dinalevitan.com) and Lynn Cyrin (lynncyrin.me)
Site Reliability Engineering (SRE), as an organization, has a reputation as a gatekeeper, guarding and protecting production systems. Because of this, I sometimes think of SREs as the "Knights Who Say No." Yet, is this really the reputation we want for SRE?
Let's step back to understand why SREs got this reputation in the first place. The primary goal of a production-minded engineer is to make sure that the system will be up and available for end users, per SLO guarantees. As Ben Treynor says, "The incentives of a team with operational duties is to ensure that the thing doesn’t blow up on their watch." And sometimes, in order to keep the queries flowing and the pagers silent, it can be important to narrow scope and ensure that system changes stay within what is supportable and maintainable.
However, as production-minded engineers, we can't lose sight of the fact that our goal is ALSO to give users new products and services, to enable the product development team to innovate and introduce new features. As stated in the Google SRE Workbook, "Simply put, SRE principles aim to maximize the engineering velocity of developer teams while keeping products reliable." Maximizing engineering velocity requires an enablement approach: the SRE goal should be to extend the capabilities of coworkers rather than constrain them.
So along these lines, when an engineer wants to launch a new product feature that does all kinds of unorthodox things in the system, SREs should view this from the perspective of: "How can we work together to make this crazy product idea feasible?" After all, SREs and product developers are partners, not adversaries.
But what happens too often is that new ideas get instinctively shut down with a comment along the lines of "this feature introduces too much technical complexity" or "you can't abuse the system this way!"
If you respond with "no" to the majority of people's suggestions, you become an adversarial boundary to overcome: the troll guarding the bridge to the product-release castle, who must be defeated with the Perfectly Crafted Proposal, complete with beans perfectly counted and sorted.
My proposal for the DevOps community: move from the gut nay-saying reaction that generates stop energy to an improv-inspired "Yes, and..." mindset. Practically, this looks like saying "let's think together about how we can achieve this goal" or "let's talk about how to design this feature to minimize added complexity," even if that's not the first thing that comes to mind.
As someone with years of SRE training, I immediately think of the objection: "but sometimes I HAVE to say no!" And yes, indeed, sometimes an idea has more negatives than positives. For these cases, I like the approach of not saying no directly. In fact, for purely technical conversations, in many cases you can avoid saying no altogether, and instead come from the perspective: "if we decide to go this route, here are the other consequences of that decision." In other words, you share "The Opportunity Cost of the Yes". For example, "yes, we can do this, and it will cost 1000 widgets". Or, "if we choose to move forward with this approach, we will need to deprioritize these 6 other projects in order to ensure launch safety." Or, "that's interesting! Here's another option with a 10x performance improvement." With this language, SRE and product development are partners, working together to reach the right decision for the overall product.
At the same time, within the SRE/developer relationship, sometimes boundaries do need to be set and expectations managed accordingly. Think, "no, we really can't launch this risky experiment without a simple way to turn it off, because an unexpected failure mode could break the entire system for users everywhere." When SREs do need to lay down the law, we are relying on our developer counterparts to understand, and know that we are coming from a place of prioritizing a reliable system and user experience. If we want our counterparts to hear what we have to say, however, then we need to be someone with the reputation of trying to help and partner with them. For this reason, producing a "No" that the developer doesn't understand isn't super useful. If someone asks that developer why they aren't doing something, at best they can then respond "{Insert name here} said No."
Instead of "SREs...the knights who say No", let's be "SREs...the knights who say 'Yes, and…', who help make complex distributed systems possible". The current culture has far-reaching effects beyond just the developer-SRE relationship. Years ago, as a new SRE, I internalized that my job was to be generally push-backy in the name of "protecting production." It was only in the last couple of years that I realized: "Wait a second. That's not who I am. Doing this feels bad. This is not a professional culture that works for me."
I don't think that Google intended for this to be the prevailing culture when the SRE practice was first developed; rather, I think it arose as a reaction to the operational overload and general disempowerment that operationally-focused systems engineering teams faced at that time. But times have changed, and the role of the SRE has become clearer and better understood over the years. SREs no longer need to produce a harsh "No" just to be heard, and this overcorrection is no longer productive. For the sake of SRE as a whole, we need to make it a field that's more constructive and forward-looking, where we take an enablement approach to increase developer velocity while continuing to prioritize reliable user experiences.
Original image source: https://en.wikipedia.org/wiki/Knights_Who_Say_%22Ni!%22