
A Guide to Incident Response for Site Reliability Engineers (SRE)


As an SRE, you combine development and operations principles, focusing on code that automates IT operations tasks such as systems management and incident response. Today’s cloud environments enable near-unlimited scalability through simple, often automated, processes. In these environments, traditional operations management practices based on manual intervention have become unsustainable. SREs apply software-based solutions to keep operations up to speed while minimizing service interruptions. In the words of Ben Treynor Sloss, VP of engineering at Google and coiner of the term, "SRE is what happens when you ask a software engineer to design an operations team."

Fighting Firestorms with Fire - Best Practices for SRE Incident Response

Incident response is a critical area of responsibility for SREs. When unexpected runtime incidents occur throughout the software development lifecycle (SDLC), site reliability engineers who have a practical understanding of how code translates into downstream behaviors can radically reduce operational response times. However, the true incident response value of SRE accrues over time as lessons learned from previous incidents feed back into the development process, allowing software engineers to build operations insights and risk mitigation into the code they write. 

Step 1: Time Management

To ensure you serve this purpose and don’t simply become a full-time conscript of the operations team, Google’s guide to SRE best practices caps the time site reliability engineers spend on operations tasks at 50% (spending less than that on operations is, of course, fine). The lesson this rule implies is that built-in prevention, automation, and codified response procedures ultimately trump the value of time spent solving day-to-day operational issues. This is challenging because incidents are, by nature, interrupt-driven. While you cannot control when an incident will arise, you can time-box your efforts based on priority, as sketched below. Any issue impacting application uptime requires all hands on deck, but anything less can be compartmentalized to ensure you are not stuck in incident response purgatory.
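As a rough illustration of priority-based time-boxing, here is a minimal Python sketch that maps hypothetical severity levels to the maximum time to spend before escalating or handing off. The severity labels and budgets are placeholder assumptions; substitute your organization’s own priority scheme and SLO-driven limits.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical severity levels and time-box budgets -- swap in your
# organization's own priority scheme and limits.
TIMEBOXES = {
    "sev1": None,                    # uptime-impacting: all hands, no time-box
    "sev2": timedelta(hours=2),      # degraded service: work it, then escalate
    "sev3": timedelta(minutes=30),   # minor anomaly: quick look, then file a ticket
}

def timebox_for(severity: str) -> Optional[timedelta]:
    """Return how long to work an incident solo before escalating or deferring."""
    return TIMEBOXES.get(severity.lower(), timedelta(minutes=30))

print(timebox_for("SEV2"))  # 2:00:00
```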

It is also important to begin collaborating as soon as possible. While you do not want to be a drain on other people’s time, bringing in the right people (e.g., the developer who wrote the code, the architect who set up the platform configuration) lets you short-circuit the effort to reverse engineer what is happening by pairing with someone who knows what the expected outcome should be and what could be causing the new one.

Step 2: Automation Focus

In our blog, “What is MTTK and how does it Relate to Container Security”, we discuss the concept of “Mean-Time-To-Know”. MTTK, or how long it takes to understand everything that occurred so it can be resolved, is the long pole when it comes to incident response. When developing automation, focus on small, repetitive tasks that help gather the information needed to reduce the Time-To-Know. While it's easy to get pulled into the desire to automate a full use case end-to-end, you will find more immediate value automating smaller, frequently performed tasks in your incident response.
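To make this concrete, here is a minimal sketch of the kind of small, repetitive evidence-gathering worth automating first: it shells out to a few standard kubectl commands and bundles their output into one snapshot a responder can read. The command list and snapshot format are assumptions about a Kubernetes environment, not a reference to any specific tool.

```python
import subprocess
from datetime import datetime, timezone

# Small, frequently repeated evidence-gathering steps, bundled so a responder
# runs one command instead of several. Assumes kubectl is installed and the
# current context points at the affected cluster.
COMMANDS = {
    "recent_events": ["kubectl", "get", "events", "-A", "--sort-by=.lastTimestamp"],
    "unhealthy_pods": ["kubectl", "get", "pods", "-A",
                       "--field-selector=status.phase!=Running"],
    "node_pressure": ["kubectl", "describe", "nodes"],
}

def collect_snapshot() -> str:
    """Run each diagnostic command and concatenate the output into one report."""
    sections = [f"# snapshot {datetime.now(timezone.utc).isoformat()}"]
    for name, cmd in COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        body = result.stdout if result.returncode == 0 else result.stderr
        sections.append(f"\n## {name}\n{body}")
    return "\n".join(sections)

if __name__ == "__main__":
    print(collect_snapshot())
```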

Step 3: Create Dashboards with Thresholds You Can Trust

For most engineers, the ‘single pane of glass’ concept usually becomes a ‘single pain of glass’. The reason is that the process of collecting data into a single location, parsing and processing it into usable metrics, and converting those metrics into helpful dashboard widgets becomes a management nightmare.

Instead, focus on key metrics derived from primary data (what we call ground-truth data in our blog Kubernetes Security Incidents Are on the Rise – Here’s What You Can Do About It). While alerts, events, and the like require interpretation and inevitably involve false positives, ground-truth data records facts: real states and events such as CPU, memory, new processes, and new network connections. Since you don’t want to stare at a dashboard all day, set thresholds that make sense to you and your organization so you can investigate very specific resources when alerts trip or thresholds are exceeded.
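As a minimal sketch of a threshold check on ground-truth signals, the snippet below samples host CPU, memory, and newly appearing processes with the psutil library and flags anything over a limit. The threshold values are placeholder assumptions, and this is a single-host illustration rather than a description of any particular monitoring product.

```python
import psutil  # third-party: pip install psutil

# Hypothetical thresholds -- tune these to what "worth a look" means for your
# service, not to what a vendor dashboard suggests.
CPU_PCT_MAX = 85.0
MEM_PCT_MAX = 90.0
NEW_PROC_MAX = 20  # brand-new processes per sampling interval

def check_ground_truth() -> list:
    """Compare a few ground-truth signals against thresholds and report breaches."""
    before = set(psutil.pids())
    cpu = psutil.cpu_percent(interval=5)          # sampled over 5 seconds
    new_procs = len(set(psutil.pids()) - before)  # processes that appeared meanwhile
    mem = psutil.virtual_memory().percent

    findings = []
    if cpu > CPU_PCT_MAX:
        findings.append(f"CPU at {cpu:.0f}% (threshold {CPU_PCT_MAX:.0f}%)")
    if mem > MEM_PCT_MAX:
        findings.append(f"memory at {mem:.0f}% (threshold {MEM_PCT_MAX:.0f}%)")
    if new_procs > NEW_PROC_MAX:
        findings.append(f"{new_procs} new processes in the last interval")
    return findings

if __name__ == "__main__":
    for finding in check_ground_truth():
        print("investigate:", finding)
```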

Step 4: Process-Focused Postmortems

Runtime Incident Response with Spyderbat

When unexpected behaviors in your applications set incident response procedures in motion, it isn’t necessarily clear at the outset whether the incident represents an attack or an internal anomaly. The quicker teams definitively answer this question, the faster – and more effectively – they can resolve the incident. However, in cloud-native and containerized environments, runtime visibility into applications distributed across microservices is limited at best and often altogether lacking. 

Spyderbat’s cloud-native runtime monitoring platform decisively eliminates the incident response lag time that accumulates while security issues are sorted from other operational concerns. Based on ground-truth eBPF technology, Spyderbat’s interface renders all system activities in a temporal causal graph that traces system and container activities and processes to their root causes in a matter of moments.
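For intuition only, here is a toy sketch of the underlying idea of tracing an activity back toward its origin through process ancestry on a single host, using psutil. It is a deliberately simplified stand-in for illustration; Spyderbat’s causal graph is built from eBPF telemetry across hosts and containers and goes well beyond this single-host view.

```python
import psutil  # third-party: pip install psutil

def ancestry(pid: int) -> list:
    """Walk a process's parent chain, returning pid/name pairs, oldest ancestor last."""
    chain = []
    proc = psutil.Process(pid)
    while proc is not None:
        chain.append(f"{proc.pid} {proc.name()}")
        parent = proc.parent()
        if parent is None or parent.pid == proc.pid:
            break
        proc = parent
    return chain

if __name__ == "__main__":
    # Example: trace the current Python interpreter back toward init/systemd.
    for step in ancestry(psutil.Process().pid):
        print(step)
```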

To learn more about enhancing your organization’s SRE and incident response capabilities, contact Spyderbat to schedule an interactive workshop with your team.
