Day 2 operations are the ongoing tasks and responsibilities involved in maintaining the reliability, availability, and performance of software development and production environments. In short, Day 2 operations keep runtime from becoming downtime.
So Why ‘Day 2’ Operations?
The concept of an application lifecycle has gone through many iterations at various levels of granularity. Developer-minded folks often concentrate on the early stages of “design, deploy, build.” Operations and IT teams often think in terms of “dev, test, prod.” Cloud DevOps or platform teams often organize efforts around stages of “build, deploy, run.”
Each of these is a useful abstraction, but they all differ slightly, which makes communication less clear, not more. For modern app development, it helps to organize effort and build tooling around higher-level concepts that let different teams align on the work needed to put an application into use successfully.
Thus, the introduction of our metaphorical app lifecycle “days” (which of course might actually take just minutes, or last for years):
Day 0 - Design and build stage
Day 1 - Infrastructure and code deployment stage
Day 2 - Runtime
As apps have moved to cloud-based, software-defined infrastructure and tooling, a lot of emphasis has been placed on Day 0 and Day 1 to ‘get things going’ and set up the underlying systems required to manage containerized code. CI/CD systems, cloud/Terraform configuration, Kubernetes admission control, service mesh policy: all of these and more are part of the infrastructure-as-code that has to be put in place for applications to run. Configuration validation, testing, and policy-as-code guardrails, all so-called “shift left” efforts, are critical to ensuring cloud infrastructure will operate correctly.
However, once an app is deployed and uptime matters, Day 2 operations become critical.
Day 2 operations involve tasks aimed at ensuring the reliable, predictable operation of the entire app ecosystem and addressing issues as quickly as possible when they arise. To that end, here are six key aspects of Day 2 operations for an SRE:
1. Monitoring and Alerting
First and foremost, it’s critical to identify health and security anomalies before they cause platform or application downtime. Performance management solutions help you monitor key infrastructure metrics such as CPU and memory utilization, but it’s equally important to monitor the runtime behavior of your applications. This includes checking for consistent use of third-party components, running processes, and network connections. With the right observability in place, systems can automatically track typical app behavior and alert human teams only when critical action is required.
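As a rough illustration of alerting only on departures from typical behavior, the sketch below flags a metric sample that falls far outside a recorded baseline. The function name `is_anomalous`, the sample values, and the three-standard-deviation threshold are all assumptions made for this example; real observability tooling uses more sophisticated baselining.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Return True when `value` lies more than `threshold` standard
    deviations away from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical CPU utilization samples (percent) from normal operation.
cpu_history = [41.0, 43.5, 40.2, 42.8, 44.1, 41.7]

print(is_anomalous(cpu_history, 43.0))  # within the usual range
print(is_anomalous(cpu_history, 95.0))  # far outside the usual range
```

Only the second sample would page a human here; the first stays inside the range the system has already seen.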
2. Configuration Management
While the configuration of the infrastructure was considered in Day 0, in your Day 2 operations it is important to validate that the running state matches the intended state defined before deployment. When apps are pushed into live environments, deployment configuration, infrastructure settings, and service dependencies can all cause unexpected app behavior. Runtime changes should be funneled back into CI/CD configuration changes, or at minimum documented, version-controlled, and tested before being redeployed to production. All pre-deployment guardrails should also be validated against the running configuration to confirm they are working as expected. Automating these configuration processes reduces the risk of error.
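Comparing the intended state with the running state can be sketched as a simple dictionary diff. The names `find_drift`, `desired_state`, and `running_state`, along with the settings shown, are hypothetical and chosen only for illustration; in practice the two sides would come from version-controlled configuration and from the live environment.

```python
ABSENT = "<absent>"  # placeholder for a setting that exists on only one side


def find_drift(desired, running):
    """Return {setting: (desired_value, running_value)} for every
    setting whose running value differs from the desired one."""
    drift = {}
    for key in sorted(set(desired) | set(running)):
        want = desired.get(key, ABSENT)
        have = running.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical desired state (from version control) and observed state.
desired_state = {"replicas": 3, "image": "shop-api:1.4.2", "memory_limit": "512Mi"}
running_state = {"replicas": 3, "image": "shop-api:1.4.2",
                 "memory_limit": "1Gi", "debug": True}

for key, (want, have) in find_drift(desired_state, running_state).items():
    print(f"{key}: desired={want!r} running={have!r}")
```

Each reported difference is a candidate for either a fix in the running environment or a change fed back into the version-controlled configuration.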
3. Performance Management
Performance should be optimized continuously so that applications deliver the experience users expect as economically as possible. This means identifying performance bottlenecks, conducting profiling and tuning activities, and collaborating with development teams to improve code efficiency and reduce latency.
4. Capacity Sizing and Scaling
While your Day 0 planning accounts for expected capacity requirements, it is important to review the actual system and application usage patterns, along with performance metrics, to understand more realistic capacity needs and to project future growth.
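One simple way to project future growth from observed usage is to fit a straight line to recent samples and extrapolate. The sketch below does this with a plain least-squares fit; the function name `project_usage` and the weekly figures are assumptions for illustration, and a linear trend is only one of many possible growth models.

```python
def project_usage(samples, periods_ahead):
    """Fit a least-squares line to equally spaced usage samples and
    return the value it predicts `periods_ahead` periods after the
    last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    # Slope = covariance(x, y) / variance(x), with x = 0, 1, ..., n - 1.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical weekly peak CPU utilization (percent) for the last four weeks.
weekly_peak_cpu = [50.0, 54.0, 58.0, 62.0]

projected = project_usage(weekly_peak_cpu, periods_ahead=4)
print(f"Projected peak CPU in four weeks: {projected:.1f}%")
```

Comparing a projection like this against the capacity assumed during Day 0 planning shows how much headroom remains before scaling is needed.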
5. Incident Management
When incidents occur, SREs play a critical role in incident response and management. They participate in resolving incidents by investigating root causes, coordinating with different teams (e.g. development, security) to build fixes, and restoring the system to normal operation as quickly as possible.
6. Continuous Improvement
By performing post-incident analysis, holding blameless post-mortems, and drawing on past experience, SREs continuously improve the stability of the application, the efficiency of the platform, and the security of the environment. They do this by looking for opportunities to replace repetitive tasks with automation and by setting appropriate processes for developers, infrastructure teams, and security teams. This may include implementing chaos engineering, fault injection, and resilience testing.
Overall, Day 2 operations for an SRE combine proactive monitoring, incident response, capacity planning, performance optimization, configuration management, and continuous improvement to ensure the reliability and availability of the system over its lifecycle. And because Day 2 can last for months (or longer, in the case of some applications!), the sprawl of runtime variables is always growing, making automation more critical in runtime than in any other phase of deployment.
Spyderbat for your Day 2 Operations
The Spyderbat Cloud Native Runtime Security platform is a powerful ally for SREs in their Day 2 operations. It provides SREs with an early warning system that identifies runtime execution issues and a DVR-like capability to record activity and pinpoint root cause. This reduces the time otherwise spent hunting through system logs, custom application logs, and esoteric cloud-provider logs, leaving more time for developing automation and optimization.