Site Reliability Engineering Challenges and Solutions

What is Site Reliability Engineering?

Site Reliability Engineering (SRE) is a practice that applies software development skills and mindset to IT operations, with the goal of improving the reliability of high-scale systems through automation and continuous integration and delivery. The concept originated at Google in the early 2000s and was documented in a book of the same name, Site Reliability Engineering. SRE shares many governing concepts with DevOps: both rely on a culture of sharing, metrics, and automation, and SRE is often thought of as a specific implementation of DevOps. The SRE role is common in digital enterprises and is gaining momentum in traditional IT teams. Part systems administrator, part second-tier support engineer, and part developer, an SRE needs to be inquisitive by nature: always acquiring new skills, asking questions, and solving problems by embracing new tools and automation.

SRE Challenges

An SRE contributes to the business by automating tasks to eliminate unnecessary work and roles, and by helping to lower overall cost through optimizing resources and shortening mean time to repair (MTTR).
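As a rough illustration of the MTTR metric mentioned above, the sketch below averages the time between detection and resolution over a set of incidents. The function name `mean_time_to_repair` and the sample incident timestamps are hypothetical, chosen only for this example, and teams may define the start and end of "repair" differently.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected, resolved) timestamp pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (time detected, time resolved).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),      # 30 minutes
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),  # 90 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Tracking this value over time is one simple way to see whether automation and better tooling are actually shortening repairs.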

Key areas of SRE focus are:

  • Reliability—Maintaining a high level of network and application availability.
  • Monitoring—Implementing performance metrics and establishing benchmarks in order to monitor the systems.
  • Alerting—Readily identifying any issues and ensuring that a closed-loop support process is in place to solve them.
  • Infrastructure—Understanding cloud and physical infrastructure scalability and limitations.
  • Application Engineering—Understanding all application requirements including testing and readiness needs.
  • Debugging—Understanding the systems, log files, code, use cases, and troubleshooting techniques well enough to debug as needed.
  • Security—Understanding common security issues, as well as tracking and addressing vulnerabilities, to ensure the systems are properly secured.
  • Best Practices Documentation—Prescribing solutions, production support playbooks, etc.
  • Best Practices Training—Promoting and evangelizing SRE best practices through production readiness reviews, blameless postmortems, technical talks, and tooling.

Other domains overlap with the SRE's role, including DevOps, IT Service Management (ITSM), the Agile Software Development Life Cycle (SDLC), and other organizational frameworks. SRE and DevOps/NetDevOps teams are complementary: monitoring solutions that address the needs of both make it easier to share information across teams, so that collaborative troubleshooting leads quickly to problem resolution.

SRE Best Practices

In simple terms, SREs run services and are ultimately responsible for the health of those services. A service is a set of networked systems operated for users, who may be internal or external. Successfully operating a service entails a wide range of activities: developing monitoring capabilities, planning capacity, responding to incidents, ensuring the root causes of outages are addressed, and so on.

SRE represents a break from existing industry best practices for managing large, complicated services. Originally influenced by software engineering, SRE has grown into a set of principles, practices, and incentives that form a field of endeavor within the larger DevOps discipline.

Key SRE best practices include:

  • Engaging in and improving the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
  • Supporting services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Maintaining services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scaling systems sustainably through mechanisms like automation, and evolving systems by pushing for changes that improve reliability and velocity.
  • Practicing sustainable incident response and blameless postmortems.
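To make the "measuring and monitoring availability, latency and overall system health" practice above a little more concrete, here is a minimal sketch of two common measurements: availability as the fraction of successful requests, and a latency percentile computed with the nearest-rank method. The function names, request counts, and latency samples are hypothetical, and a production system would normally take these figures from its monitoring platform rather than compute them by hand.

```python
import math


def availability(successful_requests, total_requests):
    """Fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples (pct in 0..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical measurements for one reporting window.
latencies_ms = [120, 95, 110, 480, 105, 99, 130, 101, 97, 115]

print(availability(9990, 10000))     # 0.999
print(percentile(latencies_ms, 90))  # 130
```

A percentile is used here instead of a plain average because a single slow request (480 ms in the sample) would otherwise distort the picture of typical latency.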

Network Intelligence technology addresses many of the challenges associated with pursuing SRE best practices. To ensure system availability and optimal network performance, network operations and SRE teams need detailed, accurate insight into both the network and the applications that run on it.