- Understanding Domain 2: Service Level Objectives
- Service Level Indicators (SLIs): Foundation of Measurement
- Service Level Objectives (SLOs): Setting Reliability Targets
- Service Level Agreements (SLAs): Business Commitments
- Error Budgets: Managing Risk and Innovation
- SLO Implementation Strategies
- Domain 2 Exam Preparation
- Common SLO Implementation Pitfalls
- Real-World SLO Examples and Case Studies
- Frequently Asked Questions
Understanding Domain 2: Service Level Objectives
Domain 2 of the SRE Foundation exam focuses on Service Level Objectives (SLOs) and represents 16% of the exam content, making it one of the most critical areas for certification success. This domain builds upon the fundamental SRE principles covered in SRE Domain 1: SRE Principles and Practices and forms the backbone of reliability engineering practices.
Service Level Objectives represent the quantitative reliability targets that SRE teams use to balance system reliability with the pace of innovation. Understanding SLOs is crucial not only for passing the exam but also for implementing effective SRE practices in real-world environments. As detailed in our comprehensive SRE Exam Domains guide, this domain requires both theoretical knowledge and practical application understanding.
The exam tests your understanding of Service Level Indicators (SLIs), Service Level Objectives (SLOs), Service Level Agreements (SLAs), error budgets, and the practical implementation of these concepts in production environments.
Service Level Indicators (SLIs): Foundation of Measurement
Service Level Indicators (SLIs) are the fundamental metrics that quantify the level of service provided to users. They serve as the building blocks for all reliability measurements and must be carefully selected to reflect the user experience accurately.
Types of SLIs
SLIs typically fall into several categories, each addressing different aspects of service quality:
- Availability SLIs: Measure whether the service is operational and accessible to users
- Latency SLIs: Track response times and processing delays
- Quality SLIs: Assess the correctness and completeness of service responses
- Throughput SLIs: Monitor the service's capacity to handle requests
| SLI Type | Measurement Method | Example Metric | User Impact |
|---|---|---|---|
| Availability | Request success ratio | 99.9% of requests return HTTP 200 | Service accessibility |
| Latency | Response time percentiles | 95th percentile < 200ms | User experience speed |
| Quality | Correct response ratio | 99.99% correct search results | Service reliability |
| Throughput | Requests per second | Handle 10,000 QPS | Service capacity |
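The availability and latency SLIs in the table above can be computed directly from request logs. Below is a minimal sketch; the `Request` record and field names are illustrative, not taken from any particular monitoring system, and the percentile uses the simple nearest-rank method.

```python
from dataclasses import dataclass

# Hypothetical request log record; fields are illustrative only.
@dataclass
class Request:
    status_code: int
    latency_ms: float

def availability_sli(requests):
    """Fraction of requests that returned an HTTP 2xx status."""
    good = sum(1 for r in requests if 200 <= r.status_code < 300)
    return good / len(requests)

def latency_percentile(requests, pct):
    """Latency at the given percentile (nearest-rank method)."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(0, round(pct / 100 * len(latencies)) - 1)
    return latencies[rank]
```

In practice these ratios would be computed by your monitoring stack over sliding windows rather than in application code, but the arithmetic is the same.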
SLI Selection Criteria
Effective SLI selection requires careful consideration of user journeys and business impact. The most important SLIs should directly correlate with user satisfaction and business objectives. When preparing for the exam, focus on understanding how to choose SLIs that are:

- User-centric: Reflect actual user experience rather than internal system metrics
- Measurable: Can be consistently and accurately monitored
- Actionable: Provide clear signals for when intervention is needed
- Proportional: Scale appropriately with system usage and complexity
Avoid selecting SLIs based solely on what's easy to measure. Internal metrics like CPU utilization or memory usage rarely correlate directly with user experience and should not be primary SLIs.
Service Level Objectives (SLOs): Setting Reliability Targets
Service Level Objectives transform SLIs into specific, measurable targets that define acceptable service performance. SLOs represent the contract between service providers and their users, establishing clear expectations for reliability.
SLO Structure and Components
Well-defined SLOs contain several essential elements that exam candidates must understand:
- Metric Definition: The specific SLI being measured
- Target Value: The numerical threshold for acceptable performance
- Time Window: The period over which the objective is evaluated
- Measurement Method: How the metric is calculated and aggregated
For example, a complete SLO might state: "99.9% of HTTP requests will complete successfully within a 30-day rolling window, measured from the load balancer logs."
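The four components above can be captured in a small structure. This is a sketch of one possible representation, not the schema of any specific SLO tool; the field and variable names are illustrative.

```python
from dataclasses import dataclass

# Illustrative SLO record covering the four components: metric definition,
# target value, time window, and measurement method.
@dataclass(frozen=True)
class SLO:
    sli: str              # metric definition
    target: float         # target value, e.g. 0.999 for 99.9%
    window_days: int      # evaluation time window
    measurement: str      # how the metric is calculated

# The example SLO from the text, expressed as data.
http_slo = SLO(
    sli="HTTP request success ratio",
    target=0.999,
    window_days=30,
    measurement="load balancer logs, rolling window",
)

def is_met(slo: SLO, observed_ratio: float) -> bool:
    """Check an observed success ratio against the SLO target."""
    return observed_ratio >= slo.target
```

Expressing SLOs as data rather than prose makes them easy to evaluate automatically and to review alongside code.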
SLO Time Windows
The choice of time window significantly impacts SLO behavior and user experience. Understanding different time window approaches is crucial for exam success:
| Window Type | Calculation Method | Advantages | Disadvantages |
|---|---|---|---|
| Rolling Window | Continuous calculation over fixed period | Smooth, consistent measurement | Complex to implement |
| Calendar Window | Reset at fixed intervals (monthly/quarterly) | Simple to understand and implement | Can hide systematic issues |
| Request-based | Percentage of good requests | Directly reflects user experience | May not account for traffic patterns |
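A rolling window can be sketched as a trailing aggregate over daily good/total request counts; a calendar window would instead reset its counters at each month boundary. The data shape here is an assumption for illustration.

```python
from collections import deque

def rolling_compliance(daily_counts, window_days):
    """Yield the trailing success ratio for each day.

    daily_counts: iterable of (good_requests, total_requests) per day.
    """
    window = deque(maxlen=window_days)  # old days fall off automatically
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        yield good_sum / total_sum
```

Note that the ratio is computed from summed counts, not from averaged daily ratios, so high-traffic days correctly carry more weight.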
Remember that SLOs should be stricter than SLAs to provide an operational buffer. They should also be achievable with current system capabilities while driving meaningful reliability improvements.
Service Level Agreements (SLAs): Business Commitments
Service Level Agreements (SLAs) are contractual commitments that typically include consequences for non-compliance. Unlike SLOs, which are internal targets, SLAs represent external promises with business implications.
SLA vs SLO Relationship
The relationship between SLAs and SLOs is fundamental to SRE practice and frequently tested on the exam. Key principles include:
- SLOs should be stricter than SLAs: Provides operational buffer to avoid SLA violations
- SLAs define consequences: Business penalties or compensation for service failures
- SLOs drive operational behavior: Internal targets that guide engineering decisions
- Multiple SLOs may support one SLA: Internal objectives ensure external commitments are met
This hierarchical relationship ensures that internal teams have early warning systems before external commitments are at risk. Understanding this relationship is crucial for questions about balancing reliability investments and business requirements.
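The early-warning effect of this hierarchy can be sketched as a simple check. The specific targets below are examples, not recommendations: the internal SLO (99.95%) is deliberately stricter than the external SLA (99.9%).

```python
# Example targets only: the internal SLO is stricter than the external SLA,
# so an SLO breach warns the team before the contractual SLA is at risk.
SLA_TARGET = 0.999    # external contractual commitment
SLO_TARGET = 0.9995   # stricter internal target

def reliability_status(observed: float) -> str:
    """Classify an observed success ratio against both targets."""
    if observed < SLA_TARGET:
        return "SLA violated"
    if observed < SLO_TARGET:
        return "SLO breached: early warning, SLA still intact"
    return "healthy"
```

The middle state is the whole point of the buffer: it gives engineering time to respond before any contractual penalty applies.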
Error Budgets: Managing Risk and Innovation
Error budgets represent one of the most innovative concepts in SRE and are heavily emphasized in the exam. They quantify the acceptable level of unreliability and provide a framework for balancing reliability with feature velocity.
Error Budget Calculation
Error budgets are derived directly from SLOs and represent the allowed failure rate. For example:
- 99.9% availability SLO = 0.1% error budget
- Over 30 days: 43.2 minutes of downtime allowed
- Over 1 million requests: 1,000 failed requests allowed
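The arithmetic above is simple enough to encode directly. This sketch reproduces the figures in the list (43.2 minutes, 1,000 failed requests); the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime in minutes for a time-based availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60

def error_budget_requests(slo_target: float, total_requests: int) -> int:
    """Allowed failed requests for a request-based SLO."""
    return int((1 - slo_target) * total_requests)
```

For a 99.9% SLO these give 43.2 minutes over 30 days and 1,000 failures per million requests, matching the figures above.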
Error Budget Policy
Error budget policies define organizational responses to budget consumption and are critical for exam understanding. These policies typically specify:
- Escalation thresholds: When to alert different stakeholders
- Response procedures: Actions required at different budget levels
- Feature release gates: When to halt new deployments
- Recovery procedures: How to restore reliability when budgets are exhausted
Error budgets provide objective criteria for decision-making, removing emotional arguments about reliability versus feature delivery. When budgets are healthy, teams can take more risks. When exhausted, teams must focus on reliability.
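An error budget policy of the kind described above can be sketched as a table of escalation thresholds. The thresholds and actions here are illustrative examples of what such a policy might contain, not a prescribed standard.

```python
# Illustrative policy: (fraction of budget consumed, required action).
POLICY = [
    (0.50, "notify service owners"),
    (0.75, "prioritize reliability work in sprint planning"),
    (0.90, "page on-call and review recent changes"),
    (1.00, "freeze feature releases until the budget recovers"),
]

def policy_actions(budget_consumed: float):
    """Return every action triggered at this level of budget consumption."""
    return [action for threshold, action in POLICY if budget_consumed >= threshold]
```

Encoding the policy as data keeps the escalation criteria objective and reviewable, which is exactly the decision-making benefit the text describes.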
SLO Implementation Strategies
Successful SLO implementation requires careful planning and gradual rollout. The exam tests understanding of practical implementation challenges and solutions, making this knowledge crucial for certification success.
Implementation Phases
Effective SLO implementation typically follows a structured approach:
- Discovery Phase: Identify critical user journeys and pain points
- Measurement Phase: Implement monitoring and establish baseline SLIs
- Target Setting Phase: Define achievable but meaningful SLOs
- Policy Development Phase: Create error budget policies and response procedures
- Integration Phase: Incorporate SLOs into development and operational processes
Understanding this progression is important for exam questions about implementing SRE practices in organizations. As covered in our exam difficulty guide, these implementation scenarios frequently appear in practice questions.
Organizational Alignment
SLO implementation success depends heavily on organizational buy-in and alignment. Key factors include:
- Stakeholder engagement: Involving product, engineering, and business teams
- Clear communication: Explaining SLO benefits and trade-offs
- Gradual adoption: Starting with pilot services before full rollout
- Regular review: Periodic assessment and refinement of targets
Domain 2 Exam Preparation
Successfully preparing for Domain 2 requires both conceptual understanding and practical application knowledge. The open-book format means you should focus on understanding concepts rather than memorizing formulas, as detailed in our practice test platform.
Key Study Areas
Focus your preparation on these high-yield topics:
- SLI selection criteria and best practices
- SLO target setting methodologies
- Error budget calculation and policy development
- Time window selection and trade-offs
- Implementation strategies and organizational challenges
Domain 2 questions often present scenarios requiring you to choose appropriate SLIs or evaluate SLO implementations. Practice identifying user-centric metrics and understanding the business impact of different reliability targets.
Practice Question Types
Expect several types of questions in Domain 2:
- Scenario-based SLI selection
- Error budget calculations
- SLO vs SLA differentiation
- Implementation strategy evaluation
- Time window trade-off analysis
Our comprehensive practice questions guide provides detailed examples of these question types and explanation strategies.
Common SLO Implementation Pitfalls
Understanding common mistakes helps both in exam preparation and real-world implementation. These pitfalls frequently appear in exam scenarios where candidates must identify problematic approaches.
Technical Pitfalls
- Vanity metrics: Choosing SLIs that look good but don't reflect user experience
- Over-specification: Setting too many SLOs, diluting focus and impact
- Unrealistic targets: Setting SLOs that are unachievable with current architecture
- Measurement gaps: Failing to account for client-side or end-to-end experience
Organizational Pitfalls
- Lack of stakeholder alignment: Implementing SLOs without business buy-in
- Insufficient automation: Manual processes that can't scale with system complexity
- Poor communication: Failing to explain SLO benefits and trade-offs
- Rigid policies: Error budget policies that don't account for business context
Real-World SLO Examples and Case Studies
Practical examples help solidify conceptual understanding and prepare you for scenario-based exam questions. These examples demonstrate how theoretical concepts apply in production environments.
E-commerce Platform SLOs
Consider an e-commerce platform with the following SLO structure:
| User Journey | SLI | SLO Target | Time Window |
|---|---|---|---|
| Product Search | Search request latency | 95th percentile < 500ms | 30-day rolling |
| Checkout Process | Transaction success rate | 99.95% successful | Calendar month |
| Page Loading | Page availability | 99.9% of requests successful | Weekly rolling |
This example demonstrates how different user journeys require different SLI types and targets based on business impact and user expectations.
API Service SLOs
API services often require different SLO approaches due to their programmatic nature:
- Availability: 99.95% of API calls return successful responses
- Latency: 90th percentile response time under 100ms
- Throughput: Handle 50,000 requests per second during peak hours
- Quality: 99.99% of responses contain valid, complete data
These examples help illustrate the principles tested in Domain 2 and provide context for understanding implementation decisions.
SLO knowledge directly supports understanding of monitoring and observability concepts and provides the foundation for learning from failure practices covered in other exam domains.
Frequently Asked Questions
What is the difference between SLIs, SLOs, and SLAs?
SLIs are metrics that measure service performance, SLOs are internal targets based on those metrics, and SLAs are external contractual commitments. SLOs should be stricter than SLAs to provide operational buffer.
How is an error budget calculated?
Error budget equals 100% minus the SLO percentage. For a 99.9% availability SLO, the error budget is 0.1%, which translates to specific downtime allowances based on the time window (43.2 minutes per month for a 30-day window).
Should I use rolling or calendar time windows?
Rolling windows provide more consistent measurement but are complex to implement. Calendar windows are simpler but can hide systematic issues. Choose based on your operational capabilities and business requirements.
How many SLOs should a team start with?
Start with 2-3 SLOs covering the most critical user journeys. Too many SLOs dilute focus and make it difficult to prioritize reliability work. You can add more as your SRE practice matures.
What should an error budget policy include?
Error budget policies should define specific responses, typically including halting feature releases, focusing engineering effort on reliability improvements, and escalating to appropriate stakeholders until service reliability is restored.
Ready to Start Practicing?
Master Domain 2 concepts with our comprehensive practice tests. Our questions mirror the actual exam format and include detailed explanations for every answer. Start practicing today and build the confidence you need to pass on your first attempt.
Start Free Practice Test