
Free SRE Practice Questions

10 free, exam-style Site Reliability Engineering Foundation (SRE) practice questions with answers and explanations. No signup required. Work through them below, then take the full free SRE practice test to study every exam domain.

Question 1

The VALET model is used to categorize key SLO metrics. A colleague preparing for the SRE Foundation exam tells you that the 'T' in VALET stands for 'Traffic.' This statement is:

  A. Correct - the T in VALET represents Traffic, measuring the volume of requests hitting the service over a given period
  B. Incorrect - the T stands for Tickets, which measures the rate of manual intervention required; traffic demand is covered by V for Volume
  C. Partially correct - the T originally represented Traffic in v1.0 but was expanded to include both Traffic and Tickets in later versions
  D. Correct - the five VALET components are Volume, Availability, Latency, Errors, and Traffic as defined in the official syllabus

Correct answer: B - Incorrect - the T stands for Tickets, which measures the rate of manual intervention required; traffic demand is covered by V for Volume

Question 2

A service has an SLO of 99.9% monthly availability. During the first week of the month, a deployment failure causes 30 minutes of unplanned downtime. What percentage of the monthly error budget has been consumed?

  A. Approximately 30% - since 30 minutes is roughly one-third of the total monthly error budget allowance
  B. Approximately 50% - since the month is divided equally between uptime allowance and total budget capacity
  C. Approximately 70% - since the total monthly error budget at 99.9% is roughly 43 minutes of allowed downtime
  D. Approximately 100% - since 30 minutes of downtime at this SLO level fully exhausts the available budget

Correct answer: C - Approximately 70% - since the total monthly error budget at 99.9% is roughly 43 minutes of allowed downtime
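
The arithmetic behind this answer can be checked with a short sketch. It assumes a 30-day month (the question does not state the month length), and the function names are illustrative only.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO.

    Assumption: the month has `days` days (30 by default).
    """
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo_percent / 100)


def budget_consumed_percent(downtime_minutes: float,
                            slo_percent: float,
                            days: int = 30) -> float:
    """Share of the error budget used up by the given downtime."""
    return 100 * downtime_minutes / error_budget_minutes(slo_percent, days)


budget = error_budget_minutes(99.9)            # about 43.2 minutes
consumed = budget_consumed_percent(30, 99.9)   # about 69.4 percent
print(f"budget: {budget:.1f} min, consumed: {consumed:.1f}%")
```

With these assumptions the budget is about 43.2 minutes and the 30-minute outage uses about 69% of it, which the question rounds to 70%.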

Question 3

An SRE spends three hours manually experimenting with different monitoring configurations to find the optimal alerting setup for a new service. No standard procedure exists for this task. According to SRE principles, this activity is classified as:

  A. Not toil - experimentation to solve a problem is explicitly excluded from the definition of toil, even when it is manual and time-consuming
  B. Toil - it is manual work performed by a human that consumes engineering time and should be automated as soon as possible
  C. Toil - any unplanned operational activity that takes more than one hour automatically qualifies as toil under SRE definitions
  D. Not toil - but only because the SRE chose to perform the task voluntarily rather than being formally assigned the work by management

Correct answer: A - Not toil - experimentation to solve a problem is explicitly excluded from the definition of toil, even when it is manual and time-consuming

Question 4

A user reports that a web page takes 2 seconds to fully load. The team's analysis shows that network communication delay is 200ms, server processing takes 1,500ms, and data transfer accounts for 300ms. Using the Catchpoint SRE Survey definition, the latency of this request is:

  A. 2,000ms - latency encompasses the entire duration from the moment a user sends a request until they receive a complete response
  B. 1,500ms - latency refers specifically to the server-side processing time, which is the largest single component of the total request duration
  C. 300ms - latency measures the data transfer portion of the request, being the time spent transmitting the response payload back to the client
  D. 200ms - per the Catchpoint definition, latency is specifically the delay incurred in communicating a message, distinct from total response time

Correct answer: D - 200ms - per the Catchpoint definition, latency is specifically the delay incurred in communicating a message, distinct from total response time
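
The distinction between latency and total response time in this question can be sketched as follows. The class and field names are hypothetical, and the mapping of "latency" to the network communication delay follows the definition as stated in the answer above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Timing breakdown for one request, in milliseconds (illustrative names)."""
    network_delay_ms: int
    server_processing_ms: int
    data_transfer_ms: int

    @property
    def latency_ms(self) -> int:
        # Per the definition used in the question: only the communication delay.
        return self.network_delay_ms

    @property
    def response_time_ms(self) -> int:
        # What the user experiences: the sum of all three components.
        return (self.network_delay_ms
                + self.server_processing_ms
                + self.data_transfer_ms)


timing = RequestTiming(network_delay_ms=200,
                       server_processing_ms=1500,
                       data_transfer_ms=300)
print(timing.latency_ms, timing.response_time_ms)
```

The three components sum to the 2-second page load the user reported, while only the 200ms communication delay counts as latency under this definition.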

Question 5

An SRE team has comprehensive monitoring dashboards that detect when metrics breach predefined thresholds. However, when a novel issue causes intermittent slowdowns that stay within alert thresholds, the team struggles to diagnose the root cause. What capability are they missing?

  A. More sensitive threshold-based alerts with significantly lower trigger values that catch smaller deviations from the expected baseline performance
  B. Full observability - the ability to explore metrics, logs, and traces to understand WHY the system behaves unexpectedly, not just detect WHEN
  C. Additional infrastructure monitoring covering CPU utilization, memory consumption, and disk I/O across all production servers in the environment
  D. A dedicated incident commander role to coordinate cross-team diagnosis efforts and ensure systematic investigation of all production anomalies

Correct answer: B - Full observability - the ability to explore metrics, logs, and traces to understand WHY the system behaves unexpectedly, not just detect WHEN

Question 6

An SRE team deploys a new version of their payment service to 5% of production traffic using a canary strategy. Within 10 minutes, monitoring detects that the error rate in the canary group has doubled compared to the baseline. The automated pipeline immediately reverts the canary to the previous version. The 95% of users on the stable version experienced no impact. This outcome demonstrates:

  A. A successful canary deployment with automated rollback - the system detected the problem early and protected the majority of users as designed
  B. A failed deployment process - the new version contained errors that should have been caught during pre-production testing before release
  C. A monitoring false positive - a doubling of the error rate within a small 5% traffic sample is likely statistical noise rather than a real issue
  D. An over-engineered pipeline - the automated rollback triggered too aggressively and prevented the team from investigating the issue further

Correct answer: A - A successful canary deployment with automated rollback - the system detected the problem early and protected the majority of users as designed
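
A minimal sketch of the kind of check an automated pipeline might run in this scenario is shown below. The function name, the 2x ratio threshold, and the example error rates are assumptions made for illustration; the question only says that the canary error rate doubled and that the pipeline reverted.

```python
def should_roll_back(canary_error_rate: float,
                     baseline_error_rate: float,
                     max_ratio: float = 2.0) -> bool:
    """Decide whether to revert the canary (hypothetical policy).

    Assumption: roll back once the canary error rate reaches `max_ratio`
    times the baseline error rate.
    """
    if baseline_error_rate == 0:
        # With a clean baseline, any canary error is treated as a regression.
        return canary_error_rate > 0
    return canary_error_rate / baseline_error_rate >= max_ratio


# Example rates (made up): baseline at 1% errors, canary at 2%.
print(should_roll_back(canary_error_rate=0.02, baseline_error_rate=0.01))
```

Comparing against the baseline rather than a fixed threshold keeps the check meaningful for services with different normal error rates. A real pipeline would usually also require a minimum sample size before acting.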

Question 7

After a production incident, a team member reports the issue to management. A formal investigation is launched to determine which documented policy was violated and which team member failed to follow the established procedure. The investigation concludes with updated compliance documentation. According to the Westrum organizational culture model, this response is characteristic of which culture type?

  A. Pathological - the investigation is designed to identify and punish the individual who deviated from expectations, using fear as a control mechanism
  B. Generative - the organization is responding constructively to the incident by updating its documentation and strengthening operational procedures
  C. Bureaucratic - the focus on policy violations, formal procedures, and seeking justice through established process is characteristic of rule-oriented culture
  D. Progressive - the formal investigation and documentation updates show that the organization takes production incidents seriously and values accountability

Correct answer: C - Bureaucratic - the focus on policy violations, formal procedures, and seeking justice through established process is characteristic of rule-oriented culture

Question 8

An SRE team of six engineers supports a critical e-commerce platform. Each engineer spends approximately 50% of their time on on-call duties, including responding to pages, performing manual remediation, and handling escalations. According to Google's SRE best practices, this on-call allocation indicates:

  A. A well-balanced rotation - 50% on-call ensures strong production coverage and demonstrates the team's commitment to service reliability goals
  B. Optimal staffing - the team has exactly the right number of engineers for their current workload and on-call duties are distributed fairly
  C. Healthy commitment - spending half of available time on-call shows the team prioritizes production stability over other discretionary engineering work
  D. Excessive on-call burden - Google recommends no more than 25% of SRE time on on-call duties; 50% is double the target and risks burnout

Correct answer: D - Excessive on-call burden - Google recommends no more than 25% of SRE time on on-call duties; 50% is double the target and risks burnout
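
The comparison in this answer is simple enough to express directly. The sketch below takes the 25% cap from the answer text; the constant and function names are illustrative.

```python
ONCALL_CAP = 0.25  # cap on on-call time, as stated in the answer above


def oncall_load_factor(oncall_fraction: float, cap: float = ONCALL_CAP) -> float:
    """How many times over (or under) the cap the on-call share is."""
    return oncall_fraction / cap


def is_excessive(oncall_fraction: float, cap: float = ONCALL_CAP) -> bool:
    """True when the on-call share exceeds the cap."""
    return oncall_fraction > cap


team_oncall = 0.50  # each engineer spends about half their time on-call
print(oncall_load_factor(team_oncall), is_excessive(team_oncall))
```

For the team in the question the load factor is 2.0, matching the "double the target" wording in the answer.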

Question 9

After a major service outage, the SRE team conducts a blameless post-mortem. Engineers who contributed to the incident provide detailed accounts of their actions and decisions without fear of punishment. What is the PRIMARY output that this process should produce?

  A. A detailed assessment identifying which team members' actions contributed most significantly to the severity and duration of the incident
  B. A list of follow-up actions to mitigate future similar incidents, with owners and deadlines assigned for each remediation item
  C. An updated Service Level Agreement that reflects the reduced reliability the service demonstrated during the incident and recovery period
  D. A revised error budget policy with stricter enforcement thresholds to prevent the team from deploying aggressively until the next review

Correct answer: B - A list of follow-up actions to mitigate future similar incidents, with owners and deadlines assigned for each remediation item

Question 10

A cloud database provider wants to improve the resilience of their managed PostgreSQL and MySQL offerings. They plan to apply SRE reliability engineering principles specifically to their database systems, including running controlled failure experiments to test database cluster failover and replication under stress. Which SRE spinoff role BEST describes this specialized function?

  A. Database Reliability Engineer (DBRE) - responsible for keeping database systems running smoothly, with a distinguishing focus on applying chaos engineering
  B. Network Reliability Engineer (NRE) - applies a reliability engineering approach to systematically measure, test, and automate reliability of network systems
  C. Customer Reliability Engineer (CRE) - takes SRE principles and applies them towards customers, bridging internal reliability practices and user experience
  D. Heritage Reliability Engineer (HRE) - applies SRE principles and practices to legacy applications and older technology environments across the organization

Correct answer: A - Database Reliability Engineer (DBRE) - responsible for keeping database systems running smoothly, with a distinguishing focus on applying chaos engineering
