Free SRE Practice Questions

Question 1

The VALET model is used to categorize key SLO metrics. A colleague preparing for the SRE Foundation exam tells you that the 'T' in VALET stands for 'Traffic.' This statement is:

Accepted Answer

Incorrect - the T stands for Tickets, which measures the rate of manual intervention required; traffic demand is covered by V for Volume

Answer

Correct - the T in VALET represents Traffic, measuring the volume of requests hitting the service over a given period

Answer

Partially correct - the T originally represented Traffic in v1.0 but was expanded to include both Traffic and Tickets in later versions

Answer

Correct - the five VALET components are Volume, Availability, Latency, Errors, and Traffic as defined in the official syllabus

Question 2

A service has an SLO of 99.9% monthly availability. During the first week of the month, a deployment failure causes 30 minutes of unplanned downtime. What percentage of the monthly error budget has been consumed?

Accepted Answer

Approximately 70% - since the total monthly error budget at 99.9% is roughly 43 minutes of allowed downtime (0.1% of a 30-day month), and 30 of those 43 minutes have been used

Answer

Approximately 30% - since 30 minutes is roughly one-third of the total monthly error budget allowance

Answer

Approximately 50% - since the month is divided equally between uptime allowance and total budget capacity

Answer

Approximately 100% - since 30 minutes of downtime at this SLO level fully exhausts the available budget
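The arithmetic behind Question 2 can be checked with a short sketch. It assumes a 30-day month (the question does not state the month length), and the function names are illustrative rather than taken from any SRE tool.

```python
def error_budget_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    Assumes the SLO is measured over one calendar month of `days_in_month` days.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_consumed_percent(
    downtime_minutes: float, slo_percent: float, days_in_month: int = 30
) -> float:
    """Return the share of the monthly error budget used by the downtime."""
    budget = error_budget_minutes(slo_percent, days_in_month)
    return 100.0 * downtime_minutes / budget


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)
    consumed = budget_consumed_percent(30, 99.9)
    print(f"Monthly error budget: {budget:.1f} minutes")
    print(f"Budget consumed by 30 minutes of downtime: {consumed:.1f}%")
```

With a 31-day month the budget grows slightly (about 44.6 minutes), so the consumed share drops a little, but it still rounds to roughly 70%.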

Question 3

An SRE spends three hours manually experimenting with different monitoring configurations to find the optimal alerting setup for a new service. No standard procedure exists for this task. According to SRE principles, this activity is classified as:

Accepted Answer

Not toil - experimentation to solve a problem is explicitly excluded from the definition of toil, even when it is manual and time-consuming

Answer

Toil - it is manual work performed by a human that consumes engineering time and should be automated as soon as possible

Answer

Toil - any unplanned operational activity that takes more than one hour automatically qualifies as toil under SRE definitions

Answer

Not toil - but only because the SRE chose to perform the task voluntarily rather than being formally assigned the work by management

Question 4

A user reports that a web page takes 2 seconds to fully load. The team's analysis shows that network communication delay is 200ms, server processing takes 1,500ms, and data transfer accounts for 300ms. Using the Catchpoint SRE Survey definition, the latency of this request is:

Accepted Answer

200ms - per the Catchpoint definition, latency is specifically the delay incurred in communicating a message, distinct from total response time

Answer

2,000ms - latency encompasses the entire duration from the moment a user sends a request until they receive a complete response

Answer

1,500ms - latency refers specifically to the server-side processing time, which is the largest single component of the total request duration

Answer

300ms - latency measures the data transfer portion of the request, being the time spent transmitting the response payload back to the client
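The distinction in Question 4 between latency and total response time can be sketched as follows. This assumes the definition exactly as the question states it (latency is the communication delay only); the `RequestTiming` class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Timing breakdown of a single request, in milliseconds."""

    network_delay_ms: int
    server_processing_ms: int
    data_transfer_ms: int

    @property
    def response_time_ms(self) -> int:
        # Total time the user waits: all three components added together.
        return (
            self.network_delay_ms
            + self.server_processing_ms
            + self.data_transfer_ms
        )

    @property
    def latency_ms(self) -> int:
        # Under the definition used in the question, latency is only the
        # delay incurred in communicating the message.
        return self.network_delay_ms


if __name__ == "__main__":
    timing = RequestTiming(
        network_delay_ms=200, server_processing_ms=1500, data_transfer_ms=300
    )
    print(f"Response time: {timing.response_time_ms} ms")
    print(f"Latency: {timing.latency_ms} ms")
```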

Question 5

An SRE team has comprehensive monitoring dashboards that detect when metrics breach predefined thresholds. However, when a novel issue causes intermittent slowdowns that stay within alert thresholds, the team struggles to diagnose the root cause. What capability are they missing?

Accepted Answer

Full observability - the ability to explore metrics, logs, and traces to understand WHY the system behaves unexpectedly, not just detect WHEN

Answer

More sensitive threshold-based alerts with significantly lower trigger values that catch smaller deviations from the expected baseline performance

Answer

Additional infrastructure monitoring covering CPU utilization, memory consumption, and disk I/O across all production servers in the environment

Answer

A dedicated incident commander role to coordinate cross-team diagnosis efforts and ensure systematic investigation of all production anomalies

Question 6

An SRE team deploys a new version of their payment service to 5% of production traffic using a canary strategy. Within 10 minutes, monitoring detects that the error rate in the canary group has doubled compared to the baseline. The automated pipeline immediately reverts the canary to the previous version. The 95% of users on the stable version experienced no impact. This outcome demonstrates:

Accepted Answer

A successful canary deployment with automated rollback - the system detected the problem early and protected the majority of users as designed

Answer

A failed deployment process - the new version contained errors that should have been caught during pre-production testing before release

Answer

A monitoring false positive - a doubling of the error rate within a small 5% traffic sample is likely statistical noise rather than a real issue

Answer

An over-engineered pipeline - the automated rollback triggered too aggressively and prevented the team from investigating the issue further
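The rollback decision described in Question 6 can be sketched as a simple comparison of the canary error rate against the baseline. This is a minimal illustration, not the logic of any particular deployment tool: the function name, the `max_ratio` parameter, and the default threshold of 2.0 (a doubling, as in the scenario) are assumptions.

```python
def should_roll_back(
    canary_error_rate: float,
    baseline_error_rate: float,
    max_ratio: float = 2.0,
) -> bool:
    """Return True when the canary's error rate is too high versus baseline.

    `max_ratio` is a hypothetical threshold: 2.0 means "roll back once the
    canary error rate reaches double the baseline".
    """
    if baseline_error_rate == 0:
        # With an error-free baseline, any canary errors count as a regression.
        return canary_error_rate > 0
    return canary_error_rate >= max_ratio * baseline_error_rate


if __name__ == "__main__":
    # Scenario from the question: the canary error rate has doubled.
    print(should_roll_back(canary_error_rate=0.02, baseline_error_rate=0.01))
    # A small increase stays below the threshold, so the canary keeps running.
    print(should_roll_back(canary_error_rate=0.011, baseline_error_rate=0.01))
```

A real pipeline would usually also require a minimum number of requests in the canary group before comparing rates, so that a handful of errors in a small sample does not trigger a rollback on its own.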

Question 7

After a production incident, a team member reports the issue to management. A formal investigation is launched to determine which documented policy was violated and which team member failed to follow the established procedure. The investigation concludes with updated compliance documentation. According to the Westrum organizational culture model, this response is characteristic of which culture type?

Accepted Answer

Bureaucratic - the focus on policy violations, formal procedures, and seeking justice through established process is characteristic of rule-oriented culture

Answer

Pathological - the investigation is designed to identify and punish the individual who deviated from expectations, using fear as a control mechanism

Answer

Generative - the organization is responding constructively to the incident by updating its documentation and strengthening operational procedures

Answer

Progressive - the formal investigation and documentation updates show that the organization takes production incidents seriously and values accountability

Question 8

An SRE team of six engineers supports a critical e-commerce platform. Each engineer spends approximately 50% of their time on on-call duties, including responding to pages, performing manual remediation, and handling escalations. According to Google's SRE best practices, this on-call allocation indicates:

Accepted Answer

Excessive on-call burden - Google recommends no more than 25% of SRE time on on-call duties; 50% is double the target and risks burnout

Answer

A well-balanced rotation - 50% on-call ensures strong production coverage and demonstrates the team's commitment to service reliability goals

Answer

Optimal staffing - the team has exactly the right number of engineers for their current workload and on-call duties are distributed fairly

Answer

Healthy commitment - spending half of available time on-call shows the team prioritizes production stability over other discretionary engineering work

Question 9

After a major service outage, the SRE team conducts a blameless post-mortem. Engineers who contributed to the incident provide detailed accounts of their actions and decisions without fear of punishment. What is the PRIMARY output that this process should produce?

Accepted Answer

A list of follow-up actions to mitigate future similar incidents, with owners and deadlines assigned for each remediation item

Answer

A detailed assessment identifying which team members' actions contributed most significantly to the severity and duration of the incident

Answer

An updated Service Level Agreement that reflects the reduced reliability the service demonstrated during the incident and recovery period

Answer

A revised error budget policy with stricter enforcement thresholds to prevent the team from deploying aggressively until the next review

Question 10

A cloud database provider wants to improve the resilience of their managed PostgreSQL and MySQL offerings. They plan to apply SRE reliability engineering principles specifically to their database systems, including running controlled failure experiments to test database cluster failover and replication under stress. Which SRE spinoff role BEST describes this specialized function?

Accepted Answer

Database Reliability Engineer (DBRE) - responsible for keeping database systems running smoothly, with a distinguishing focus on applying chaos engineering to those systems

Answer

Network Reliability Engineer (NRE) - applies a reliability engineering approach to systematically measure, test, and automate reliability of network systems

Answer

Customer Reliability Engineer (CRE) - takes SRE principles and applies them towards customers, bridging internal reliability practices and user experience

Answer

Heritage Reliability Engineer (HRE) - applies SRE principles and practices to legacy applications and older technology environments across the organization

Domain 2: Service Level Objectives (16% of exam)

Question 1

Question 2

Domain 3: Toil and Automation (12% of exam)

Question 3

Domain 4: Monitoring and Observability (12% of exam)

Question 4

Question 5

Domain 5: Release Engineering and Change Management (12% of exam)

Question 6

Domain 6: Anti-Fragility and Learning from Failure (16% of exam)

Question 7

Domain 7: Organizational Impact of SRE (12% of exam)

Question 8

Question 9

More SRE practice questions

Question 10
