Your Error Contract Is a Ticking Time Bomb: Why Microservices Break at 3 AM

TL;DR: Your integration test hits the happy path: 200 OK, correct response body, all good. But what happens when the downstream returns a 404? A 500? A 503 with retry-after? If your service parses the error response body, and the body format changes silently, you will find out at 3 AM when the retry storm starts. Error contracts are just as important as success contracts, and almost nobody tests them.
The Incident Pattern
Here is a story that plays out every few weeks in microservices teams:
- Monday. The payments team updates their error handling. Instead of returning {"code": "INSUFFICIENT_FUNDS", "message": "..."}, the service now returns {"error": "insufficient_funds", "description": "..."}. The change looks clean. Their tests pass.
- Tuesday. The orders service deploys a routine update. No changes to payment integration code.
- Wednesday, 3 AM. Monitoring alerts fire. The orders service is stuck in a retry loop. Every payment failure causes an HttpMessageNotReadableException because the orders service expects code but receives error. The error handler catches the exception, logs "payment service unavailable", and retries. Infinitely.
- Root cause. The orders service has 47 happy-path integration tests for the payment flow. Zero tests for the error response format.
This is not a hypothetical. This is the most common pattern behind 3 AM incidents in microservices architectures.
Why Error Paths Are Undertested
Engineering teams have a structural bias toward happy-path testing:
Happy path is easy to test. You send a valid request, you get a valid response, you assert on the fields. The test is short and clear.
Error path requires simulation. To test a 404 from the downstream, you need to set up the mock to return an error. To test a 503 with retry-after, you need to configure the mock with specific headers. To test a timeout, you need to simulate network delay. Each scenario is more work than the happy path.
Error paths are "edge cases" in developers' minds. The happy path is the "normal" behavior. Errors are exceptions. The mental model says: "if the happy path works, errors will be handled by the catch block." But the catch block has assumptions about the error format that nobody verified.
Contract tests skip errors. Pact and Spring Cloud Contract focus on defining expected behavior. Negative scenarios require explicit examples, which teams often postpone. The contract says "when I send a valid request, I get this response." It rarely says "when the resource does not exist, I get this exact error body."
The result: a service with 95% test coverage on the happy path and 0% coverage on the error response format.
Five Error Contract Regressions That Break Production
1. Structured JSON Becomes Plain Text
Before:
HTTP/1.1 404 Not Found
Content-Type: application/json
{
"code": "CUSTOMER_NOT_FOUND",
"message": "Customer 42 not found",
"correlationId": "abc-123"
}
After:
HTTP/1.1 404 Not Found
Content-Type: text/html
<html><body><h1>404 Not Found</h1></body></html>
What breaks: The upstream calls objectMapper.readValue(response.body(), ErrorResponse.class). It receives HTML, and Jackson throws a parsing exception (surfaced as HttpMessageNotReadableException when the deserialization happens inside Spring's message converters). The error handler does not expect a parsing failure at this point and throws an unhandled exception, which returns 500 to the caller, which retries.
Why it happens: A reverse proxy, gateway, or Spring Boot error page configuration intercepted the response before the application's @ExceptionHandler could format it. This is common after infrastructure changes: gateway updates, Spring Security reconfiguration, Kubernetes ingress changes.
2. Error Field Names Change
Before:
{
"code": "INSUFFICIENT_FUNDS",
"message": "Account balance too low"
}
After:
{
"error": "insufficient_funds",
"description": "Account balance too low"
}
What breaks: The upstream reads errorResponse.getCode() to decide whether to retry. The code field is null (Spring Boot's auto-configured ObjectMapper ignores unknown fields, so deserialization quietly succeeds with nothing mapped). The retry logic checks if (code == null) retry() because it assumes a null code means a transient error. Infinite retry loop.
Why it happens: The downstream team adopted a new error response standard (RFC 7807, a shared library convention, or just a code style preference). They updated all their error handlers. Their tests pass because they test the new format. Nobody told the upstream team.
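The retry bug above comes from treating an unrecognized error shape as retryable. A safer default is the opposite: only retry when the parsed code is explicitly known to be transient. A minimal sketch of that policy (class name, method name, and the set of transient codes are illustrative, not from any library):

```java
import java.util.Set;

public class RetryPolicy {
    // Error codes we know to be transient. Anything else, including null
    // (which is what deserialization produces when the field name changed),
    // is treated as non-retryable.
    private static final Set<String> TRANSIENT_CODES =
            Set.of("TEMPORARILY_UNAVAILABLE", "RATE_LIMITED", "TIMEOUT");

    public static boolean shouldRetry(String errorCode) {
        // null means "we could not find the field we expected" -- that is
        // a contract problem, not a transient error. Fail fast instead of
        // looping forever.
        return errorCode != null && TRANSIENT_CODES.contains(errorCode);
    }
}
```

The allow-list inverts the failure mode: a silent contract change now produces a fast, visible failure instead of an infinite retry loop.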
3. Status Code Changes Silently
Before:
HTTP/1.1 404 Not Found
{"code": "NOT_FOUND", "message": "..."}
After:
HTTP/1.1 400 Bad Request
{"code": "VALIDATION_ERROR", "message": "..."}
What breaks: The upstream has retry logic based on status codes:
if (status >= 500) {
retry(); // transient error
} else if (status == 404) {
return Optional.empty(); // resource not found, handle gracefully
} else {
throw new PaymentException("Unexpected error");
}
A 404 was handled gracefully. A 400 hits the else branch and throws. The caller receives a 500. The user sees "Something went wrong."
Why it happens: The downstream refactored validation logic. What used to be "resource not found" is now "invalid request" because the ID format validation runs before the lookup. The behavior is technically correct from the downstream's perspective.
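One way to soften this failure is to make the status-code branch exhaustive on purpose: classify every range explicitly, so an unexpected 4xx becomes a terminal, well-described outcome rather than an anonymous exception. A sketch (the Outcome names are illustrative):

```java
public class StatusClassifier {
    public enum Outcome { RETRY, NOT_FOUND, CLIENT_ERROR, UNEXPECTED }

    public static Outcome classify(int status) {
        if (status >= 500) return Outcome.RETRY;        // transient: retry with backoff
        if (status == 404) return Outcome.NOT_FOUND;    // handle gracefully upstream
        if (status >= 400) return Outcome.CLIENT_ERROR; // terminal: do not retry,
                                                        // surface the downstream message
        return Outcome.UNEXPECTED;                      // e.g. a 3xx nobody planned for
    }
}
```

A 400 that used to be a 404 still changes behavior, but it now lands in a named branch with a clear log line, not in a generic else that throws.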
4. Error Body Becomes Empty
Before:
HTTP/1.1 422 Unprocessable Entity
{
"code": "VALIDATION_FAILED",
"violations": [
{"field": "amount", "reason": "must be positive"},
{"field": "currency", "reason": "unsupported"}
]
}
After:
HTTP/1.1 422 Unprocessable Entity
(empty body)
What breaks: The upstream parses violations to build a user-facing error message. response.body() now returns an empty string, and objectMapper.readValue("", ErrorResponse.class) throws MismatchedInputException before the violations list is ever read.
Why it happens: The downstream added @Valid on a controller parameter, and Spring's default MethodArgumentNotValidException handler returns 422 with an empty body (or with Spring Boot's default error format, which has different field names than the custom error handler).
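This crash is avoidable with a guard that checks for an empty body before parsing and falls back to a generic message. A minimal sketch using only a blank-check (the method and message wording are illustrative; real code would hand non-empty bodies to Jackson inside a try/catch with the same fallback):

```java
public class ViolationParser {
    // Returns a user-facing message even when the downstream sends a 422
    // with no body at all (e.g. Spring's default @Valid handling).
    public static String describeFailure(int status, String body) {
        if (body == null || body.isBlank()) {
            return "Request rejected by downstream (HTTP " + status
                    + "), no details provided";
        }
        // In real code: objectMapper.readValue(body, ErrorResponse.class)
        // in a try/catch, falling back to the generic message on failure.
        return "Validation failed: " + body;
    }
}
```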
5. Retryable Becomes Non-Retryable (or Vice Versa)
Before:
HTTP/1.1 503 Service Unavailable
Retry-After: 30
{"code": "TEMPORARILY_UNAVAILABLE", "retryable": true}
After:
HTTP/1.1 500 Internal Server Error
{"code": "INTERNAL_ERROR"}
What breaks: The upstream checks retryable: true before retrying. After the change, the field is absent. The upstream treats the error as non-retryable and immediately returns a failure to the user. What was a 30-second delay is now an instant failure.
The reverse is worse: A non-retryable error (e.g., INSUFFICIENT_FUNDS) becomes retryable. The upstream retries the payment 5 times. The customer is charged 5 times.
Why it happens: The downstream team simplified their error model. They removed the retryable field because "the status code should be enough." Technically correct. Practically, every upstream client was relying on that field.
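If the upstream must survive the retryable flag disappearing, the decision needs a documented fallback: use the flag when present, fall back to status-code semantics when it is absent, and never blind-retry a non-idempotent operation. A sketch (all names illustrative):

```java
public class RetryDecision {
    // retryableFlag may be null when the downstream stopped sending it.
    public static boolean shouldRetry(int status, Boolean retryableFlag,
                                      boolean idempotent) {
        if (!idempotent) {
            return false; // never blind-retry a charge without an idempotency key
        }
        if (retryableFlag != null) {
            return retryableFlag; // the explicit contract wins when present
        }
        // Fallback: standard HTTP semantics -- 503 and 429 are transient.
        return status == 503 || status == 429;
    }
}
```

The idempotency check comes first on purpose: it makes the "customer charged 5 times" scenario impossible regardless of what the error body says.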
The Pattern: Error Contracts Are Invisible Dependencies
All five cases share the same root cause:
- The error response format is an implicit contract. It is not documented in OpenAPI (or documented but not enforced). It is not covered by Pact examples. It is not tested in integration tests.
- The upstream builds logic on the error format. Retry policies, user-facing messages, fallback behavior, circuit breaker triggers, and logging all depend on specific fields in the error response.
- The downstream changes the format without notifying consumers. Because the error format is implicit, the change looks like an internal cleanup. No breaking change flag. No API version bump.
- The failure is delayed and amplified. The upstream does not crash immediately. It enters degraded behavior: wrong retry logic, misleading error messages, infinite loops, silent data corruption.
How to Verify Error Contract Stability
Option 1: Explicit error-path integration tests
Write tests that mock downstream errors and verify how the upstream handles them:
@Test
void payment_404_returns_empty_optional() {
when(paymentClient.charge(any()))
.thenThrow(new FeignException.NotFound(
"Not Found",
Request.create(GET, "/api/payments", Map.of(), null, UTF_8),
"{\"code\":\"NOT_FOUND\",\"message\":\"...\"}".getBytes(),
Map.of()
));
Optional<PaymentResult> result = orderService.processPayment("order-123");
assertThat(result).isEmpty();
}
@Test
void payment_500_with_html_does_not_cause_retry_storm() {
when(paymentClient.charge(any()))
.thenThrow(new FeignException.InternalServerError(
"Error",
Request.create(GET, "/api/payments", Map.of(), null, UTF_8),
"<html>Internal Server Error</html>".getBytes(),
Map.of()
));
assertThatThrownBy(() -> orderService.processPayment("order-123"))
.isInstanceOf(PaymentUnavailableException.class);
verify(paymentClient, times(1)).charge(any()); // no retry
}
This works but requires writing a test for every error scenario. In a system with 10 downstream services and 5 error scenarios each, that is 50 tests to write and maintain.
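One way to contain that maintenance cost is to make the scenarios data-driven: a single table of (status, body, expected behavior) rows, exercised by one test loop instead of 50 hand-written methods. A sketch of the idea in plain Java (the ExpectedBehavior values and the scenario rows are illustrative, not an API of any test framework):

```java
import java.util.List;

public class ErrorScenarioTable {
    public enum Expected { EMPTY_RESULT, FAIL_NO_RETRY, RETRY_WITH_BACKOFF }

    public record Scenario(int status, String body, Expected expected) {}

    // One table instead of one test method per scenario; a single loop in
    // the test class feeds each row through the mocked client and asserts
    // on the expected behavior.
    public static final List<Scenario> SCENARIOS = List.of(
            new Scenario(404, "{\"code\":\"NOT_FOUND\"}", Expected.EMPTY_RESULT),
            new Scenario(422, "", Expected.FAIL_NO_RETRY),
            new Scenario(500, "<html>error</html>", Expected.FAIL_NO_RETRY),
            new Scenario(503, "{\"retryable\":true}", Expected.RETRY_WITH_BACKOFF)
    );
}
```

Adding a new downstream error scenario then costs one table row, and JUnit's parameterized tests can consume such a table directly.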
Option 2: Defensive error parsing
Build error handlers that never assume the response body format:
private ErrorResponse parseErrorSafely(Response response) {
try {
if (response.body() == null || response.body().length() == 0) {
return ErrorResponse.unknown(response.status());
}
String contentType = response.headers()
.getOrDefault("Content-Type", List.of("")).get(0);
if (!contentType.contains("application/json")) {
return ErrorResponse.nonJson(response.status(), contentType);
}
return objectMapper.readValue(response.body(), ErrorResponse.class);
} catch (Exception e) {
return ErrorResponse.unparseable(response.status(), e.getMessage());
}
}
This handles the parsing problem but does not detect when the error contract changes. The service continues running, but the business behavior silently changes.
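Defensive parsing becomes more useful when its fallback paths are observable: count how often the error body fails to parse, and alert on that counter. A minimal sketch with a plain AtomicLong standing in for a real metrics library such as Micrometer (names illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ErrorParseMetrics {
    // Stand-in for a real metrics counter: one increment per
    // unexpected-format event, so contract drift shows up on a
    // dashboard instead of staying silent.
    public static final AtomicLong unparseableErrors = new AtomicLong();

    public static void recordUnparseable(int status, String contentType) {
        unparseableErrors.incrementAndGet();
        // In real code: a counter tagged with status and content type
        // per downstream, plus an alert on any nonzero rate.
    }
}
```

Calling recordUnparseable from the catch branch and the non-JSON branch of parseErrorSafely turns a silent behavior change into a signal a human can act on.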
Option 3: Before/after trace comparison
Trace both the happy path and the error path. BitDive captures the full HTTP exchange including error responses. After a code change, trigger the same error scenario (e.g., by sending a request with a non-existent customer ID) and compare the two traces.
The diff shows exactly what changed in the error response: status code, content-type header, body format, field names, and whether the retryable flag disappeared.
See a real before/after trace comparison in the Interactive Demo with Cursor (video).
Catch Error Contract Regressions from Real Traces
BitDive captures real HTTP exchanges including error responses. Compare error behavior before and after code changes: status codes, error bodies, retry headers. Caught from actual API calls, not theoretical schemas.
Try BitDive Free
FAQ
Should I test every possible error from every downstream?
Focus on error scenarios that drive behavior: retries, fallbacks, user-facing messages, and circuit breakers. If your code branches on the error response content, that branch needs a test. If it simply logs and re-throws, the risk is lower.
Is RFC 7807 (Problem Details) the solution?
RFC 7807 standardizes error format, which helps. But standardizing the format does not prevent changes to the content. A service can return RFC 7807-compliant errors and still change the type URI, the status code, or the detail text in ways that break upstream parsing logic. The format is a good foundation. Runtime verification is still needed.
How do I get the downstream to return an error for testing?
Use a non-existent resource ID, an invalid payload, or a revoked auth token. In staging environments, many services have test endpoints or feature flags that simulate errors. The goal is to capture a trace of the error path so you have a baseline to compare against after changes.
Related Reading
- Detecting Inter-Service API Regression -- The broader problem of silent API drift
- Spring Boot Integration Testing -- Full-chain testing with trace replay
- BitDive vs. Contract Testing (Pact) -- Runtime traces vs. static contracts
- Glossary: API Regression -- Definition and related terms
