Tag: Saga Pattern

Saga Orchestration vs Choreography on Hazelcast

Part 10 in the “Building Event-Driven Microservices with Hazelcast” series

Back in Part 4, we built a choreographed saga for order fulfillment. Four services — Order, Inventory, Payment, and Account — coordinate through Hazelcast ITopic events. Each service reacts independently, no central coordinator. The flow is implicit, spread across three saga listeners, a compensation registry, and a timeout detector.

That works. It works well, actually, for loosely coupled flows where services don’t need to know about each other. But I kept running into the same questions: what if you need the whole saga visible in one place? What if you need per-step timeout and retry? What if the caller wants to wait for the saga to finish before responding?

So the framework now supports a second saga architectural pattern — the orchestrated saga. This post compares both, walks through the implementation, and shows how they run side by side in the same system.

When to Use Which

Neither pattern wins across the board. It depends on what you need:

Requirement	Choreography	Orchestration
Services should be fully decoupled	Best
Need the whole flow in one readable file		Best
Caller needs a synchronous response		Best
High throughput (thousands of sagas/sec)	Best
Per-step timeout and retry		Best
Services evolve independently	Best
Complex branching or conditional logic		Best
No single point of failure	Best

Choreography is the better fit when services publish events for many consumers, not just one saga. Adding a new saga consumer doesn’t require changing any existing service — you just stand up a new listener.

Orchestration is the better fit when the saga is a well-defined workflow with a clear owner and the caller (a REST endpoint, typically) wants to return the result directly. Order fulfillment is a textbook example.

Architecture Comparison

Choreographed Flow

Choreographed saga flow: the caller receives 202 Accepted immediately from the Order Service, then OrderCreated, StockReserved, and PaymentProcessed events propagate as asynchronous Hazelcast ITopic messages through Inventory, Payment, and Account services with no central coordinator

Every arrow is an asynchronous event on the shared Hazelcast cluster. The caller gets back a 202 Accepted immediately. The saga completes whenever it completes.

Orchestrated Flow

Orchestrated saga flow: the caller invokes the Saga Orchestrator and waits for the final result while the orchestrator makes synchronous HTTP calls for CreateOrder, ReserveStock, ProcessPayment, and ConfirmOrder to the Order, Inventory, Payment, and Account services

Every arrow is a synchronous HTTP call. The orchestrator waits for each step to finish before moving to the next. The caller gets the final result — success or failure — in the response.

The visual difference tells you most of what you need to know. Choreography is a chain. Orchestration is a hub with spokes.

Implementation Comparison

Choreography: Saga Listeners

In the choreographed pattern, each service has a listener subscribed to events on the shared Hazelcast cluster:

@Component
public class InventorySagaListener {

    public InventorySagaListener(
            @Qualifier("hazelcastClient") HazelcastInstance hazelcast,
            InventoryService inventoryService,
            SagaStateStore sagaStateStore) {

        ITopic<GenericRecord> topic = hazelcast.getTopic("OrderCreated");
        topic.addMessageListener(message -> {
            GenericRecord event = message.getMessageObject();
            String sagaId = event.getString("sagaId");

            // Guard: only process OrderFulfillment sagas
            if (!"OrderFulfillment".equals(event.getString("sagaType"))) return;

            // Perform local action
            inventoryService.reserveStockForSaga(productId, quantity, ...);

            // Update saga state and publish next event
            sagaStateStore.updateOrAddStep(sagaId, 1, StepStatus.COMPLETED);
            hazelcast.getTopic("StockReserved").publish(nextEvent);
        });
    }
}

Three listeners across three services, wired together only by event names. To understand the full flow, you read code in three different modules. Compensation is handled by a CompensationRegistry that maps forward events to their compensating counterparts:

registry.register("OrderCreated", "OrderCancelled", "order-service");
registry.register("StockReserved", "StockReleased", "inventory-service");
registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");

Orchestration: SagaDefinition Builder

The orchestrated version puts the entire saga in one place:

@Component
public class OrderFulfillmentSagaFactory {

    public SagaDefinition create() {
        return SagaDefinition.builder()
                .name("OrderFulfillmentOrchestrated")

                .step("CreateOrder")
                    .action(this::createOrderAction)
                    .compensation(this::createOrderCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ReserveStock")
                    .action(this::reserveStockAction)
                    .compensation(this::reserveStockCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ProcessPayment")
                    .action(this::processPaymentAction)
                    .compensation(this::processPaymentCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ConfirmOrder")
                    .action(this::confirmOrderAction)
                    .noCompensation()
                    .timeout(Duration.ofSeconds(10))
                    .build()

                .sagaTimeout(Duration.ofSeconds(60))
                .build();
    }
}

Four steps, forward actions, compensations, timeouts — all readable in one file. You pay for that readability: the Order Service now has direct knowledge of the Inventory and Payment services’ HTTP endpoints.

The Orchestrator State Machine

HazelcastSagaOrchestrator is the engine that executes a SagaDefinition. The execution flow:

HazelcastSagaOrchestrator execution flow: start records the saga and schedules a 60-second timeout, then each step runs its action under a 15-second timeout, branching to SUCCESS which merges context and executes the next step, FAILURE after retries which compensates in reverse order, or TIMEOUT which compensates in reverse order

Internally, a SagaExecution instance tracks the running state: current step index, completed step names, an AtomicBoolean for compensation (preventing a race between step failure and the saga-level timeout firing at the same moment), and step start timestamps for duration metrics.

Per-Step Retry

Each SagaStep can configure maxRetries and retryDelay. When a step fails, the orchestrator checks if retries remain. If so, it waits retryDelay milliseconds and re-executes. If not, compensation kicks in.

This is separate from the Resilience4j circuit breakers that the choreographed saga listeners use. Different communication styles, different retry mechanisms.

Why HTTP Instead of ITopic?

You might wonder why the orchestrator makes HTTP calls to the other services instead of publishing events on Hazelcast ITopic.

Two reasons.

First, request-response semantics. The orchestrator needs to know whether each step succeeded before proceeding to the next. ITopic is fire-and-forget — there’s no built-in way for a publisher to wait for a consumer’s response. HTTP gives you synchronous request-response for free.

Second, our dual-instance architecture. Each service runs an embedded Hazelcast instance for Jet pipelines and a client to the shared cluster for cross-service events. Jet pipeline lambdas reference service-specific classes that can’t serialize across services — that’s the whole reason for the dual-instance design (see Part 5 for where this architecture first bit us). HTTP sidesteps Hazelcast serialization entirely. Each service processes the request in its own JVM with full access to its own classes.

The SagaServiceClient wraps these calls:

public class SagaServiceClient implements SagaServiceClientOperations {

    public OrchestratedStepResponse reserveStock(
            String productId, int quantity, String orderId) {
        // POST /api/saga/inventory/reserve-stock
        return restTemplate.postForObject(
                inventoryServiceUrl + "/api/saga/inventory/reserve-stock",
                request, OrchestratedStepResponse.class);
    }

    public OrchestratedStepResponse processPayment(
            String orderId, String customerId,
            double amount, String currency, String method) {
        // POST /api/saga/payment/process
        return restTemplate.postForObject(
                paymentServiceUrl + "/api/saga/payment/process",
                request, OrchestratedStepResponse.class);
    }
}

Each remote service exposes dedicated saga endpoints (like /api/saga/inventory/reserve-stock) that return an OrchestratedStepResponse — a success/failure envelope. The SagaServiceClient implements the SagaServiceClientOperations interface, which exists so Mockito can mock it on Java 25. (Mockito’s inline mock maker can’t mock concrete classes there. We hit this in several places — extract an interface, move on.)

Compensation: Two Approaches

Choreography: Event-Based

When a step fails, the SagaCompensator looks up the CompensationRegistry and publishes compensation events via ITopic:

Each service processes its own compensation event independently. The SagaTimeoutDetector can also trigger this if a saga exceeds its deadline.

Orchestration: Lambda-Based

When a step fails, the orchestrator walks completed steps in reverse and executes their compensation lambdas directly:

No events, no registry. The compensation logic sits right next to the forward action in the SagaDefinition. The final step (ConfirmOrder) uses .noCompensation() — there’s nothing to undo once you’ve confirmed.

If a compensation step itself fails, the orchestrator marks the saga as FAILED rather than COMPENSATED. That means manual intervention. It’s not a situation you want, but at least you know about it immediately rather than discovering it later in an audit.

Running Both Simultaneously

Both patterns coexist in the same system. No interference.

Pattern	Saga Type	REST Endpoint
Choreography	OrderFulfillment	POST /api/orders
Orchestration	OrderFulfillmentOrchestrated	POST /api/orders/orchestrated

The key to coexistence is the sagaType field. Choreographed saga listeners filter on it — when the orchestrated flow creates an OrderCreated event, the InventorySagaListener ignores it because the type is “OrderFulfillmentOrchestrated”, not “OrderFulfillment”.

Both patterns write to the same SagaStateStore (a Hazelcast IMap), so you can query across both or filter:

# All sagas
curl http://localhost:8083/api/sagas

# Only choreographed
curl http://localhost:8083/api/sagas?type=OrderFulfillment

# Only orchestrated
curl http://localhost:8083/api/sagas?type=OrderFulfillmentOrchestrated

The MCP list_sagas tool supports the same filter:

list_sagas(type="OrderFulfillmentOrchestrated", status="COMPLETED")

Observability

The orchestrator records a saga.step.duration timer for every step:

private void recordStepDuration(SagaExecution exec, String stepName) {
    if (sagaMetrics != null && exec.stepStartedAt != null) {
        Duration stepDuration = Duration.between(exec.stepStartedAt, Instant.now());
        sagaMetrics.recordStepDuration(
                exec.definition.getName(), stepName, stepDuration);
    }
}

Tagged with sagaType and stepName, so you can query individual steps:

# p95 duration for the ProcessPayment step
histogram_quantile(0.95,
  rate(saga_step_duration_seconds_bucket{
    sagaType="OrderFulfillmentOrchestrated",
    stepName="ProcessPayment"
  }[5m]))

Choreographed sagas track overall saga_duration_seconds but not per-step timing — the flow is distributed across services, so there’s no single place to measure each step. That’s a genuine observability trade-off between the two patterns.

The Grafana saga dashboard has a Choreography vs Orchestration row: p50/p95 duration comparison, success/failure rates per pattern, and an orchestrated step breakdown showing where time goes across CreateOrder, ReserveStock, ProcessPayment, and ConfirmOrder.

The MCP run_demo tool includes orchestrated scenarios:

Scenario	Pattern	Expected Outcome
happy_path	Choreographed	Order confirmed via events
payment_failure	Choreographed	Stock released via compensation events
orchestrated_happy_path	Orchestrated	Order confirmed via HTTP, sync response
orchestrated_payment_failure	Orchestrated	Stock released via reverse compensation, 409 response

The Summary Table

	Choreography	Orchestration
Communication	Hazelcast ITopic events	HTTP calls
Flow definition	Distributed across listeners	Centralized in SagaDefinition
Compensation	CompensationRegistry + event publishing	Reverse-order lambda execution
Timeout handling	SagaTimeoutDetector (scheduled)	Per-step + saga-level timeouts
Response model	Async (202 Accepted)	Sync (201 Created or 409 Conflict)
Retry	Resilience4j (circuit breaker + retry)	Built-in per-step retry with delay
Metrics	Saga-level duration	Saga-level + per-step duration
Saga type	OrderFulfillment	OrderFulfillmentOrchestrated

Choreography is still the right default for most event-sourced systems — it preserves service independence and scales naturally. Orchestration earns its place when you need the flow readable in one file, synchronous responses, and fine-grained per-step control.

Both patterns share the same SagaStateStore, the same Grafana dashboards, and the same MCP tools. Pick the right one for each saga, or run both and compare.

Previous: Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 6, 2026

Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Part 9 in the “Building Event-Driven Microservices with Hazelcast” series

Over the past two articles, we built resilience into both sides of our saga communication. Part 7 added circuit breakers and retry to protect saga listeners against transient failures during event consumption. Part 8 added the transactional outbox to guarantee event delivery from producer to shared cluster.

Two gaps remain.

First: what happens when an event fails processing permanently? The circuit breaker exhausts retries. NonRetryableException gets thrown. The event is gone — all that survives is a log message. There’s no way to inspect what failed, understand why, or retry it later when someone fixes the underlying problem.

Second: what happens when the outbox delivers an event twice? At-least-once delivery means duplicates are possible. Without protection, the Inventory Service might reserve stock twice for the same order. The Payment Service might charge the customer twice.

This article covers two complementary patterns that close these gaps. The dead letter queue captures events that fail consumer-side processing, giving operators a way to inspect, replay, and discard them. The idempotency guard ensures each event is processed exactly once, even if delivered multiple times.

Put them together with the outbox and you get effectively-once semantics — at-least-once delivery on the producer side, exactly-once processing on the consumer side. That’s the gold standard for event-driven systems.

Part 1: Dead Letter Queue

The Problem

Consider this failure sequence in the Inventory Service’s saga listener:

Permanent failure without a dead letter queue: an OrderCreated event arrives, the InventorySagaListener picks it up and calls executeWithResilience, a non-retryable InsufficientStockException is thrown, the circuit breaker records the failure, a ResilienceException propagates, the error is logged, and the event is gone with only a log line surviving

That log message? That’s all you’ve got. In production, recovering from this means searching logs for the event ID, reconstructing the event payload from other sources, manually fixing whatever went wrong, and then figuring out how to re-trigger the saga step.

A dead letter queue captures the failed event — payload, failure reason, saga context, source service, everything — in a durable store that you can actually query and act on.

The DeadLetterEntry

Each DLQ entry preserves the full failure context:

public class DeadLetterEntry {

    private String dlqEntryId;       // UUID — unique DLQ identifier
    private String originalEventId;  // The event that failed
    private String eventType;        // "OrderCreated", "StockReserved", etc.
    private String topicName;        // The ITopic where the event was published
    private GenericRecord eventRecord; // The complete event payload for replay
    private String failureReason;    // Why processing failed
    private Instant failureTimestamp; // When the failure occurred
    private String sourceService;    // Which service failed ("inventory-service")
    private String sagaId;           // Saga context for tracing
    private String correlationId;    // Correlation context for tracing
    private int replayCount;         // How many times this entry has been replayed
    private Status status;           // PENDING, REPLAYED, or DISCARDED

    public enum Status {
        PENDING,    // Awaiting review or replay
        REPLAYED,   // Re-published to original topic
        DISCARDED   // Manually discarded by administrator
    }
}

Construction at the failure site uses a builder:

DeadLetterEntry.builder()
    .originalEventId(eventId)
    .eventType(record.getString("eventType"))
    .topicName("OrderCreated")
    .eventRecord(record)
    .failureReason(error.getMessage())
    .sourceService("inventory-service")
    .sagaId(record.getString("sagaId"))
    .correlationId(record.getString("correlationId"))
    .build();

The eventRecord field is the important one — it holds the complete GenericRecord that was published to the ITopic. When you replay the entry, this exact record gets re-published to the original topic, picking the saga back up where it left off.

The DeadLetterQueueOperations Interface

Same interface-extraction pattern we used for ResilientOperations and ServiceClientOperations (Java 25 Mockito can’t mock concrete classes, so we keep extracting interfaces — it’s becoming a running theme):

public interface DeadLetterQueueOperations {

    void add(DeadLetterEntry entry);

    List<DeadLetterEntry> list(int limit);

    DeadLetterEntry getEntry(String dlqEntryId);

    void replay(String dlqEntryId);

    void discard(String dlqEntryId);

    long count();
}

HazelcastDeadLetterQueue: IMap-Backed Storage

The implementation stores DLQ entries as Compact-serialized GenericRecords in a Hazelcast IMap — same pattern as the HazelcastOutboxStore:

public class HazelcastDeadLetterQueue implements DeadLetterQueueOperations {

    private static final String SCHEMA_NAME = "DeadLetterEntry";
    private final HazelcastInstance hazelcast;
    private final IMap<String, GenericRecord> dlqMap;
    private final DeadLetterQueueProperties properties;
    private final MeterRegistry meterRegistry;

    public HazelcastDeadLetterQueue(HazelcastInstance hazelcast,
                                     DeadLetterQueueProperties properties,
                                     MeterRegistry meterRegistry) {
        this.hazelcast = hazelcast;
        this.dlqMap = hazelcast.getMap(properties.getMapName());
        // ...
    }
}

The DLQ map lives on the shared cluster (falling back to the embedded instance if there’s no shared cluster), so it’s accessible from any service’s admin endpoint. You don’t need to know which service failed — query the DLQ from anywhere and you’ll see everything.

POJO-to-GenericRecord Conversion

Like the outbox store, conversion happens at the boundary:

static GenericRecord toRecord(final DeadLetterEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("dlqEntryId", entry.getDlqEntryId())
            .setString("originalEventId", entry.getOriginalEventId())
            .setString("eventType", entry.getEventType())
            .setString("topicName", entry.getTopicName())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setString("failureReason", entry.getFailureReason())
            .setInt64("failureTimestamp", entry.getFailureTimestamp().toEpochMilli())
            .setString("sourceService", entry.getSourceService())
            .setString("sagaId", entry.getSagaId())
            .setString("correlationId", entry.getCorrelationId())
            .setInt32("replayCount", entry.getReplayCount())
            .setString("status", entry.getStatus().name())
            .build();
}

Note setGenericRecord(“eventRecord”, …) — Compact serialization handles nested GenericRecords natively. The full event payload comes along for the ride without any special serialization work on our part.

Replay

This is where the DLQ earns its keep. Once you’ve figured out what went wrong and fixed it — restocked inventory, restarted a flaky service, whatever — you replay the entry:

@Override
public void replay(final String dlqEntryId) {
    final GenericRecord record = dlqMap.get(dlqEntryId);
    if (record == null) {
        throw new IllegalArgumentException("DLQ entry not found: " + dlqEntryId);
    }

    final DeadLetterEntry entry = fromRecord(record);

    if (entry.getStatus() != DeadLetterEntry.Status.PENDING) {
        throw new IllegalStateException(
                "Cannot replay entry in status " + entry.getStatus());
    }
    if (entry.getReplayCount() >= properties.getMaxReplayAttempts()) {
        throw new IllegalStateException(
                "Max replay attempts (" + properties.getMaxReplayAttempts() + ") exceeded");
    }

    // Re-publish to the original topic
    final GenericRecord eventRecord = entry.getEventRecord();
    if (eventRecord != null && entry.getTopicName() != null) {
        final ITopic<GenericRecord> topic = hazelcast.getTopic(entry.getTopicName());
        topic.publish(eventRecord);
    }

    // Update entry status
    entry.setReplayCount(entry.getReplayCount() + 1);
    entry.setStatus(DeadLetterEntry.Status.REPLAYED);
    dlqMap.set(dlqEntryId, toRecord(entry));

    meterRegistry.counter("dlq.entries.replayed").increment();
}

A few safety guards here. Only PENDING entries can be replayed — you can’t accidentally replay something that was already replayed or discarded. There’s a configurable max replay count (default 3) to prevent infinite replay loops if the underlying issue isn’t actually fixed. And if the eventRecord is somehow null (shouldn’t happen, but defensive coding), the status updates without attempting a publish.

Monitoring Queue Depth

The count() method uses a Hazelcast predicate to count only PENDING entries:

@Override
public long count() {
    final Collection<GenericRecord> pending = dlqMap.values(
            Predicates.equal("status", DeadLetterEntry.Status.PENDING.name()));
    return pending.size();
}

A DLQ count above zero for more than a few minutes is a flag that something needs attention. Wire this to an alert and you’ll know about failed events before anyone files a ticket.

Admin REST Endpoints

The DeadLetterQueueController exposes the DLQ through REST:

@RestController
@RequestMapping("/api/admin/dlq")
@Tag(name = "Dead Letter Queue")
public class DeadLetterQueueController {

    @GetMapping
    public ResponseEntity<List<DeadLetterEntry>> list(
            @RequestParam(defaultValue = "20") int limit) {
        return ResponseEntity.ok(deadLetterQueue.list(limit));
    }

    @GetMapping("/count")
    public ResponseEntity<Map<String, Long>> count() {
        return ResponseEntity.ok(Map.of("count", deadLetterQueue.count()));
    }

    @GetMapping("/{id}")
    public ResponseEntity<DeadLetterEntry> getEntry(@PathVariable String id) { ... }

    @PostMapping("/{id}/replay")
    public ResponseEntity<Map<String, String>> replay(@PathVariable String id) { ... }

    @DeleteMapping("/{id}")
    public ResponseEntity<Map<String, String>> discard(@PathVariable String id) { ... }
}

A typical investigation looks like this:

# How many pending entries?
curl http://localhost:8082/api/admin/dlq/count
# {"count": 2}

# What are they?
curl http://localhost:8082/api/admin/dlq
# [{"dlqEntryId":"abc-123", "originalEventId":"evt-456",
#   "eventType":"OrderCreated", "failureReason":"Insufficient stock for product PROD-789",
#   "sourceService":"inventory-service", "status":"PENDING", ...}]

# Get the full details on one
curl http://localhost:8082/api/admin/dlq/abc-123

# Fix the problem (restock inventory), then replay
curl -X POST http://localhost:8082/api/admin/dlq/abc-123/replay
# {"status":"replayed", "dlqEntryId":"abc-123"}

# Or discard if the saga already timed out and compensation ran
curl -X DELETE http://localhost:8082/api/admin/dlq/abc-123
# {"status":"discarded", "dlqEntryId":"abc-123"}

Integration with Saga Listeners

Each saga listener injects the DLQ as an optional dependency:

@Autowired(required = false)
public void setDeadLetterQueue(DeadLetterQueueOperations deadLetterQueue) {
    this.deadLetterQueue = deadLetterQueue;
}

Failed events get routed to the DLQ in the error handler:

private void sendToDeadLetterQueue(GenericRecord record, String topicName, Throwable error) {
    String eventId = record.getString("eventId");
    if (deadLetterQueue != null) {
        try {
            deadLetterQueue.add(DeadLetterEntry.builder()
                    .originalEventId(eventId)
                    .eventType(record.getString("eventType"))
                    .topicName(topicName)
                    .eventRecord(record)
                    .failureReason(error.getMessage())
                    .sourceService("inventory-service")
                    .sagaId(record.getString("sagaId"))
                    .correlationId(record.getString("correlationId"))
                    .build());
            logger.warn("Event {} sent to DLQ after failure: {}", eventId, error.getMessage());
        } catch (Exception dlqError) {
            logger.error("Failed to send event {} to DLQ: {}", eventId, dlqError.getMessage());
        }
    } else {
        // Fallback: existing behavior (log only)
        if (error instanceof ResilienceException) {
            logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
        } else {
            logger.error("Failed to process event: {}", eventId, error);
        }
    }
}

The try/catch around deadLetterQueue.add() is defensive. If the DLQ itself fails — shared cluster unreachable, say — we fall back to logging. The DLQ is best-effort, not a hard requirement. Losing an event and failing to capture it in the DLQ would be truly unlucky, but it shouldn’t bring the service down.

Part 2: Idempotency Guard

The Problem

The transactional outbox gives us at-least-once delivery. Combined with ITopic’s own delivery behavior (listeners that reconnect after a brief disconnection may receive messages again), the same event can arrive at a consumer more than once:

Duplicate delivery sequence: the OutboxPublisher publishes OrderCreated to the shared cluster, which forwards it to the Inventory Listener; the markDelivered call times out, so on the next poll cycle the publisher re-publishes the same event, and without protection the Inventory Listener performs a double stock reservation

Without protection, inventory gets reserved twice. The customer gets charged twice. The order gets confirmed twice. Nobody wants that.

Atomic Check-and-Claim

The fix is Hazelcast’s putIfAbsent — an atomic, cluster-wide check-and-set that ensures each event ID gets processed exactly once:

public class HazelcastIdempotencyGuard implements IdempotencyGuard {

    private final IMap<String, Long> processedEventsMap;
    private final long ttlMillis;
    private final MeterRegistry meterRegistry;

    public HazelcastIdempotencyGuard(HazelcastInstance hazelcast,
                                      IdempotencyProperties properties,
                                      MeterRegistry meterRegistry) {
        this.processedEventsMap = hazelcast.getMap(properties.getMapName());
        this.ttlMillis = properties.getTtl().toMillis();
        this.meterRegistry = meterRegistry;
    }

    @Override
    public boolean tryProcess(final String eventId) {
        Long previous = processedEventsMap.putIfAbsent(
                eventId, System.currentTimeMillis(), ttlMillis, TimeUnit.MILLISECONDS);

        boolean firstTime = (previous == null);
        meterRegistry.counter("idempotency.checks",
                "result", firstTime ? "miss" : "hit").increment();

        if (!firstTime) {
            logger.debug("Duplicate event detected: eventId={}", eventId);
        }

        return firstTime;
    }
}

The interface is one method:

public interface IdempotencyGuard {
    boolean tryProcess(String eventId);
}

Returns true if this is the first time the event ID has been seen — go ahead and process it. Returns false if someone already claimed it — skip.

How putIfAbsent Works

IMap.putIfAbsent(key, value, ttl, timeUnit) is atomic. If the key doesn’t exist, it inserts the pair and returns null. If it does exist, it returns the existing value and does nothing. This atomicity holds across cluster members — two listeners on different nodes processing the same event simultaneously will never both get null. Exactly one wins, the other backs off.

TTL: Forgetting Old Events

The putIfAbsent includes a TTL (default: 1 hour). After that, the event ID is removed from the map, and the same event could theoretically be reprocessed if it somehow arrived again.

Why an hour? It’s a memory management decision. Without a TTL, the processed events map grows forever. With a 1-hour window, we hold at most an hour’s worth of event IDs, which is bounded and predictable. Since our outbox publisher has a 1-second poll interval with 5 retries, duplicates arrive within seconds — an hour of margin is more than sufficient.

Integration with Saga Listeners

Each saga listener checks the guard at the top of its message handler:

class OrderCreatedListener implements MessageListener<GenericRecord> {

    @Override
    public void onMessage(Message<GenericRecord> message) {
        GenericRecord record = message.getMessageObject();

        String eventId = record.getString("eventId");
        if (idempotencyGuard != null && eventId != null
                && !idempotencyGuard.tryProcess(eventId)) {
            logger.debug("Duplicate event {} already processed, skipping", eventId);
            return;
        }

        // ... proceed with normal processing
    }
}

Three null checks for graceful degradation: if idempotency isn’t configured, process everything (no deduplication). If the event doesn’t have an ID, skip the check. If tryProcess() returns false, it’s a duplicate — drop it silently.

The Processed Events Map

The map lives on the shared cluster, so deduplication works across all service instances:

Key	Value	TTL
evt-abc-123	1738000000000 (timestamp)	1 hour
evt-def-456	1738000001000	1 hour
evt-ghi-789	1738000002000	1 hour

The value — a processing timestamp — is purely informational. Only the key’s presence or absence matters for deduplication. But the timestamp is handy for debugging: it tells you exactly when an event was first processed.

How the Three Patterns Work Together

The outbox, DLQ, and idempotency guard form a complete reliability pipeline:

The three patterns working together: on the producer side the EventSourcingController writes to the OUTBOX IMap, the OutboxPublisher polls and publishes to the shared ITopic, then marks the entry delivered; on the consumer side the saga listener checks the IdempotencyGuard (duplicates are skipped), runs executeWithResilience, and on failure routes the event to the framework_DLQ IMap for admin replay or discard

Let’s walk through what happens when things go wrong.

The OrderCreated event comes out of the Jet pipeline and gets written to the outbox. The OutboxPublisher picks it up, publishes to the shared cluster’s OrderCreated ITopic, and tries to mark it DELIVERED. But the markDelivered call times out. Next poll cycle, the publisher re-publishes the same event. Now it’s been delivered twice.

Over on the consumer side, the Inventory Service’s OrderCreatedListener receives both copies. The first call to idempotencyGuard.tryProcess(“evt-123”) returns true — process it. The second call returns false — duplicate, skip it. Only one stock reservation happens.

But that first delivery hits a problem: the product is out of stock. InsufficientStockException is non-retryable. The circuit breaker records the failure, ResilienceException propagates up to whenComplete(), and sendToDeadLetterQueue() captures everything — the full event payload, the failure reason, the saga ID, the source service. It’s all sitting in the framework_DLQ IMap, waiting.

An operator (or an LLM, as we’ll see in a moment) checks the DLQ, sees the pending entry, restocks the product, and replays the event. The OrderCreated record gets re-published to the ITopic, the saga picks up, and the order completes.

One wrinkle: the replayed event carries the same eventId as the original. If the 1-hour idempotency TTL hasn’t expired yet, the guard will block it as a duplicate. In practice this isn’t an issue — by the time you’ve investigated the failure, diagnosed the root cause, and fixed it, an hour has usually passed. It’s a deliberate trade-off: short-window deduplication versus immediate replay. We chose deduplication.

Configuration Reference

Dead Letter Queue: framework.dlq.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_DLQ	IMap name on shared cluster
max-replay-attempts	3	Maximum replays before permanent block
entry-ttl	168h	7-day retention for DLQ entries

Idempotency Guard: framework.idempotency.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_PROCESSED_EVENTS	IMap name on shared cluster
ttl	1h	How long to remember processed event IDs

Metrics

Metric	Type	Description
dlq.entries.added	Counter	Events added to the DLQ
dlq.entries.replayed	Counter	Events replayed from the DLQ
dlq.entries.discarded	Counter	Events discarded from the DLQ
idempotency.checks	Counter (tagged: result=hit\|miss)	Deduplication checks

The Complete Resilience Stack

Across Parts 7, 8, and 9, we’ve built five interlocking patterns:

Pattern	Layer	Purpose	Protects Against
Circuit Breaker	Consumer	Automatic service isolation	Cascade failures
Retry + Backoff	Consumer	Transient failure recovery	Network blips, brief outages
Transactional Outbox	Producer	Guaranteed delivery	Shared cluster unavailability
Dead Letter Queue	Consumer	Failure capture and replay	Permanent processing failures
Idempotency Guard	Consumer	Exactly-once processing	Duplicate delivery

They’re all optional — enabled by default, disabled with a single property toggle. They’re all auto-configured by Spring Boot. They all expose Micrometer metrics. And when disabled, the framework falls back to its previous behavior without breaking anything.

Three articles ago, we had a fire-and-forget event pipeline where a network blip could lose an event forever. Now we have guaranteed delivery, deduplication, failure capture, and replay. Same pipeline, five patterns later.

Try It Yourself

The demo script includes a complete DLQ investigation scenario — fault injection, failure capture, investigation, and replay — in 11 guided steps:

./scripts/demo-scenarios.sh 7

That’s the curl-based version. No LLM required.

The AI-Powered Version

This is more fun. Connect the MCP server from Part 6 to your LLM client — Claude Desktop, Claude Code, ChatGPT, whatever you’ve got — and try this prompt:

“Run the DLQ investigation demo — inject a failure, place an order, and show me what’s in the dead letter queue.”

The LLM calls runDemo to set up the scenario, then listDlqEntries and inspectDlqEntry to investigate. It tells you what happened — which event failed, at which service, and why — and suggests a fix. You say “replay it.” It calls replayDlqEntry, the saga completes, and you’ve just done incident response through a conversation.

No curl commands. No JSON parsing. No copy-pasting UUIDs. The LLM handles the plumbing while you make the decisions.

If the LLM already has context from earlier in the session, a shorter version works:

“Run the dlq_investigation demo scenario and tell me what you find.”

Next up: Choreography vs Orchestration: Two Saga Patterns

Previous: Hazelcast Transactional Outbox: Guaranteed Delivery

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 29, 2026

Hazelcast Transactional Outbox: Guaranteed Delivery

Part 8 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

In Part 7, we added circuit breakers and retry to protect saga listeners from transient failures on the consumer side. That covers what happens when a service receives an event and can’t process it. But we haven’t talked about what happens when the event never leaves the building.

Quick refresher on our dual-instance architecture: each service runs an embedded Hazelcast instance for local Jet pipeline processing and a client connected to the shared cluster for cross-service ITopic communication. After the pipeline processes an event, the EventSourcingController republishes it to the shared cluster so saga listeners in other services can react.

That republish step? It was a fire-and-forget call:

// The old approach — fragile
try {
    ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
    topic.publish(pending.eventRecord);
} catch (Exception e) {
    logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
    // Event is permanently lost!
}

If the shared cluster is unreachable — network partition, cluster restart, someone tripping over the power cable — the event vanishes. The saga never progresses. Eventually the saga timeout detector marks it as failed, but by then the original event data is gone and there’s nothing to retry.

The Transactional Outbox Pattern fixes this. Instead of publishing directly to the shared cluster, the controller writes the event to a local outbox — an IMap on the embedded Hazelcast instance — and a separate publisher component picks it up and delivers it. If delivery fails, the entry stays in the outbox and gets retried.

Why Direct Publishing Fails

The problem is fundamental. Publishing to an external system (the shared cluster) and completing a local operation (the Jet pipeline) are two separate operations that can’t be made atomic.

Failure timeline for direct publishing — the Jet pipeline updates the local event store and materialized view, but the publish to the shared cluster ITopic fails on a network partition and the event is lost with nothing left to retry

The event is safely stored in the local event store and materialized view, but the cross-service notification is lost. You could retry in place, but that blocks the Jet pipeline for all events. You could schedule an async retry, but if the process restarts, that retry state is gone too.

The outbox pattern trades immediate delivery for guaranteed delivery. Write to a durable local store, deliver asynchronously, retry until it works. It’s the standard solution in event-driven architectures for good reason.

Architecture

The outbox IMap lives on the embedded Hazelcast instance — the same instance that hosts the event store and materialized views. Writing to it is a local operation. If the embedded instance is up (and it must be, since the pipeline just ran), the outbox write succeeds.

The OutboxEntry

Each outbox entry captures everything needed to deliver the event later:

public class OutboxEntry {

    private String eventId;          // Matches the domain event's eventId
    private String eventType;        // ITopic name (e.g., "OrderCreated")
    private GenericRecord eventRecord; // The serialized event to publish
    private int retryCount;          // Delivery attempts so far
    private Status status;           // PENDING, DELIVERED, or FAILED
    private Instant createdAt;       // When the entry was created
    private Instant lastAttemptAt;   // When the last delivery attempt occurred
    private String failureReason;    // Most recent failure message

    public enum Status {
        PENDING,    // Awaiting delivery
        DELIVERED,  // Successfully published to shared cluster
        FAILED      // Permanently failed after max retries
    }
}

The eventRecord field is the full GenericRecord that needs to go to the shared cluster’s ITopic — same record the Jet pipeline produces, complete with saga metadata like sagaId and correlationId.

OutboxStore: The Interface

Six methods covering the full lifecycle:

public interface OutboxStore {

    void write(OutboxEntry entry);

    List<OutboxEntry> pollPending(int maxBatchSize);

    void markDelivered(String eventId);

    void markFailed(String eventId, String reason);

    void incrementRetryCount(String eventId, String failureReason);

    long pendingCount();
}

Provider-agnostic. The Hazelcast implementation uses an IMap, but the interface could just as easily sit in front of a database table.

HazelcastOutboxStore

The Hazelcast implementation stores entries as Compact-serialized GenericRecord values in an IMap:

public class HazelcastOutboxStore implements OutboxStore {

    private static final String SCHEMA_NAME = "OutboxEntry";
    private final IMap<String, GenericRecord> outboxMap;

    public HazelcastOutboxStore(HazelcastInstance hazelcast, MeterRegistry meterRegistry) {
        this.outboxMap = hazelcast.getMap(DEFAULT_MAP_NAME);
    }
}

You might wonder why we’re using GenericRecord instead of storing OutboxEntry Java objects directly. The problem is that OutboxEntry has an Instant field and a nested GenericRecord — neither of which Hazelcast’s zero-config Compact serialization can handle. We’d need a custom CompactSerializer registered on every Hazelcast instance configuration. Instead, we convert at the boundary:

static GenericRecord toRecord(final OutboxEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("eventId", entry.getEventId())
            .setString("eventType", entry.getEventType())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setInt32("retryCount", entry.getRetryCount())
            .setString("status", entry.getStatus().name())
            .setInt64("createdAt", entry.getCreatedAt().toEpochMilli())
            .setNullableInt64("lastAttemptAt",
                    entry.getLastAttemptAt() != null
                            ? entry.getLastAttemptAt().toEpochMilli() : null)
            .setString("failureReason", entry.getFailureReason())
            .build();
}

A few things going on here. Instant becomes int64 epoch millis — compact, sortable, unambiguous. lastAttemptAt uses setNullableInt64 because it’s null until the first delivery attempt. The nested eventRecord uses setGenericRecord, which Compact handles natively. And status is stored as the enum name string, which makes it readable in Management Center and queryable with Predicates.equal().

Polling uses a Hazelcast predicate to filter by status, sorted by creation time so the oldest entries are delivered first:

@Override
public List<OutboxEntry> pollPending(final int maxBatchSize) {
    final Collection<GenericRecord> pending = outboxMap.values(
            Predicates.equal("status", OutboxEntry.Status.PENDING.name()));

    return pending.stream()
            .map(HazelcastOutboxStore::fromRecord)
            .sorted(Comparator.comparing(OutboxEntry::getCreatedAt))
            .limit(maxBatchSize)
            .collect(Collectors.toList());
}

The OutboxPublisher

The publisher bridges the outbox and the shared cluster. The obvious approach is to poll on a fixed interval — once per second, say — but that adds latency we don’t need. We know exactly when a new entry arrives.

Event-Driven Wake-Up

The publisher uses a Semaphore to sleep until someone signals it:

public class OutboxPublisher {

    private final Semaphore wakeUp = new Semaphore(0);

    public void notifyNewEntry() {
        // Release at most 1 permit — avoids unbounded accumulation
        if (wakeUp.availablePermits() == 0) {
            wakeUp.release();
        }
    }

    public boolean waitForWork() {
        try {
            return wakeUp.tryAcquire(
                    properties.getPollInterval().toMillis(),
                    TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

When the EventSourcingController writes an outbox entry, it calls notifyNewEntry() right after. The publisher wakes up, claims all pending entries, delivers them. Under normal conditions, the time from event creation to shared-cluster delivery is sub-millisecond.

The poll interval (default 1 second) is the safety net. If a signal gets missed — maybe the publisher was busy with a previous batch — the timeout ensures nothing sits around for too long.

This is a JVM-local semaphore, not a distributed one. That’s fine. When the service scales to multiple replicas with per-service clustering (ADR 013), each replica has its own publisher. The semaphore wakes the local publisher instantly for locally-written events. Events written by other replicas get picked up within the poll interval. The actual coordination — preventing two replicas from delivering the same event — happens in claimPending() via an atomic ClaimEntryProcessor on the IMap.

The Publish Loop

public void publishPendingEntries() {
    if (sharedHazelcast == null) {
        if (!noSharedClusterWarningLogged) {
            logger.warn("No shared Hazelcast instance — outbox delivery skipped");
            noSharedClusterWarningLogged = true;
        }
        return;
    }

    List<OutboxEntry> claimed = outboxStore.claimPending(
            properties.getMaxBatchSize(), memberUuid);

    if (claimed.isEmpty()) {
        return;
    }

    for (OutboxEntry entry : claimed) {
        try {
            ITopic<GenericRecord> topic = sharedHazelcast.getTopic(entry.getEventType());
            topic.publish(entry.getEventRecord());
            outboxStore.markDelivered(entry.getEventId());
        } catch (Exception e) {
            if (entry.getRetryCount() + 1 >= properties.getMaxRetries()) {
                outboxStore.markFailed(entry.getEventId(),
                        "Max retries exceeded: " + e.getMessage());
            } else {
                outboxStore.incrementRetryCount(entry.getEventId(), e.getMessage());
            }
        }
    }
}

Note claimPending rather than pollPending. The claiming mechanism uses an EntryProcessor to atomically transition entries from PENDING to CLAIMED, tagging them with the claiming member’s UUID. This prevents two publisher instances from delivering the same event — important once you’re running multiple replicas.

When no shared cluster is configured (single-node dev mode), the publisher logs one warning and stops trying. Events pile up as PENDING in the outbox. They’ll drain as soon as a shared cluster appears.

Retry escalation is per-entry:

Attempt 1: fails → incrementRetryCount (retryCount=1)
Attempt 2: fails → incrementRetryCount (retryCount=2)
...
Attempt 5: fails → markFailed (retryCount=5 >= maxRetries=5)

Once marked FAILED, the entry stops showing up in claim results. The failure reason is preserved for debugging.

Scheduling

OutboxAutoConfiguration hooks the publisher into Spring’s task scheduler:

@EnableScheduling
public class OutboxAutoConfiguration implements SchedulingConfigurer {

    @Override
    public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
        taskRegistrar.addFixedDelayTask(() -> {
            outboxPublisher.waitForWork();       // blocks until signaled or timeout
            outboxPublisher.publishPendingEntries();
        }, 1);  // 1ms loop delay — actual timing controlled by semaphore
    }
}

The 1ms fixed delay means the loop restarts almost immediately after each cycle, but waitForWork() controls the actual pacing. The thread blocks on the semaphore until either a permit is released or the poll interval elapses. Near-instant delivery under normal load, guaranteed pickup if a signal is missed.

Integration with EventSourcingController

The controller’s republishToSharedCluster now checks for an outbox store first:

private void republishToSharedCluster(PendingCompletion<K> pending) {
    if (sharedHazelcast == null || pending.eventRecord == null || pending.eventType == null) {
        return;
    }
    if (outboxStore != null) {
        OutboxEntry entry = new OutboxEntry(
                pending.completionInfo.getEventId(),
                pending.eventType,
                pending.eventRecord
        );
        outboxStore.write(entry);
        if (outboxPublisher != null) {
            outboxPublisher.notifyNewEntry();
        }
    } else {
        // Legacy direct publish (when outbox is disabled)
        try {
            ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
            topic.publish(pending.eventRecord);
        } catch (Exception e) {
            logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
        }
    }
}

Fully backward compatible. When outboxStore is injected, events go through the durable path. When it’s null, you get the old fire-and-forget behavior. The OutboxStore is wired through each service’s config as an optional dependency:

@Bean
public EventSourcingController<Order, String, DomainEvent<Order, String>> orderController(
        HazelcastInstance hazelcastInstance,
        @Qualifier("hazelcastClient") HazelcastInstance hazelcastClient,
        @Autowired(required = false) OutboxStore outboxStore,
        ...) {
    return EventSourcingController.builder()
            .hazelcast(hazelcastInstance)
            .sharedHazelcast(hazelcastClient)
            .outboxStore(outboxStore)
            .build();
}

Delivery Guarantees

The outbox provides at-least-once delivery. If the publisher crashes after publishing to the ITopic but before calling markDelivered(), the next cycle picks up the same entry and delivers it again. Events are never lost as long as the embedded Hazelcast instance’s IMap data is intact.

At-least-once means consumers may see duplicates. That’s where the Idempotency Guard from Part 9 comes in — it deduplicates on the consumer side, complementing the outbox’s guaranteed delivery.

As for ordering: events for the same aggregate are written to the outbox in sequence order (the Jet pipeline processes them sequentially), and claimPending sorts by createdAt. But if two events are pending simultaneously and the first one fails while the second succeeds, they’ll arrive out of order. For our saga use case that’s acceptable — each step is identified by sagaId and eventType, and the saga state machine handles duplicates and out-of-order delivery.

Configuration

framework.outbox.*

Property	Default	Description
enabled	true	Master toggle for the outbox pattern
poll-interval	1000 (ms)	Fallback interval if signal is missed
max-batch-size	50	Maximum entries per poll cycle
max-retries	5	Delivery attempts before permanent failure
entry-ttl	24h	How long DELIVERED entries survive in the map

Metrics

Metric	Type	Description
outbox.entries.written	Counter	Events written to the outbox
outbox.entries.delivered	Counter	Events delivered to shared cluster
outbox.entries.failed	Counter	Events permanently failed
outbox.publish.duration	Timer	Time per publish cycle

To disable the outbox and use direct publishing:

framework:
  outbox:
    enabled: false

What’s Next

The outbox guarantees events reach the shared cluster. But what happens when they get there and the consumer can’t process them? The consumer might crash, the business logic might throw, the circuit breaker might be open.

In Part 9, we add two patterns that work together: a Dead Letter Queue that captures events that fail consumer-side processing, and an Idempotency Guard that prevents duplicate processing — the natural flip side of at-least-once delivery.

Next up: Dead Letter Queues and Idempotency

Previous: Circuit Breakers and Retry: Resilient Hazelcast Sagas

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 22, 2026

Circuit Breakers and Retry: Resilient Hazelcast Sagas

Part 7 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

A commercial airliner doesn’t fall out of the sky when an engine fails. It keeps flying. The remaining engine provides enough thrust to reach the nearest airport, the crew follows a well-rehearsed procedure, and the passengers — ideally — never know how close things got. Aviation engineers figured this out decades ago: you can’t prevent every failure, so you build the system to keep working when parts of it stop. (There’s even a great acronym for it — ETOPS, which officially stands for Extended Twin-engine Operations Performance Standards, but which pilots will tell you really means “Engines Turn Or Passengers Swim.”)

Microservices need the same philosophy. Not because individual services fail as dramatically as a jet engine, but because they fail far more often. A garbage collection pause. A network blip. A downstream provider having a bad day. A deployment rolling through the cluster at 2 AM. In a monolith, these are minor hiccups — the kind of thing you might not even notice in the logs. In a distributed system where five services coordinate through asynchronous events, a hiccup in one service can propagate to all five in the time it takes to brew a cup of coffee.

And the ways things go wrong are… creative. The catalog of distributed system failure modes is large enough to fill a textbook. Several textbooks, actually — and people have. Too many for a single pattern or a single blog post.

So we’re spending the next three posts on resilience. This one covers circuit breakers and retry — protecting saga listeners when downstream services misbehave. Part 8 tackles the transactional outbox pattern, which guarantees events aren’t lost between producer and consumer. And Part 9 adds dead letter queues and idempotency guards — the safety nets for events that fail permanently or arrive more than once. Three different failure modes, three different mechanisms.

Back in Part 4, we built a choreographed saga for order fulfillment. Three services — Inventory, Payment, and Order — coordinate through Hazelcast ITopic events published on a shared cluster. The happy path works beautifully. Without resilience patterns, though, a single struggling service can drag the whole saga down with it. A slow Payment Service fills up the Inventory Service’s thread pool with blocked calls. A transient network error permanently loses an event. A burst of failures overwhelms everything simultaneously.

That’s what we’re fixing.

The Problem: Cascading Failures

Here’s the order fulfillment saga on a good day:

Order fulfillment saga happy path — Inventory, Payment, and Order services exchanging OrderCreated, StockReserved, PaymentProcessed, and OrderConfirmed events over Hazelcast ITopic

Each step is an ITopic message on the shared Hazelcast cluster. Each listener calls a local service method — IMap operations, Jet pipeline processing, further ITopic publishing. Events flow, state updates, everyone’s happy.

Now imagine the Payment Service is having a rough morning. Some downstream payment provider is dragging, and every StockReserved event that arrives takes 30 seconds to process instead of the normal 50 milliseconds. Without any resilience mechanism, here’s what unfolds:

Inventory keeps publishing StockReserved events at the normal rate
Payment’s listener thread pool fills up with slow calls
New events queue behind the blocked threads
ITopic backpressure eventually slows the shared cluster itself
Other listeners on the same cluster — including Inventory and Order — start seeing delays
The entire saga grinds to a halt

One service had a problem. Now every service has a problem. This is a cascade failure, and it’s the defining hazard of distributed architectures. The shared communication fabric that makes coordination possible is the same fabric that propagates failure.

Enter Resilience4j

The patterns we need — circuit breakers, retry with backoff, bulkheads, rate limiters — have been well understood for years. Netflix popularized them in the Java world with Hystrix, which became the standard library for microservice resilience through most of the 2010s. But Netflix put Hystrix into maintenance mode in 2018 and eventually stopped development entirely.

The successor that emerged is Resilience4j. It’s a lightweight fault tolerance library for Java 8+ built around functional composition — you wrap a Supplier or Runnable with decorators, and the decorators handle the resilience logic. It’s not just a circuit breaker library, though that’s what most people know it for. It actually provides six core modules: circuit breaker, retry, bulkhead (resource isolation), rate limiter, time limiter, and cache. Each is standalone. You pick what you need and leave the rest on the shelf.

There are other options — Failsafe is a solid zero-dependency alternative, and Alibaba’s Sentinel targets high-traffic rate limiting scenarios. But Resilience4j has become the de facto choice for Spring Boot microservices. The Spring integration is mature, Micrometer metrics work out of the box, and @ConfigurationProperties binding means your resilience settings live in the same YAML as everything else. For our framework, we’re using two of the six modules: CircuitBreaker and Retry.

Circuit Breakers: Automatic Service Isolation

A circuit breaker does what it sounds like. It monitors the failure rate of an operation and automatically stops calling it when failures exceed a threshold — the same idea as the breaker panel in your house. Too much current flows through the circuit, the breaker trips, the wiring doesn’t catch fire. In our case, “too much current” means too many failed calls, and “the wiring” is every other service sharing that communication path.

Three States

Circuit breaker state machine — CLOSED trips to OPEN when the failure rate crosses the threshold, OPEN moves to HALF-OPEN after the wait duration, and HALF-OPEN returns to CLOSED on success or back to OPEN on failure

CLOSED is normal operation. All calls pass through, and the circuit breaker quietly records outcomes in a sliding window. OPEN means the breaker has tripped — all calls are immediately rejected with a CallNotPermittedException, and no load reaches the downstream service at all. HALF-OPEN is the recovery probe: a limited number of test calls pass through. If they succeed, the breaker returns to CLOSED. If they fail, back to OPEN. Rinse and repeat until the downstream service gets its act together.

The Framework’s ResilientServiceInvoker

Rather than sprinkling Resilience4j decorators at every call site, we centralized everything into ResilientServiceInvoker:

public class ResilientServiceInvoker implements ResilientOperations {

    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RetryRegistry retryRegistry;
    private final ResilienceProperties properties;

    public <T> T execute(final String name, final Supplier<T> operation) {
        if (!properties.isEnabled()) {
            return operation.get();
        }

        final CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(name);
        final Retry retry = retryRegistry.retry(name);

        final Supplier<T> decoratedSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
                Retry.decorateSupplier(retry, operation));

        try {
            return decoratedSupplier.get();
        } catch (CallNotPermittedException e) {
            logger.warn("Circuit breaker '{}' is OPEN — rejecting call", name);
            throw new ResilienceException(
                    "Circuit breaker '" + name + "' is open, call rejected", name, e);
        } catch (Exception e) {
            logger.error("Operation '{}' failed after retries: {}", name, e.getMessage());
            throw new ResilienceException(
                    "Operation '" + name + "' failed after retries", name, e);
        }
    }
}

A few things to notice here. Each call to execute(“inventory-stock-reservation”, …) creates or retrieves a circuit breaker and retry instance with that name. This means each saga step gets its own independent circuit breaker — a payment failure won’t trip the inventory breaker.

The decoration order matters: retry wraps the operation first, then the circuit breaker wraps the retry. So the circuit breaker sees the final outcome after all retries are exhausted. A transient failure that succeeds on the second attempt counts as a success for the circuit breaker. If you stacked them the other way around, every individual failed attempt would register as a circuit breaker failure, and you’d trip the breaker much faster than you intended.

And there’s a kill switch. When framework.resilience.enabled=false, the execute method just calls the operation directly. Zero overhead. This matters for testing and for environments where resilience is handled at a different layer — a service mesh, maybe, or a cloud provider’s load balancer.

The ResilientOperations Interface

We extract an interface from the concrete class:

public interface ResilientOperations {
    <T> T execute(String name, Supplier<T> operation);
    void executeRunnable(String name, Runnable operation);
    <T> CompletableFuture<T> executeAsync(String name, Supplier<CompletableFuture<T>> operation);
}

This is the same workaround we used for ServiceClientOperations in Part 6. Java 25’s Mockito inline mock maker can’t mock concrete classes in certain JVM configurations, so you extract an interface and mock that instead. Not the most glamorous reason to create an abstraction, but it works.

Three Flavors

The invoker supports three calling patterns:

// Synchronous — returns a value
String result = invoker.execute("orderSaga", () -> processEvent(event));

// Fire-and-forget — void operation
invoker.executeRunnable("paymentListener", () -> publishToTopic(event));

// Async — returns CompletableFuture
CompletableFuture<Product> future = invoker.executeAsync("inventory-stock-reservation",
        () -> inventoryService.reserveStockForSaga(productId, quantity, ...));

The async variant is the one our saga listeners actually use — inventory, payment, and order service calls all return CompletableFuture.

Wiring into the Saga Listeners

The saga listeners from Part 4 now inject ResilientOperations as an optional dependency:

@Component
public class InventorySagaListener {

    private final ProductService inventoryService;
    private final HazelcastInstance hazelcast;
    private ResilientOperations resilientServiceInvoker;

    @Autowired(required = false)
    public void setResilientOperations(ResilientOperations resilientServiceInvoker) {
        this.resilientServiceInvoker = resilientServiceInvoker;
    }

That @Autowired(required = false) is doing important work. If resilience is disabled — or if the Resilience4j dependency isn’t even on the classpath — the listener still functions. It just calls the service directly, no wrapping. The saga worked before we added resilience; it should keep working without it.

Each listener has a helper that handles the null check:

private <T> CompletableFuture<T> executeWithResilience(
        final String name, final Supplier<CompletableFuture<T>> operation) {
    if (resilientServiceInvoker != null) {
        return resilientServiceInvoker.executeAsync(name, operation);
    }
    return operation.get();
}

And the actual saga step looks like this:

executeWithResilience("inventory-stock-reservation",
        () -> inventoryService.reserveStockForSaga(
                productId, quantity, orderId, sagaId, correlationId,
                customerId, total, currency, "CREDIT_CARD"
        )
).whenComplete((product, error) -> {
    if (error != null) {
        sendToDeadLetterQueue(record, "OrderCreated", error);
    } else {
        logger.info("Stock reserved for saga: productId={}, quantity={}, orderId={}, sagaId={}",
                productId, quantity, orderId, sagaId);
    }
});

The circuit breaker name inventory-stock-reservation is specific to this saga step. Each step across the three services gets its own name and its own circuit breaker:

Circuit Breaker Name	Saga Step	Service
inventory-stock-reservation	Reserve stock on OrderCreated	Inventory
inventory-stock-release	Release stock on compensation	Inventory
payment-processing	Process payment on StockReserved	Payment
payment-refund	Refund payment on compensation	Payment
order-confirmation	Confirm order on PaymentProcessed	Order
order-cancellation	Cancel order on compensation	Order

Six independent circuit breakers. If payment processing is struggling, the inventory breakers stay closed and keep doing their job.

Retry with Exponential Backoff

Transient failures — network blips, temporary overload, brief GC pauses — are the most common failure mode in distributed systems. Most of them resolve on their own within seconds. Retry is the first line of defense.

The Thundering Herd

But naive retry — retry immediately, same interval, keep hammering — can make things actively worse. Picture this: a service buckles under load, and 100 clients all get errors simultaneously. They all retry at 500ms. The service sees a spike of 100 simultaneous requests. It fails again. They all retry at 1000ms. Another spike. Same result.

This is the thundering herd problem. Everyone backs off at the same fixed interval, and everyone comes stampeding back at the same moment. The retry mechanism that was supposed to help is the thing keeping the service down.

Exponential backoff breaks the herd apart:

Attempt 1: immediate
Attempt 2: wait 500ms
Attempt 3: wait 1000ms  (500ms × 2.0)
Attempt 4: wait 2000ms  (1000ms × 2.0)

The growing intervals give the struggling service breathing room. And because different callers started their retry sequences at slightly different moments, the backoff naturally staggers the waves. Each one arrives smaller and more spread out than the last. The herd thins itself out.

Configuration

The framework exposes all of this through ResilienceProperties:

framework:
  resilience:
    enabled: true
    retry:
      max-attempts: 3
      wait-duration: 500ms
      enable-exponential-backoff: true
      exponential-backoff-multiplier: 2.0

The auto-configuration translates these into a Resilience4j RetryConfig:

@Bean
@ConditionalOnMissingBean
public RetryRegistry retryRegistry(final ResilienceProperties properties) {
    final ResilienceProperties.RetryProperties retryProps = properties.getRetry();

    final RetryConfig.Builder<?> builder = RetryConfig.custom()
            .maxAttempts(retryProps.getMaxAttempts())
            .retryOnException(e -> !(e instanceof NonRetryableException));

    if (retryProps.isEnableExponentialBackoff()) {
        builder.intervalFunction(IntervalFunction
                .ofExponentialBackoff(
                        retryProps.getWaitDuration(),
                        retryProps.getExponentialBackoffMultiplier()));
    } else {
        builder.waitDuration(retryProps.getWaitDuration());
    }

    return RetryRegistry.of(builder.build());
}

Two things to note. The retryOnException predicate excludes NonRetryableException — we’ll get to that in a moment. And when enable-exponential-backoff is false, it falls back to a fixed interval between attempts.

NonRetryableException: When to Stop Trying

Not every failure is transient. “Payment declined” will never succeed on retry — the credit card is invalid. “Insufficient stock” is deterministic — the warehouse genuinely doesn’t have the product. Retrying these wastes time, wastes resources, and — if the circuit breaker is counting — burns through your failure budget for no reason.

The framework defines a marker interface:

public interface NonRetryableException {
    // Marker interface — business exceptions implement this to skip retry
}

Service exceptions opt in:

public class InsufficientStockException extends RuntimeException
        implements NonRetryableException {
    public InsufficientStockException(String message) {
        super(message);
    }
}

public class PaymentDeclinedException extends RuntimeException
        implements NonRetryableException {
    public PaymentDeclinedException(String message) {
        super(message);
    }
}

Why a marker interface instead of a base class? Because these exceptions already extend RuntimeException. Java doesn’t have multiple inheritance, but it does have multiple interfaces. The marker lets any exception opt out of retry without changing its class hierarchy.

The retry configuration’s predicate is one line:

.retryOnException(e -> !(e instanceof NonRetryableException))

When retry encounters one of these, it fails immediately. No backoff, no additional attempts. But the circuit breaker still records it as a failure — it still counts toward the failure rate threshold. This is the right behavior. If a service is returning “payment declined” for every single request, something is systematically wrong, and the circuit breaker should trip.

Retry Observability

Resilience4j publishes events for every retry attempt, and the framework hooks into them for structured logging and a custom metric:

public class RetryEventListener {

    public RetryEventListener(final RetryRegistry retryRegistry,
                              final MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        retryRegistry.getAllRetries().forEach(this::registerListeners);
        retryRegistry.getEventPublisher().onEntryAdded(
                event -> registerListeners(event.getAddedEntry()));
    }

    private void registerListeners(final Retry retry) {
        final var eventPublisher = retry.getEventPublisher();
        eventPublisher.onRetry(this::onRetry);
        eventPublisher.onSuccess(this::onSuccess);
        eventPublisher.onError(this::onError);
        eventPublisher.onIgnoredError(this::onIgnoredError);
    }
}

Four event types give you the full picture:

Event	Log Level	What happened
onRetry	WARN	An attempt failed, trying again
onSuccess	INFO	Eventually succeeded
onError	ERROR	All retries exhausted
onIgnoredError	INFO	Non-retryable, skipped retry

That last one — onIgnoredError — needed a custom Micrometer counter because Resilience4j’s built-in TaggedRetryMetrics doesn’t track ignored errors:

private void onIgnoredError(final RetryOnIgnoredErrorEvent event) {
    logger.info("Non-retryable exception for '{}', skipping retry: {}",
            event.getName(), event.getLastThrowable().getMessage());

    Counter.builder("framework.resilience.retry.ignored")
            .description("Count of non-retryable exceptions that skipped retry")
            .tag("name", event.getName())
            .register(meterRegistry)
            .increment();
}

In practice, the logs tell you a clear story. A transient failure that recovers:

WARN  RetryEventListener - Retry attempt #1 for 'payment-processing': Connection refused
WARN  RetryEventListener - Retry attempt #2 for 'payment-processing': Connection refused
INFO  RetryEventListener - 'payment-processing' succeeded after 2 attempt(s)

A business exception that gets kicked straight to the dead letter queue:

INFO  RetryEventListener - Non-retryable exception for 'payment-processing',
      skipping retry: Insufficient funds for amount 15000.00

The ResilienceException Wrapper

When an operation exhausts all retries or gets rejected by an open circuit breaker, the framework wraps the failure in a ResilienceException:

public class ResilienceException extends RuntimeException {

    private final String operationName;

    public ResilienceException(String message, String operationName, Throwable cause) {
        super(message, cause);
        this.operationName = operationName;
    }
}

The operationName field tells downstream handlers which circuit breaker failed. The dead letter queue integration (Part 9) uses this to classify failures:

if (error instanceof ResilienceException) {
    logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
} else {
    logger.error("Failed to process event: {}", eventId, error);
}

Auto-Configuration

The whole resilience stack is wired through a single auto-configuration class:

@Configuration
@ConditionalOnClass(CircuitBreakerRegistry.class)
@ConditionalOnProperty(name = "framework.resilience.enabled", matchIfMissing = true)
@EnableConfigurationProperties(ResilienceProperties.class)
public class ResilienceAutoConfiguration {

    @Bean @ConditionalOnMissingBean
    public CircuitBreakerRegistry circuitBreakerRegistry(ResilienceProperties properties) { ... }

    @Bean @ConditionalOnMissingBean
    public RetryRegistry retryRegistry(ResilienceProperties properties) { ... }

    @Bean @ConditionalOnMissingBean
    public ResilientServiceInvoker resilientServiceInvoker(...) { ... }

    @Bean @ConditionalOnMissingBean(TaggedCircuitBreakerMetrics.class)
    public TaggedCircuitBreakerMetrics taggedCircuitBreakerMetrics(...) { ... }

    @Bean @ConditionalOnMissingBean(TaggedRetryMetrics.class)
    public TaggedRetryMetrics taggedRetryMetrics(...) { ... }

    @Bean @ConditionalOnMissingBean
    public RetryEventListener retryEventListener(...) { ... }
}

Three conditionals control activation. @ConditionalOnClass(CircuitBreakerRegistry.class) means the whole thing only activates when Resilience4j is on the classpath — services that don’t include the dependency don’t get any resilience beans. @ConditionalOnProperty(…, matchIfMissing = true) means it’s enabled by default; set framework.resilience.enabled=false to turn it off. And every individual bean is @ConditionalOnMissingBean, so the application can override any piece by defining its own bean.

Six beans total:

CircuitBreakerRegistry — circuit breaker instances, configured from properties
RetryRegistry — retry instances with optional exponential backoff
ResilientServiceInvoker — the decorator that wraps operations
TaggedCircuitBreakerMetrics — binds circuit breaker metrics to Micrometer
TaggedRetryMetrics — binds retry metrics to Micrometer
RetryEventListener — structured logging and the custom ignored-error counter

Per-Instance Tuning

Different saga steps have different tolerance for failure. Stock reservation should be fast and reliable — if it’s failing, something is seriously wrong, and we want the circuit to trip quickly. Payment processing, on the other hand… payment providers are notoriously flaky. You’d rather tolerate a higher failure rate and give the provider more time to sort itself out before you start rejecting everything.

The framework supports per-instance overrides in each service’s application.yml:

framework:
  resilience:
    enabled: true
    circuit-breaker:
      failure-rate-threshold: 50
      wait-duration-in-open-state: 10s
      sliding-window-size: 10
      minimum-number-of-calls: 5
      permitted-number-of-calls-in-half-open-state: 3
    retry:
      max-attempts: 3
      wait-duration: 500ms
      enable-exponential-backoff: true
      exponential-backoff-multiplier: 2.0
    instances:
      inventory-stock-reservation:
        circuit-breaker:
          failure-rate-threshold: 40
          wait-duration-in-open-state: 5s
        retry:
          max-attempts: 2
      payment-processing:
        circuit-breaker:
          failure-rate-threshold: 60
          wait-duration-in-open-state: 15s
        retry:
          max-attempts: 5
          wait-duration: 1s

The instances map lets any named circuit breaker override the defaults:

public CircuitBreakerProperties getCircuitBreakerForInstance(final String name) {
    final InstanceProperties instance = instances.get(name);
    if (instance != null && instance.getCircuitBreaker() != null) {
        return instance.getCircuitBreaker();
    }
    return circuitBreaker; // Fall back to defaults
}

So in this configuration, inventory-stock-reservation trips at 40% failure rate with a 5-second open state and only 2 retry attempts — stock checks are idempotent and fast, no point dragging things out. payment-processing tolerates 60% failure rate with a 15-second open state and 5 retries starting at 1-second intervals. With exponential backoff, that last attempt waits about 16 seconds. Payment providers get the patience they’ve trained us to give them.

Metrics and Monitoring

The auto-configuration binds circuit breaker and retry metrics to Micrometer, which exports to Prometheus for Grafana dashboards:

Circuit Breaker Metrics

Metric	Type	Description
resilience4j_circuitbreaker_state	Gauge	Current state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)
resilience4j_circuitbreaker_calls_total	Counter	Total calls by outcome (successful, failed, not_permitted)
resilience4j_circuitbreaker_failure_rate	Gauge	Current failure rate percentage
resilience4j_circuitbreaker_buffered_calls	Gauge	Calls in sliding window

Retry Metrics

Metric	Type	Description
resilience4j_retry_calls_total	Counter	Total calls by outcome (successful_without_retry, successful_with_retry, failed_with_retry, failed_without_retry)
framework.resilience.retry.ignored	Counter	Non-retryable exceptions (tagged by name)

These feed into Grafana panels for saga health — circuit breaker state timeline showing when breakers trip and recover, retry rate over time where a spike tells you something transient is happening, failure rate broken out by saga step so you can see which one is misbehaving, and the non-retryable exception count that separates business logic failures from infrastructure problems.

Configuration Reference

framework.resilience.*

Property	Default	Description
enabled	true	Master toggle for all resilience features
circuit-breaker.failure-rate-threshold	50	Failure rate (%) to trip the breaker
circuit-breaker.wait-duration-in-open-state	10s	How long to stay open before testing
circuit-breaker.sliding-window-size	10	Number of calls in the measurement window
circuit-breaker.sliding-window-type	COUNT_BASED	COUNT_BASED or TIME_BASED
circuit-breaker.minimum-number-of-calls	5	Minimum calls before evaluating failure rate
circuit-breaker.permitted-number-of-calls-in-half-open-state	3	Test calls in half-open state
retry.max-attempts	3	Maximum retry attempts (including initial)
retry.wait-duration	500ms	Base wait between retries
retry.enable-exponential-backoff	true	Use exponential backoff
retry.exponential-backoff-multiplier	2.0	Backoff multiplier
instances.<name>.circuit-breaker.*	(defaults)	Per-instance circuit breaker overrides
instances.<name>.retry.*	(defaults)	Per-instance retry overrides

What’s Next

Circuit breakers and retry handle one category of failure: transient problems during event consumption. The saga listener tries, the call fails, the retry policy kicks in, the circuit breaker keeps the damage from spreading. That covers the consumer side.

But what about the producer side? When EventSourcingController needs to republish an event to the shared cluster and the cluster is temporarily unreachable, the event just… vanishes. No retry. No circuit breaker. Gone.

That’s a different failure mode, and it needs a different mechanism. In Part 8, we add the transactional outbox pattern — a durable buffer between event production and cross-cluster delivery that guarantees no events are lost, even when the shared cluster is down. Then Part 9 closes the loop with dead letter queues and idempotency guards for events that exhaust all retries or arrive more than once.

Next up: The Transactional Outbox Pattern with Hazelcast

Previous: MCP Server for Microservices: AI-Powered Debugging

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 15, 2026

On Debugging by Assumption

A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” — Mark Twain

Measurement is better than guessing. Who knew?

When the saga implementation was first finished and we ran through the test scenarios for the first time, there was a high incidence of saga timeouts if we ran for more than a few minutes. (5 minutes was great; 30 minutes was ugly.)

I didn’t ask Claude to investigate, or do any analysis of my own, because I had a pretty good suspicion what was going on. Everything was running on a single laptop — 16GB of memory split across a 3-node Hazelcast cluster, 4 services each running an embedded Hazelcast node, Docker itself, and I really hadn’t bothered to shut off my normal desktop workload. Web browser, email, whatever else. I figured I’d maxed the poor thing out and probably wouldn’t get a clean timeout-free run until I deployed to a multi-node cluster in the cloud.

That didn’t happen for some time. I’m cheap, and I wasn’t going to pay for cloud resources until I had a full-blown demo ready to go. When the day came — much later in the story if I was telling it chronologically — I saw the same pattern of timeouts. Turns out, it was never thread starvation or lack of resources. It was a combination of things, and none of them were what I’d assumed.

When faced with this reality, I asked Claude to troubleshoot the issue, and this is one of the times I was most impressed with how Claude approached a problem compared to how I would have.

In most debugging scenarios, I look only until I find the first reasonable suspect. Why keep looking if you’ve already found what you’re looking for? Fix, rebuild, retest, and on a good day, that’s the end of it. On a bad day, you’re still looking at the same issue, so you start hunting for suspect #2. Lather, rinse, repeat.

Claude came back with four identified problems. The main one was subtle: we generated product data up front and gave each item a reasonable starting stock. As the demo ran, orders depleted the stock, and eventually the inventory service started throwing InsufficientStockException — correct behavior, you can’t sell what you don’t have. But the circuit breaker we’d added for resilience was treating that business error the same as an infrastructure failure. Enough “failures” in the sliding window and the circuit breaker tripped open, rejecting all orders — including ones for products that still had stock. Sagas piled up with nowhere to go, the timeout detector found hundreds of them every cycle, and the system drowned in compensation events. At the peak: 64,000 timeouts from 53,000 sagas started.

The other three fixes addressed related gaps. Business failures like out-of-stock now trigger immediate saga compensation instead of waiting for the timeout detector to notice. A NonRetryableException marker interface tells the circuit breaker not to count deterministic business errors against the failure rate. And an automatic stock replenishment monitor keeps the demo in a steady state where orders can actually succeed for hours instead of wedging after the first few minutes.

I should have investigated the saga timeouts when they first appeared, rather than assuming the problem would magically go away with more hardware. And when I did get around to investigating, Claude’s approach of identifying all the contributing problems at once was considerably more effective than my usual one-suspect-at-a-time strategy.

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 1, 2026

Saga Pattern: Distributed Transactions Without 2PC

Part 4 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

In the first three articles, we built an event sourcing framework with a Jet pipeline and materialized views. Each service is self-contained — it owns its events, its views, and its data. And one of the things that makes event-driven architecture powerful is that event publishing is fire-and-forget. The publisher doesn’t know or care who’s receiving the messages, or whether anyone acted on them properly.

But someone cares. Probably your boss.

That decoupling is a feature, not a bug — it’s exactly what gives you the independence to evolve services separately. If you wanted to add a loyalty program based on orders placed, you wouldn’t touch a single line of existing code. You’d build the new functionality and subscribe it to OrderCreated events. Done. No coordination, no release trains, no “can you add a field to your response for us.”

But that same independence becomes a problem when a business operation requires coordination. Placing an order in our eCommerce system spans four services:

Order Service creates the order
Inventory Service reserves stock
Payment Service charges the customer
Order Service confirms the order

If payment fails after stock is reserved, we need to release the stock. If the inventory service is down, we need to cancel the order. The fire-and-forget publisher doesn’t know any of this happened — and nobody else is keeping track unless we build something that does.

Now — a real eCommerce system wouldn’t cancel an order just because the inventory service hiccupped. You’d create a backorder, or queue the reservation for retry, or do whatever it takes to keep the customer’s money. Nobody in the business is going to say “yeah, let’s give up on that sale because a container restarted.” But we’re building a framework to demonstrate microservice patterns, not to compete with Shopify. The cancel-and-compensate flow shows sagas at their most illustrative: forward steps, failure detection, compensation, and state tracking across services. That’s the machinery we want to examine. So we went with the version that best shows how the pattern works, not the version that best sells laptops.

This is the distributed transaction problem, and sagas are the solution.

Why Sagas? Because the Alternative Is Worse.

We didn’t adopt event sourcing because it was fashionable. We adopted it because the alternative — mutable state scattered across services with no audit trail — was worse. Sagas are the same kind of choice.

When a business operation touches one database, you wrap it in a transaction. ACID gives you atomicity: either all the changes commit or none of them do. Every developer learns this in their first database course, and it works.

But our order fulfillment touches four services, each with its own data. A single database transaction can’t span them. So you reach for two-phase commit — 2PC — the textbook answer for distributed transactions. And that’s where the trouble starts.

2PC works by appointing a coordinator that asks each participant “can you commit?” and then, once everyone says yes, tells them all to go ahead. The problem is what happens between those two phases. Every participant is holding locks — on inventory rows, on payment records, on order state — while waiting for the coordinator’s decision. In a monolith, that wait is microseconds. Across a network, it’s milliseconds at best, and if the coordinator is slow or the network partitions, those locks can be held for seconds. Or longer.

Think about what that means for throughput. While one order is sitting in that limbo state between “prepare” and “commit,” no other transaction can touch those same inventory rows. At 50 orders per second, you’ve got 50 sets of distributed locks competing for the same resources, each one waiting on network round-trips to a coordinator that is itself a single point of failure. If the coordinator crashes mid-protocol, those locks stay held until someone intervenes manually. The participants are stuck — they’ve promised to commit but haven’t been told to, and they can’t unilaterally release the locks without risking inconsistency.

It’s not that 2PC is theoretically wrong. It works fine in environments where all participants are on the same local network, latency is sub-millisecond, and the coordinator never fails. That environment is called “a single database server.” Once you’ve distributed beyond that, you’re fighting physics.

Sagas take a fundamentally different approach. Instead of one big atomic transaction with distributed locks, you execute a sequence of local transactions — each one fast, each one touching only one service’s data, each one committing immediately. If step 3 fails, you don’t roll back steps 1 and 2 by releasing locks. You compensate — you issue new transactions that undo the business effect of the earlier steps. Reserve stock, then charge payment, then… payment fails? Issue a stock release. The saga doesn’t pretend the work never happened. It acknowledges what happened and fixes it forward.

The trade-off is that you give up the clean atomicity of “all or nothing” and accept eventual consistency — a window of time where stock is reserved but payment hasn’t been attempted yet, or payment has been charged but the order isn’t confirmed. In our system, that window is typically under a second. For the user, the order is “processing.” For the system, the saga is in flight. And if something fails, compensation events fire and the system converges to a consistent state — just not instantaneously.

Choreography and Orchestration

There are two ways to coordinate a saga. They represent genuinely different architectural philosophies, and neither is universally right.

Choreography: Services React to Events

In a choreographed saga, there’s no coordinator. Each service listens for events that concern it and reacts:

Order Service publishes OrderCreated -->
  Inventory Service hears it, reserves stock, publishes StockReserved -->
    Payment Service hears it, charges customer, publishes PaymentProcessed -->
      Order Service hears it, confirms order

The flow emerges from the interactions between services. No single component knows the full sequence. Each service knows only “when I see event X, I do Y and publish Z.”

Orchestration: A Central Controller Directs the Flow

In an orchestrated saga, one component — the orchestrator — knows the whole sequence and directs each step:

Orchestrator --> "Order Service, create order"
Orchestrator --> "Inventory Service, reserve stock"
Orchestrator --> "Payment Service, charge customer"
Orchestrator --> "Order Service, confirm order"

The orchestrator is a state machine. It tracks where the saga is, what comes next, and what to undo if something fails.

Choosing Between Them

Choreography is the simpler choice when your saga is a linear chain — A triggers B triggers C triggers D — and the services are already publishing events for other reasons. If your system is event-driven (ours is), choreographed sagas are almost free. The services are already emitting events. The saga is just another consumer. No new infrastructure, no single point of failure, and each service stays fully independent.

But choreography gets painful as the flow gets complex. Say step 2 needs to branch — charge a credit card or apply store credit, depending on the payment method. Now you’re encoding conditional logic across multiple listeners that can’t see each other. If someone asks “what does this saga actually do, end to end?” the answer is “go read three listener classes in three different services and piece it together.” For a four-step linear chain, fine. For a ten-step flow with branches and conditional skips, that’s a mess.

That’s where orchestration earns its keep. The entire saga — forward steps, compensation steps, timeouts, retry logic — lives in a single definition file. You read it top to bottom. You can add per-step timeouts, per-step retries. The caller can wait for the result synchronously. The cost is coupling: the orchestrator needs to know about every service’s endpoint, and it becomes a component you have to keep running.

A rough rule of thumb:

Start with choreography when the saga is a linear chain with fewer than about five steps, the services already publish events, and you don’t need a synchronous response. Move to orchestration when you need branching or conditional logic, per-step timeout and retry control, a synchronous success/failure response for the caller, or when the saga has gotten complex enough that nobody can trace it across listeners without a whiteboard.

We started with choreography for order fulfillment because it’s a clean four-step chain and our entire architecture is already built around events. Later, when we needed synchronous responses for certain API paths and wanted per-step timeout control, we added orchestration as a second option. Both patterns coexist in the framework, running against the same saga state store. We’ll cover the orchestration implementation and how we run both side by side in a later article.

The Order Fulfillment Saga

Here’s the full flow for the choreographed version:

Order Fulfillment Saga happy path — four services emitting OrderCreated, StockReserved, PaymentProcessed, OrderConfirmed in sequence, with saga state transitioning from STARTED through IN_PROGRESS to COMPLETED

Event Context Propagation

Every saga event carries metadata that links the chain together:

// These fields are present on every saga event
String sagaId;         // Links all events in one saga instance
String correlationId;  // Links to the original API request
String sagaType;       // "OrderFulfillment"
int stepNumber;        // Which step (0, 1, 2, 3)
boolean isCompensating; // True for compensation events

The sagaId is generated when the order is created and propagated through every subsequent event. That’s how the Inventory Service knows which saga a StockReserved event belongs to, and how the Payment Service gets the payment details — amount, currency, method — from the StockReserved event’s context fields.

Implementing Saga Listeners

Each service has a saga listener that subscribes to relevant events via Hazelcast ITopic. Here’s the Payment Service’s:

@Component
public class PaymentSagaListener {

    public PaymentSagaListener(
            @Qualifier("hazelcastClient") HazelcastInstance hazelcast,
            PaymentService paymentService,
            SagaStateStore sagaStateStore) {

        // Listen for StockReserved --> process payment
        ITopic<GenericRecord> stockReservedTopic = hazelcast.getTopic("StockReserved");
        stockReservedTopic.addMessageListener(message -> {
            GenericRecord event = message.getMessageObject();
            String sagaId = event.getString("sagaId");
            String orderId = event.getString("orderId");
            String amount = event.getString("paymentAmount");
            String currency = event.getString("paymentCurrency");
            String method = event.getString("paymentMethod");

            paymentService.processPaymentForOrder(
                orderId, customerId, amount, currency, method,
                sagaId, event.getString("correlationId")
            );
        });

        // Listen for PaymentRefundRequested --> process refund
        ITopic<GenericRecord> refundTopic = hazelcast.getTopic("PaymentRefundRequested");
        refundTopic.addMessageListener(message -> {
            GenericRecord event = message.getMessageObject();
            String paymentId = event.getString("paymentId");
            String sagaId = event.getString("sagaId");

            paymentService.refundPaymentForSaga(
                paymentId, "Saga compensation", sagaId,
                event.getString("correlationId")
            );
        });
    }
}

The listener uses @Qualifier(“hazelcastClient”) — it connects to the shared cluster, not the embedded instance. That’s the dual-instance architecture from Part 2. Each listener is a plain @Component that Spring creates on startup; the ITopic subscription stays active for the life of the service. And the listener itself doesn’t contain business logic — it unpacks the event, delegates to the service layer, and gets out of the way.

Compensation: Undoing Work

When a step fails, we need to undo the work completed by previous steps. This is compensation — often described as the saga equivalent of a rollback, though it isn’t really a rollback at all. More on that in a minute.

Compensation Flow: Payment Fails

Compensation Flow: Stock Unavailable

StockReservationFailed event
         |
         '---> Order Service
               '-- Cancels order (status: CANCELLED)
               '-- No stock release needed (nothing was reserved)

         Saga finalized as COMPENSATED

The Compensation Registry

A CompensationRegistry maps each forward event to its compensating event and responsible service:

@Configuration
public class ECommerceCompensationConfig {

    @Bean
    public CompensationRegistrar ecommerceCompensations(CompensationRegistry registry) {
        // Step 0: OrderCreated --> compensate with OrderCancelled
        registry.register("OrderCreated", "OrderCancelled", "order-service");

        // Step 1: StockReserved --> compensate with StockReleased
        registry.register("StockReserved", "StockReleased", "inventory-service");

        // Step 2: PaymentProcessed --> compensate with PaymentRefunded
        registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");

        // Step 3: OrderConfirmed --> no compensation (terminal success)
        return () -> {};
    }
}

This is the one place that documents the entire saga structure. Even though the execution is distributed across listeners, the registry makes the step-to-compensation mapping readable in a single file. That’s something people miss about choreography — it doesn’t mean the structure is undocumented, it means the structure is declared separately from the execution.

Saga State Tracking

The SagaState class is an immutable state machine that tracks each saga instance:

public class SagaState implements Serializable {

    private final String sagaId;
    private final String sagaType;         // "OrderFulfillment"
    private final String correlationId;
    private final SagaStatus status;       // STARTED --> IN_PROGRESS --> COMPLETED
    private final List<SagaStepRecord> steps;
    private final Instant startedAt;
    private final Instant deadline;        // Absolute timeout
}

Status transitions:

STARTED --> IN_PROGRESS --> COMPLETED     (happy path)
STARTED --> IN_PROGRESS --> COMPENSATING --> COMPENSATED  (failure + recovery)
STARTED --> IN_PROGRESS --> TIMED_OUT --> COMPENSATING --> COMPENSATED  (timeout)
STARTED --> IN_PROGRESS --> FAILED        (unrecoverable)

The state lives in a HazelcastSagaStateStore backed by an IMap on the shared cluster. Every service can read and update saga state because they all connect to the same shared cluster via the client instance.

Each step gets recorded:

public class SagaStepRecord implements Serializable {

    private final int stepNumber;
    private final String eventType;
    private final StepStatus status;    // PENDING, COMPLETED, FAILED, COMPENSATED
    private final Instant completedAt;
    private final String failureReason;
}

Timeout Handling

Sagas can get stuck. A service might be down, a message might be lost, or a listener might throw an unhandled exception. Without timeout detection, a stuck saga hangs forever — stock reserved but never charged, or charged but never confirmed.

The SagaTimeoutDetector is a scheduled service that runs in each service instance:

@Component
public class SagaTimeoutDetector {

    @Scheduled(fixedDelayString = "${saga.timeout.check-interval:5000}")
    public void detectTimeouts() {
        List<SagaState> timedOut = sagaStateStore.findTimedOutSagas(
            maxBatchSize
        );

        for (SagaState saga : timedOut) {
            sagaStateStore.markTimedOut(saga.getSagaId());

            if (autoCompensate) {
                compensator.compensate(saga);
            }

            applicationEventPublisher.publishEvent(
                new SagaTimedOutEvent(saga)
            );

            sagaMetrics.recordSagaTimedOut(saga.getSagaType());
        }
    }
}

Every five seconds, it looks for sagas that have blown past their deadline. When it finds one, it marks it timed out, triggers compensation for whatever steps already completed, publishes a Spring event for logging and alerting, and records the metric.

Timeout behavior is configurable per service:

saga:
  timeout:
    enabled: true
    check-interval: 5000          # Check every 5 seconds
    default-deadline: 30000       # 30-second default timeout
    auto-compensate: true         # Auto-trigger compensation
    max-batch-size: 100           # Process up to 100 timeouts per check
    saga-types:
      OrderFulfillment: 60000     # 60 seconds for order fulfillment

The Order Fulfillment saga gets 60 seconds — longer than the 30-second default — because it spans four services and includes payment processing, which can be slow. Choosing the right timeout value is harder than it sounds, and we have quite a bit more to say about that in the next article.

Monitoring Sagas

The SagaMetrics class exposes counters and timers to Prometheus:

Metric	What It Tells You
saga.started	How many sagas are being created
saga.completed	How many succeed end-to-end
saga.compensated	How many required rollback
saga.timedout	How many exceeded their deadline
saga.duration (p95)	How long sagas are taking
saga.compensation.duration (p95)	How long compensation takes

The pre-provisioned Saga Dashboard in Grafana visualizes active saga counts, throughput rates by saga type, duration percentiles, success rates, timeout detection rates, and compensation breakdowns. Pre-configured alerts fire on high failure rates, timeouts, compensation failures, and success rate drops below 90%.

The dashboard is useful. But the metric that actually saved us during a sustained load incident was dead simple: saga.timedout plotted alongside saga.completed. When timeouts exceeded completions, we knew we had a systemic problem, not just a few slow sagas. More on that story next time.

Design Decisions and Trade-offs

Eventual Consistency

Sagas provide eventual consistency, not immediate consistency. Between the time stock is reserved and payment is processed, the system is in a “pending” state. That’s intentional — the trade-off buys us service independence and availability. For our use case, the consistency window is sub-second. For domains where that’s not acceptable (financial settlement, medical records), you’d need stronger guarantees.

Idempotency

Saga listeners must be idempotent. ITopic delivery is at-most-once in Hazelcast, but the timeout detector might trigger compensation for a saga that was actually processing — just slowly. If the original flow completes after compensation starts, the system has to handle the overlap gracefully.

SagaState handles this with updateOrAddStep, which replaces by step number. A step can’t be recorded twice.

Compensation Is Not Rollback

This is worth being explicit about, because the terminology invites confusion.

Compensation undoes the business effect, not the technical state. When we “refund a payment,” we don’t delete the payment record. We create a new PaymentRefundedEvent that changes the payment status to REFUNDED. The event history preserves the full story: payment was processed, then refunded. If an auditor comes asking, the trail is complete.

This fits naturally with event sourcing. The compensation is just another event. Nothing is erased, nothing is pretended away. The system records what actually happened, including the part where something went wrong and was corrected.

Summary

The saga pattern solves distributed transaction coordination without distributed locks:

Choreographed sagas fit naturally with event sourcing — each service reacts to events independently, no coordinator required. Orchestrated sagas earn their place when flows grow complex, need synchronous responses, or require per-step control. Compensation provides rollback-like semantics through new events, preserving the full history. Timeout detection catches stuck sagas. Saga state tracking gives you a complete audit trail.

The result is a system where four independent services coordinate a complex business transaction without any service knowing about the others — they only know about events. The cost is that you’re managing a distributed state machine instead of a database transaction. But the alternative — distributed locks held across network calls while a coordinator decides everyone’s fate — is worse.

Next up: Saga Timeouts — When Distributed Things Go Wrong

Previous: Materialized Views for Fast Queries

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

May 25, 2026