Resilience – Nodes and Edges

Part 9 in the “Building Event-Driven Microservices with Hazelcast” series

Over the past two articles, we built resilience into both sides of our saga communication. Part 7 added circuit breakers and retry to protect saga listeners against transient failures during event consumption. Part 8 added the transactional outbox to guarantee event delivery from producer to shared cluster.

Two gaps remain.

First: what happens when an event fails processing permanently? The circuit breaker exhausts retries. NonRetryableException gets thrown. The event is gone — all that survives is a log message. There’s no way to inspect what failed, understand why, or retry it later when someone fixes the underlying problem.

Second: what happens when the outbox delivers an event twice? At-least-once delivery means duplicates are possible. Without protection, the Inventory Service might reserve stock twice for the same order. The Payment Service might charge the customer twice.

This article covers two complementary patterns that close these gaps. The dead letter queue captures events that fail consumer-side processing, giving operators a way to inspect, replay, and discard them. The idempotency guard ensures each event is processed exactly once, even if delivered multiple times.

Put them together with the outbox and you get effectively-once semantics — at-least-once delivery on the producer side, exactly-once processing on the consumer side. That’s the gold standard for event-driven systems.

Part 1: Dead Letter Queue

The Problem

Consider this failure sequence in the Inventory Service’s saga listener:

Permanent failure without a dead letter queue: an OrderCreated event arrives, the InventorySagaListener picks it up and calls executeWithResilience, a non-retryable InsufficientStockException is thrown, the circuit breaker records the failure, a ResilienceException propagates, the error is logged, and the event is gone with only a log line surviving

That log message? That’s all you’ve got. In production, recovering from this means searching logs for the event ID, reconstructing the event payload from other sources, manually fixing whatever went wrong, and then figuring out how to re-trigger the saga step.

A dead letter queue captures the failed event — payload, failure reason, saga context, source service, everything — in a durable store that you can actually query and act on.

The DeadLetterEntry

Each DLQ entry preserves the full failure context:

public class DeadLetterEntry {

    private String dlqEntryId;       // UUID — unique DLQ identifier
    private String originalEventId;  // The event that failed
    private String eventType;        // "OrderCreated", "StockReserved", etc.
    private String topicName;        // The ITopic where the event was published
    private GenericRecord eventRecord; // The complete event payload for replay
    private String failureReason;    // Why processing failed
    private Instant failureTimestamp; // When the failure occurred
    private String sourceService;    // Which service failed ("inventory-service")
    private String sagaId;           // Saga context for tracing
    private String correlationId;    // Correlation context for tracing
    private int replayCount;         // How many times this entry has been replayed
    private Status status;           // PENDING, REPLAYED, or DISCARDED

    public enum Status {
        PENDING,    // Awaiting review or replay
        REPLAYED,   // Re-published to original topic
        DISCARDED   // Manually discarded by administrator
    }
}

Construction at the failure site uses a builder:

DeadLetterEntry.builder()
    .originalEventId(eventId)
    .eventType(record.getString("eventType"))
    .topicName("OrderCreated")
    .eventRecord(record)
    .failureReason(error.getMessage())
    .sourceService("inventory-service")
    .sagaId(record.getString("sagaId"))
    .correlationId(record.getString("correlationId"))
    .build();

The eventRecord field is the important one — it holds the complete GenericRecord that was published to the ITopic. When you replay the entry, this exact record gets re-published to the original topic, picking the saga back up where it left off.

The DeadLetterQueueOperations Interface

Same interface-extraction pattern we used for ResilientOperations and ServiceClientOperations (Java 25 Mockito can’t mock concrete classes, so we keep extracting interfaces — it’s becoming a running theme):

public interface DeadLetterQueueOperations {

    void add(DeadLetterEntry entry);

    List<DeadLetterEntry> list(int limit);

    DeadLetterEntry getEntry(String dlqEntryId);

    void replay(String dlqEntryId);

    void discard(String dlqEntryId);

    long count();
}

HazelcastDeadLetterQueue: IMap-Backed Storage

The implementation stores DLQ entries as Compact-serialized GenericRecords in a Hazelcast IMap — same pattern as the HazelcastOutboxStore:

public class HazelcastDeadLetterQueue implements DeadLetterQueueOperations {

    private static final String SCHEMA_NAME = "DeadLetterEntry";
    private final HazelcastInstance hazelcast;
    private final IMap<String, GenericRecord> dlqMap;
    private final DeadLetterQueueProperties properties;
    private final MeterRegistry meterRegistry;

    public HazelcastDeadLetterQueue(HazelcastInstance hazelcast,
                                     DeadLetterQueueProperties properties,
                                     MeterRegistry meterRegistry) {
        this.hazelcast = hazelcast;
        this.dlqMap = hazelcast.getMap(properties.getMapName());
        // ...
    }
}

The DLQ map lives on the shared cluster (falling back to the embedded instance if there’s no shared cluster), so it’s accessible from any service’s admin endpoint. You don’t need to know which service failed — query the DLQ from anywhere and you’ll see everything.

POJO-to-GenericRecord Conversion

Like the outbox store, conversion happens at the boundary:

static GenericRecord toRecord(final DeadLetterEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("dlqEntryId", entry.getDlqEntryId())
            .setString("originalEventId", entry.getOriginalEventId())
            .setString("eventType", entry.getEventType())
            .setString("topicName", entry.getTopicName())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setString("failureReason", entry.getFailureReason())
            .setInt64("failureTimestamp", entry.getFailureTimestamp().toEpochMilli())
            .setString("sourceService", entry.getSourceService())
            .setString("sagaId", entry.getSagaId())
            .setString("correlationId", entry.getCorrelationId())
            .setInt32("replayCount", entry.getReplayCount())
            .setString("status", entry.getStatus().name())
            .build();
}

Note setGenericRecord(“eventRecord”, …) — Compact serialization handles nested GenericRecords natively. The full event payload comes along for the ride without any special serialization work on our part.

Replay

This is where the DLQ earns its keep. Once you’ve figured out what went wrong and fixed it — restocked inventory, restarted a flaky service, whatever — you replay the entry:

@Override
public void replay(final String dlqEntryId) {
    final GenericRecord record = dlqMap.get(dlqEntryId);
    if (record == null) {
        throw new IllegalArgumentException("DLQ entry not found: " + dlqEntryId);
    }

    final DeadLetterEntry entry = fromRecord(record);

    if (entry.getStatus() != DeadLetterEntry.Status.PENDING) {
        throw new IllegalStateException(
                "Cannot replay entry in status " + entry.getStatus());
    }
    if (entry.getReplayCount() >= properties.getMaxReplayAttempts()) {
        throw new IllegalStateException(
                "Max replay attempts (" + properties.getMaxReplayAttempts() + ") exceeded");
    }

    // Re-publish to the original topic
    final GenericRecord eventRecord = entry.getEventRecord();
    if (eventRecord != null && entry.getTopicName() != null) {
        final ITopic<GenericRecord> topic = hazelcast.getTopic(entry.getTopicName());
        topic.publish(eventRecord);
    }

    // Update entry status
    entry.setReplayCount(entry.getReplayCount() + 1);
    entry.setStatus(DeadLetterEntry.Status.REPLAYED);
    dlqMap.set(dlqEntryId, toRecord(entry));

    meterRegistry.counter("dlq.entries.replayed").increment();
}

A few safety guards here. Only PENDING entries can be replayed — you can’t accidentally replay something that was already replayed or discarded. There’s a configurable max replay count (default 3) to prevent infinite replay loops if the underlying issue isn’t actually fixed. And if the eventRecord is somehow null (shouldn’t happen, but defensive coding), the status updates without attempting a publish.

Monitoring Queue Depth

The count() method uses a Hazelcast predicate to count only PENDING entries:

@Override
public long count() {
    final Collection<GenericRecord> pending = dlqMap.values(
            Predicates.equal("status", DeadLetterEntry.Status.PENDING.name()));
    return pending.size();
}

A DLQ count above zero for more than a few minutes is a flag that something needs attention. Wire this to an alert and you’ll know about failed events before anyone files a ticket.

Admin REST Endpoints

The DeadLetterQueueController exposes the DLQ through REST:

@RestController
@RequestMapping("/api/admin/dlq")
@Tag(name = "Dead Letter Queue")
public class DeadLetterQueueController {

    @GetMapping
    public ResponseEntity<List<DeadLetterEntry>> list(
            @RequestParam(defaultValue = "20") int limit) {
        return ResponseEntity.ok(deadLetterQueue.list(limit));
    }

    @GetMapping("/count")
    public ResponseEntity<Map<String, Long>> count() {
        return ResponseEntity.ok(Map.of("count", deadLetterQueue.count()));
    }

    @GetMapping("/{id}")
    public ResponseEntity<DeadLetterEntry> getEntry(@PathVariable String id) { ... }

    @PostMapping("/{id}/replay")
    public ResponseEntity<Map<String, String>> replay(@PathVariable String id) { ... }

    @DeleteMapping("/{id}")
    public ResponseEntity<Map<String, String>> discard(@PathVariable String id) { ... }
}

A typical investigation looks like this:

# How many pending entries?
curl http://localhost:8082/api/admin/dlq/count
# {"count": 2}

# What are they?
curl http://localhost:8082/api/admin/dlq
# [{"dlqEntryId":"abc-123", "originalEventId":"evt-456",
#   "eventType":"OrderCreated", "failureReason":"Insufficient stock for product PROD-789",
#   "sourceService":"inventory-service", "status":"PENDING", ...}]

# Get the full details on one
curl http://localhost:8082/api/admin/dlq/abc-123

# Fix the problem (restock inventory), then replay
curl -X POST http://localhost:8082/api/admin/dlq/abc-123/replay
# {"status":"replayed", "dlqEntryId":"abc-123"}

# Or discard if the saga already timed out and compensation ran
curl -X DELETE http://localhost:8082/api/admin/dlq/abc-123
# {"status":"discarded", "dlqEntryId":"abc-123"}

Integration with Saga Listeners

Each saga listener injects the DLQ as an optional dependency:

@Autowired(required = false)
public void setDeadLetterQueue(DeadLetterQueueOperations deadLetterQueue) {
    this.deadLetterQueue = deadLetterQueue;
}

Failed events get routed to the DLQ in the error handler:

private void sendToDeadLetterQueue(GenericRecord record, String topicName, Throwable error) {
    String eventId = record.getString("eventId");
    if (deadLetterQueue != null) {
        try {
            deadLetterQueue.add(DeadLetterEntry.builder()
                    .originalEventId(eventId)
                    .eventType(record.getString("eventType"))
                    .topicName(topicName)
                    .eventRecord(record)
                    .failureReason(error.getMessage())
                    .sourceService("inventory-service")
                    .sagaId(record.getString("sagaId"))
                    .correlationId(record.getString("correlationId"))
                    .build());
            logger.warn("Event {} sent to DLQ after failure: {}", eventId, error.getMessage());
        } catch (Exception dlqError) {
            logger.error("Failed to send event {} to DLQ: {}", eventId, dlqError.getMessage());
        }
    } else {
        // Fallback: existing behavior (log only)
        if (error instanceof ResilienceException) {
            logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
        } else {
            logger.error("Failed to process event: {}", eventId, error);
        }
    }
}

The try/catch around deadLetterQueue.add() is defensive. If the DLQ itself fails — shared cluster unreachable, say — we fall back to logging. The DLQ is best-effort, not a hard requirement. Losing an event and failing to capture it in the DLQ would be truly unlucky, but it shouldn’t bring the service down.

Part 2: Idempotency Guard

The Problem

The transactional outbox gives us at-least-once delivery. Combined with ITopic’s own delivery behavior (listeners that reconnect after a brief disconnection may receive messages again), the same event can arrive at a consumer more than once:

Duplicate delivery sequence: the OutboxPublisher publishes OrderCreated to the shared cluster, which forwards it to the Inventory Listener; the markDelivered call times out, so on the next poll cycle the publisher re-publishes the same event, and without protection the Inventory Listener performs a double stock reservation

Without protection, inventory gets reserved twice. The customer gets charged twice. The order gets confirmed twice. Nobody wants that.

Atomic Check-and-Claim

The fix is Hazelcast’s putIfAbsent — an atomic, cluster-wide check-and-set that ensures each event ID gets processed exactly once:

public class HazelcastIdempotencyGuard implements IdempotencyGuard {

    private final IMap<String, Long> processedEventsMap;
    private final long ttlMillis;
    private final MeterRegistry meterRegistry;

    public HazelcastIdempotencyGuard(HazelcastInstance hazelcast,
                                      IdempotencyProperties properties,
                                      MeterRegistry meterRegistry) {
        this.processedEventsMap = hazelcast.getMap(properties.getMapName());
        this.ttlMillis = properties.getTtl().toMillis();
        this.meterRegistry = meterRegistry;
    }

    @Override
    public boolean tryProcess(final String eventId) {
        Long previous = processedEventsMap.putIfAbsent(
                eventId, System.currentTimeMillis(), ttlMillis, TimeUnit.MILLISECONDS);

        boolean firstTime = (previous == null);
        meterRegistry.counter("idempotency.checks",
                "result", firstTime ? "miss" : "hit").increment();

        if (!firstTime) {
            logger.debug("Duplicate event detected: eventId={}", eventId);
        }

        return firstTime;
    }
}

The interface is one method:

public interface IdempotencyGuard {
    boolean tryProcess(String eventId);
}

Returns true if this is the first time the event ID has been seen — go ahead and process it. Returns false if someone already claimed it — skip.

How putIfAbsent Works

IMap.putIfAbsent(key, value, ttl, timeUnit) is atomic. If the key doesn’t exist, it inserts the pair and returns null. If it does exist, it returns the existing value and does nothing. This atomicity holds across cluster members — two listeners on different nodes processing the same event simultaneously will never both get null. Exactly one wins, the other backs off.

TTL: Forgetting Old Events

The putIfAbsent includes a TTL (default: 1 hour). After that, the event ID is removed from the map, and the same event could theoretically be reprocessed if it somehow arrived again.

Why an hour? It’s a memory management decision. Without a TTL, the processed events map grows forever. With a 1-hour window, we hold at most an hour’s worth of event IDs, which is bounded and predictable. Since our outbox publisher has a 1-second poll interval with 5 retries, duplicates arrive within seconds — an hour of margin is more than sufficient.

Integration with Saga Listeners

Each saga listener checks the guard at the top of its message handler:

class OrderCreatedListener implements MessageListener<GenericRecord> {

    @Override
    public void onMessage(Message<GenericRecord> message) {
        GenericRecord record = message.getMessageObject();

        String eventId = record.getString("eventId");
        if (idempotencyGuard != null && eventId != null
                && !idempotencyGuard.tryProcess(eventId)) {
            logger.debug("Duplicate event {} already processed, skipping", eventId);
            return;
        }

        // ... proceed with normal processing
    }
}

Three null checks for graceful degradation: if idempotency isn’t configured, process everything (no deduplication). If the event doesn’t have an ID, skip the check. If tryProcess() returns false, it’s a duplicate — drop it silently.

The Processed Events Map

The map lives on the shared cluster, so deduplication works across all service instances:

Key	Value	TTL
evt-abc-123	1738000000000 (timestamp)	1 hour
evt-def-456	1738000001000	1 hour
evt-ghi-789	1738000002000	1 hour

The value — a processing timestamp — is purely informational. Only the key’s presence or absence matters for deduplication. But the timestamp is handy for debugging: it tells you exactly when an event was first processed.

How the Three Patterns Work Together

The outbox, DLQ, and idempotency guard form a complete reliability pipeline:

The three patterns working together: on the producer side the EventSourcingController writes to the OUTBOX IMap, the OutboxPublisher polls and publishes to the shared ITopic, then marks the entry delivered; on the consumer side the saga listener checks the IdempotencyGuard (duplicates are skipped), runs executeWithResilience, and on failure routes the event to the framework_DLQ IMap for admin replay or discard

Let’s walk through what happens when things go wrong.

The OrderCreated event comes out of the Jet pipeline and gets written to the outbox. The OutboxPublisher picks it up, publishes to the shared cluster’s OrderCreated ITopic, and tries to mark it DELIVERED. But the markDelivered call times out. Next poll cycle, the publisher re-publishes the same event. Now it’s been delivered twice.

Over on the consumer side, the Inventory Service’s OrderCreatedListener receives both copies. The first call to idempotencyGuard.tryProcess(“evt-123”) returns true — process it. The second call returns false — duplicate, skip it. Only one stock reservation happens.

But that first delivery hits a problem: the product is out of stock. InsufficientStockException is non-retryable. The circuit breaker records the failure, ResilienceException propagates up to whenComplete(), and sendToDeadLetterQueue() captures everything — the full event payload, the failure reason, the saga ID, the source service. It’s all sitting in the framework_DLQ IMap, waiting.

An operator (or an LLM, as we’ll see in a moment) checks the DLQ, sees the pending entry, restocks the product, and replays the event. The OrderCreated record gets re-published to the ITopic, the saga picks up, and the order completes.

One wrinkle: the replayed event carries the same eventId as the original. If the 1-hour idempotency TTL hasn’t expired yet, the guard will block it as a duplicate. In practice this isn’t an issue — by the time you’ve investigated the failure, diagnosed the root cause, and fixed it, an hour has usually passed. It’s a deliberate trade-off: short-window deduplication versus immediate replay. We chose deduplication.

Configuration Reference

Dead Letter Queue: framework.dlq.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_DLQ	IMap name on shared cluster
max-replay-attempts	3	Maximum replays before permanent block
entry-ttl	168h	7-day retention for DLQ entries

Idempotency Guard: framework.idempotency.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_PROCESSED_EVENTS	IMap name on shared cluster
ttl	1h	How long to remember processed event IDs

Metrics

Metric	Type	Description
dlq.entries.added	Counter	Events added to the DLQ
dlq.entries.replayed	Counter	Events replayed from the DLQ
dlq.entries.discarded	Counter	Events discarded from the DLQ
idempotency.checks	Counter (tagged: result=hit\|miss)	Deduplication checks

The Complete Resilience Stack

Across Parts 7, 8, and 9, we’ve built five interlocking patterns:

Pattern	Layer	Purpose	Protects Against
Circuit Breaker	Consumer	Automatic service isolation	Cascade failures
Retry + Backoff	Consumer	Transient failure recovery	Network blips, brief outages
Transactional Outbox	Producer	Guaranteed delivery	Shared cluster unavailability
Dead Letter Queue	Consumer	Failure capture and replay	Permanent processing failures
Idempotency Guard	Consumer	Exactly-once processing	Duplicate delivery

They’re all optional — enabled by default, disabled with a single property toggle. They’re all auto-configured by Spring Boot. They all expose Micrometer metrics. And when disabled, the framework falls back to its previous behavior without breaking anything.

Three articles ago, we had a fire-and-forget event pipeline where a network blip could lose an event forever. Now we have guaranteed delivery, deduplication, failure capture, and replay. Same pipeline, five patterns later.

Try It Yourself

The demo script includes a complete DLQ investigation scenario — fault injection, failure capture, investigation, and replay — in 11 guided steps:

./scripts/demo-scenarios.sh 7

That’s the curl-based version. No LLM required.

The AI-Powered Version

This is more fun. Connect the MCP server from Part 6 to your LLM client — Claude Desktop, Claude Code, ChatGPT, whatever you’ve got — and try this prompt:

“Run the DLQ investigation demo — inject a failure, place an order, and show me what’s in the dead letter queue.”

The LLM calls runDemo to set up the scenario, then listDlqEntries and inspectDlqEntry to investigate. It tells you what happened — which event failed, at which service, and why — and suggests a fix. You say “replay it.” It calls replayDlqEntry, the saga completes, and you’ve just done incident response through a conversation.

No curl commands. No JSON parsing. No copy-pasting UUIDs. The LLM handles the plumbing while you make the decisions.

If the LLM already has context from earlier in the session, a shorter version works:

“Run the dlq_investigation demo scenario and tell me what you find.”

Next up: Choreography vs Orchestration: Two Saga Patterns

Previous: Hazelcast Transactional Outbox: Guaranteed Delivery

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

Tag: Resilience

Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Part 1: Dead Letter Queue

The Problem

The DeadLetterEntry

The DeadLetterQueueOperations Interface

HazelcastDeadLetterQueue: IMap-Backed Storage

POJO-to-GenericRecord Conversion

Replay

Monitoring Queue Depth

Admin REST Endpoints

Integration with Saga Listeners

Part 2: Idempotency Guard

The Problem

Atomic Check-and-Claim

How putIfAbsent Works

TTL: Forgetting Old Events

Integration with Saga Listeners

The Processed Events Map

How the Three Patterns Work Together

Configuration Reference

Dead Letter Queue: framework.dlq.*

Idempotency Guard: framework.idempotency.*

Metrics

The Complete Resilience Stack

Try It Yourself

The AI-Powered Version

On Debugging by Assumption