Event Sourcing Observability: Prometheus, Grafana, Jaeger

Part 12 in the “Building Event-Driven Microservices with Hazelcast” series

Over the past eleven posts, we’ve built an event sourcing framework, a Jet pipeline, materialized views, sagas, circuit breakers, an outbox, dead letter queues, and durable persistence. That’s a lot of moving parts.

Now: how do you observe what’s happening inside all of them?

Event sourcing changes the observability game. Traditional request-response applications have easy metrics — request rate, error rate, latency. In an event-sourced system, a single API call triggers an asynchronous pipeline that writes to an event store, updates a materialized view, publishes to subscribers, and potentially kicks off a multi-service saga. A latency spike could be hiding in any of those stages. You need to see into all of them.

This post builds a complete observability stack: Prometheus + Micrometer for metrics, Grafana for dashboards and alerting, Jaeger for distributed tracing.

How It Fits Together

Figure: observability stack architecture. Four Spring Boot services on ports 8081 through 8084 expose /actuator/prometheus endpoints that Prometheus (port 9090) scrapes every 15 seconds, and export OTLP spans to Jaeger (UI on port 16686). Grafana on port 3000 reads from Prometheus and renders five auto-provisioned dashboards with six alerts.

Each service exposes /actuator/prometheus. Prometheus scrapes all four every 15 seconds. Grafana reads from Prometheus and renders dashboards. Jaeger collects distributed traces via OTLP.

Instrumenting the Framework

Metrics Architecture

An event-sourced system has a lot of moving parts, and each one needs its own instrumentation. The framework provides roughly 70 metrics across a dozen categories. They’re organized around two core subsystems — the event pipeline and the saga layer — plus several supporting categories for everything else.

Pipeline Metrics

PipelineMetrics tracks every event through the 6-stage pipeline:

public class PipelineMetrics {

    private final MeterRegistry registry;
    private final String domainName;

    // Events entering and leaving the pipeline
    Counter eventsReceived;   // "eventsourcing.pipeline.events.received"
    Counter eventsProcessed;  // "eventsourcing.pipeline.events.processed"
    Counter eventsFailed;     // "eventsourcing.pipeline.events.failed"

    // End-to-end latency histogram with percentiles
    Timer endToEndLatency;    // "eventsourcing.pipeline.latency.end_to_end"
    Timer queueWaitLatency;   // "eventsourcing.pipeline.latency.queue_wait"

    // Per-stage timing
    Timer stageDuration;      // "eventsourcing.pipeline.stage.duration"
                              // Tagged with stage: persist, update_view, publish
}

Every metric is tagged with domain (e.g., “Customer”, “Order”) and eventType (e.g., “CustomerCreated”), so you can filter as narrowly as you need:

Counter.builder("eventsourcing.pipeline.events.processed")
    .tag("domain", domainName)
    .tag("eventType", eventType)
    .register(registry);

The per-stage timer is the one I find most useful for debugging. If P99 spikes, you can see which stage is the bottleneck — is it the event store write, the view update, or the publication step?

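public enum PipelineStage {
    SOURCE("source"),
    ENRICH("enrich"),
    PERSIST("persist"),
    UPDATE_VIEW("update_view"),
    PUBLISH("publish"),
    COMPLETE("complete");

    private final String label;

    PipelineStage(String label) { this.label = label; }

    // Value used for the "stage" tag on eventsourcing.pipeline.stage.duration
    public String getLabel() { return label; }
}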

public void recordStageTiming(PipelineStage stage, Instant start) {
    Timer.builder("eventsourcing.pipeline.stage.duration")
        .tag("domain", domainName)
        .tag("stage", stage.getLabel())
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(registry)
        .record(Duration.between(start, Instant.now()));
}

Saga Metrics

SagaMetrics tracks the lifecycle of distributed sagas:

public class SagaMetrics {

    private static final String PREFIX = "saga";

    // Lifecycle counters (tagged by sagaType)
    //   saga.started                Sagas initiated
    //   saga.completed              Sagas completed successfully
    //   saga.compensated            Sagas that required compensation
    //   saga.failed                 Sagas that failed
    //   saga.timedout               Sagas that exceeded their deadline

    // Step-level counters
    //   saga.steps.completed        Individual steps completed
    //   saga.steps.failed           Individual steps failed
    //   saga.compensation.started   Compensation processes initiated
    //   saga.compensation.steps     Compensation steps executed

    // Duration timers (p50, p95, p99)
    //   saga.duration               End-to-end saga duration
    //   saga.compensation.duration  Compensation duration
}

Beyond the Core

Pipeline and saga metrics tell you how events flow and how transactions coordinate. But a production system has more to watch. The framework instruments several additional subsystems:

The outbox pattern guarantees at-least-once delivery to the shared cluster. Metrics track entries written, claimed, delivered, and failed. If outbox.entries.written is climbing faster than outbox.entries.delivered, your delivery pipeline is falling behind.

Events that fail delivery repeatedly land in the DLQ. Three counters — dlq.entries.added, dlq.entries.replayed, dlq.entries.discarded — tell you whether poison messages are accumulating or getting resolved.

When PostgreSQL persistence is enabled, write/read latency, batch sizes, and error counts are tracked per map. High persistence.store.duration points to database bottlenecks.

The idempotency guard tracks duplicate detection tagged hit (duplicate blocked) or miss (new event). A high hit ratio under normal operation is actually good news — it means at-least-once delivery is working and duplicates are being caught.
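To make the hit/miss tagging concrete, here is a minimal sketch of a guard that counts both outcomes. It uses plain JDK types only; the class name IdempotencyGuardSketch, the method tryAccept, and the in-memory Set are assumptions made for illustration and are not the framework's actual implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical guard for illustration; the framework's real class may differ.
class IdempotencyGuardSketch {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    final LongAdder hits = new LongAdder();    // duplicate blocked
    final LongAdder misses = new LongAdder();  // new event accepted

    /** Returns true if the event ID is new and the event should be processed. */
    boolean tryAccept(String eventId) {
        boolean isNew = seen.add(eventId);     // atomic check-and-record
        if (isNew) {
            misses.increment();
        } else {
            hits.increment();
        }
        return isNew;
    }

    /** Share of deliveries that were duplicates, between 0.0 and 1.0. */
    double hitRatio() {
        long total = hits.sum() + misses.sum();
        return total == 0 ? 0.0 : (double) hits.sum() / total;
    }

    public static void main(String[] args) {
        IdempotencyGuardSketch guard = new IdempotencyGuardSketch();
        // evt-1 and evt-2 are each delivered twice (at-least-once delivery).
        String[] deliveries = {"evt-1", "evt-2", "evt-1", "evt-3", "evt-2"};
        for (String id : deliveries) {
            guard.tryAccept(id);
        }
        System.out.println("hits=" + guard.hits.sum()
                + " misses=" + guard.misses.sum()
                + " hitRatio=" + guard.hitRatio());
    }
}
```

In a real service the two LongAdder fields would be Micrometer counters sharing one metric name and differing only in a result tag, which is what lets a dashboard compute the ratio.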

Circuit breaker state, failure rates, and retry outcomes from Resilience4j are exposed automatically. When the resilience4j_circuitbreaker_state gauge reports a breaker as open (the series labeled state="open" goes to 1), a downstream service is in trouble.

And then there are business metrics — revenue, order item counts, customer totals, inventory replenishment. These are the ones business stakeholders actually care about. They bridge the gap between “pipeline is fast” and “orders are generating revenue.”

The complete catalog of every metric name, tag, Prometheus mapping, and troubleshooting guide is in the Metrics Reference Guide.

Caching Metric Instances

At 100,000+ events per second, looking up a Counter in Micrometer’s registry on every event is measurable overhead. SagaMetrics caches metric instances in a ConcurrentHashMap:

private final ConcurrentMap<String, Counter> counterCache;
private final ConcurrentMap<String, Timer> timerCache;

private Counter getCounter(String name, String sagaType) {
    String key = name + ":" + sagaType;
    return counterCache.computeIfAbsent(key, k ->
            Counter.builder(PREFIX + "." + name)
                    .tag("sagaType", sagaType)
                    .register(meterRegistry)
    );
}

Small detail, but at high throughput every nanosecond in the hot path matters.
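The effect of the cache is easy to see with plain JDK types. This is a minimal sketch, assuming a stand-in SimpleCounter in place of Micrometer's Counter; both class names are hypothetical, and the builds field exists only to count how often the build step runs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

class CounterCacheSketch {
    private final ConcurrentMap<String, SimpleCounter> cache = new ConcurrentHashMap<>();
    // Counts how often the (comparatively expensive) build step runs.
    final AtomicInteger builds = new AtomicInteger();

    SimpleCounter get(String name, String sagaType) {
        String key = name + ":" + sagaType;
        // The lambda runs only on the first request for a given key.
        return cache.computeIfAbsent(key, k -> {
            builds.incrementAndGet();
            return new SimpleCounter();
        });
    }

    public static void main(String[] args) {
        CounterCacheSketch metrics = new CounterCacheSketch();
        for (int i = 0; i < 1_000; i++) {
            metrics.get("started", "OrderFulfillment").increment();
        }
        System.out.println("count=" + metrics.get("started", "OrderFulfillment").count()
                + " builds=" + metrics.builds.get());
    }
}

// Hypothetical stand-in for a Micrometer Counter; illustration only.
class SimpleCounter {
    private final LongAdder count = new LongAdder();
    void increment() { count.increment(); }
    long count() { return count.sum(); }
}
```

A thousand increments go through a single build, so the hot path pays for one map lookup instead of a builder and a registry call. The cost is that keys concatenate the name and the tag value, which stays manageable only while the set of sagaType values is small and bounded.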

JVM and System Metrics

Beyond application metrics, the framework auto-registers JVM metrics via MetricsConfig:

@Configuration
public class MetricsConfig {

    @Bean public JvmMemoryMetrics jvmMemoryMetrics()   { return new JvmMemoryMetrics(); }
    @Bean public JvmGcMetrics jvmGcMetrics()           { return new JvmGcMetrics(); }
    @Bean public JvmThreadMetrics jvmThreadMetrics()    { return new JvmThreadMetrics(); }
    @Bean public ClassLoaderMetrics classLoaderMetrics() { return new ClassLoaderMetrics(); }
    @Bean public ProcessorMetrics processorMetrics()    { return new ProcessorMetrics(); }
}

Common tags applied to every metric enable cross-service filtering:

registry.config().commonTags(Arrays.asList(
    Tag.of("application", applicationName),
    Tag.of("version", applicationVersion)
));

Every metric — JVM heap usage, saga completion rates, business revenue — can be filtered by service name in Grafana.

Configuring Prometheus

Service-Side: Spring Boot Actuator

Each service exposes metrics via Actuator with Prometheus export:

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics,info
  endpoint:
    health:
      show-details: always
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}

This creates the /actuator/prometheus endpoint that Prometheus scrapes.

Prometheus-Side: Scrape Configuration

Prometheus scrapes all four services and the Hazelcast cluster:

# docker/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'ecommerce-demo'

scrape_configs:
  - job_name: 'account-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['account-service:8081']

  - job_name: 'inventory-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['inventory-service:8082']

  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['order-service:8083']

  - job_name: 'payment-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['payment-service:8084']

  - job_name: 'hazelcast'
    metrics_path: '/hazelcast/rest/cluster'
    static_configs:
      - targets:
          - 'hazelcast-1:5701'
          - 'hazelcast-2:5701'
          - 'hazelcast-3:5701'

Each service gets its own job, so Prometheus automatically labels its metrics with job="account-service", job="inventory-service", and so on.

Grafana Dashboards

Dashboard Strategy

Five pre-provisioned dashboards, each focused on a different operational concern:

Dashboard	Focus	Key Question It Answers
System Overview	Service health and throughput	“Is everything running? How much traffic?”
Event Flow	Pipeline performance	“How fast are events processing? Where are bottlenecks?”
Materialized Views	View update performance	“Are views keeping up with events?”
Saga Dashboard	Distributed transaction health	“Are sagas completing? Any failures or timeouts?”
Business Overview	Revenue, orders, customers	“Is the business healthy? Are orders generating revenue?”

System Overview is the home dashboard — first thing you see when you open Grafana.

Auto-Provisioning

Dashboards, datasources, and alerts are all provisioned automatically. When Grafana starts, it reads configuration from mounted volumes:

docker/grafana/
├── dashboards/                    # Dashboard JSON files
│   ├── system-overview.json       # Auto-loaded as home dashboard
│   ├── event-flow.json
│   ├── materialized-views.json
│   ├── saga-dashboard.json
│   └── business-overview.json
└── provisioning/
    ├── datasources/
    │   └── datasources.yml        # Points to Prometheus
    ├── dashboards/
    │   └── dashboards.yml         # Tells Grafana where to find JSONs
    └── alerting/
        ├── alerts.yml             # Alert rule definitions
        ├── contactpoints.yml      # Notification channels
        └── policies.yml           # Routing policies

No manual setup. docker-compose up and the dashboards are ready.

Key Dashboard Panels

System Overview

At-a-glance health for the whole system: service health indicators (green/red per service based on up{job="…"}), event throughput by service, HTTP request rates, pipeline P95 latency, and a saga summary showing started, completed, failed, and timed out counts.

Saga Dashboard

Deep visibility into distributed transactions. Active saga count and compensating count — how many are in flight right now. Throughput charts for start, complete, and compensate rates, filterable by sagaType. Duration percentiles at P50, P95, P99. Success rate as a percentage. Timeout detection rate. Compensation breakdown — are any compensation steps failing?

The dashboard supports a $sagaType variable, so you can filter to just “OrderFulfillment” or “OrderFulfillmentOrchestrated” or view everything at once.

Event Flow

The pipeline performance dashboard: events published per second by service, end-to-end latency percentiles, queue wait latency (are events sitting around before processing starts?), a stacked stage duration breakdown at P95 for persist, update_view, and publish, and failed events by stage and type.

Business Overview

This one bridges technical and business concerns. Cumulative revenue over time, order rate with item counts, customer growth, saga success rate (what percentage of orders complete without compensation?), and end-to-end saga duration.

Alerting

Pre-Configured Alerts

Six alerts that cover the most common failure modes:

Saga Alerts

Alert	Severity	Condition	For
High Saga Failure Rate	Critical	increase(saga_failed_total[5m]) > 0	2 min
Saga Timeouts Detected	Warning	increase(saga_timeouts_detected_total[5m]) > 0	2 min
Saga Compensation Failures	Critical	increase(saga_compensations_failed_total[5m]) > 0	1 min
Low Saga Success Rate	Warning	Success rate < 90% over 10 minutes	5 min

Service Health Alerts

Alert	Severity	Condition	For
Service Down	Critical	up < 1 for any service	1 min
High Event Processing Error Rate	Warning	Error rate > 5% over 5 minutes	3 min

The “For” duration prevents flapping — a brief network blip won’t page you at 3am. Compensation failures fire fastest (1 minute) because a failed compensation means money or inventory is in an inconsistent state.
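The "For" behavior can be modeled in a few lines. The sketch below is a simplified model, not Grafana's actual code: the condition has to stay breached across consecutive evaluations for the whole pending period before the alert fires, and a single healthy evaluation resets the clock. The class name PendingAlertSketch and its State enum are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of an alert rule's "For" (pending) period; illustration only.
class PendingAlertSketch {
    enum State { NORMAL, PENDING, FIRING }

    private final Duration forDuration;
    private Instant breachedSince;  // null while the condition is healthy

    PendingAlertSketch(Duration forDuration) {
        this.forDuration = forDuration;
    }

    /** Called once per evaluation with the current result of the alert condition. */
    State evaluate(boolean conditionBreached, Instant now) {
        if (!conditionBreached) {
            breachedSince = null;   // any healthy evaluation resets the clock
            return State.NORMAL;
        }
        if (breachedSince == null) {
            breachedSince = now;    // first breached evaluation starts the clock
        }
        Duration breachedFor = Duration.between(breachedSince, now);
        return breachedFor.compareTo(forDuration) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        PendingAlertSketch alert = new PendingAlertSketch(Duration.ofMinutes(2));
        Instant start = Instant.parse("2026-01-01T03:00:00Z");
        // One breached evaluation, a recovery, then a sustained breach.
        boolean[] breached = {true, false, true, true, true, true, true, true, true, true, true};
        for (int i = 0; i < breached.length; i++) {
            Instant now = start.plusSeconds(15L * i);  // 15s evaluation interval
            System.out.println(now + " " + alert.evaluate(breached[i], now));
        }
    }
}
```

The isolated breach at the start never gets past PENDING, which is the 3am page that doesn't happen.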

Distributed Tracing with Jaeger

Metrics tell you that something is slow. Tracing tells you why.

In our system, a single order placement can touch four services: Order creates the order, Inventory reserves stock, Payment processes the charge, Order confirms. With metrics alone, you see “P99 saga duration increased.” With tracing, you see “Payment Service is taking 2 seconds to respond to StockReserved events.” That’s the difference between knowing there’s a problem and knowing where it is.

Configuration

Tracing is enabled via Spring Boot’s OpenTelemetry integration:

management:
  tracing:
    enabled: true
    sampling:
      probability: 1.0   # Sample 100% of requests (reduce in production)
  otlp:
    tracing:
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}

Jaeger runs as an all-in-one container in the Docker stack, receiving traces via OTLP on port 4317. In the Jaeger UI (http://localhost:16686), you pick a service, find traces for a time window, click into a trace to see the span waterfall across all four services, and identify which operation is contributing to latency.

The most valuable traces in an event-sourced system: the full path from API request to pipeline completion, saga flows (OrderCreated → StockReserved → PaymentProcessed → OrderConfirmed), and compensation flows — where did the failure occur, and how long did compensation take?

Useful PromQL Queries

Queries you can run in Prometheus or use in custom Grafana panels:

Service Health

# Are all services up?
up{job=~".*-service"}

# HTTP request rate by service
sum by (application) (rate(http_server_requests_seconds_count[5m]))

Event Pipeline

# Event throughput
rate(eventsourcing_pipeline_events_processed_total[5m])

# End-to-end P99 latency
histogram_quantile(0.99, rate(eventsourcing_pipeline_latency_end_to_end_seconds_bucket[5m]))

# Queue wait time (events waiting to be processed)
histogram_quantile(0.95, rate(eventsourcing_pipeline_latency_queue_wait_seconds_bucket[5m]))
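
It helps to know what histogram_quantile does with those buckets: it finds the bucket containing the requested rank and interpolates linearly inside it. Below is a simplified sketch of that calculation for classic cumulative buckets. The class and method names are hypothetical, and real Prometheus handles more edge cases than this.

```java
// Simplified sketch of bucket interpolation; illustration only.
class HistogramQuantileSketch {

    /**
     * @param q                quantile between 0 and 1
     * @param upperBounds      ascending bucket bounds ("le"), last one +Infinity
     * @param cumulativeCounts observations at or below each bound (cumulative)
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2) {
            return Double.NaN;
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN;               // no observations in the window
        }
        double rank = q * total;
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;                             // first bucket whose count reaches the rank
        }
        if (b == n - 1) {
            return upperBounds[n - 2];       // rank falls in +Inf bucket: cap at last finite bound
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 99, 100};
        System.out.println("p50=" + quantile(0.50, le, counts));
        System.out.println("p95=" + quantile(0.95, le, counts));
    }
}
```

Two practical consequences follow. The result can never be more precise than the bucket boundaries, so a P99 that always reads exactly as a bucket bound usually means the buckets are too coarse. And the _bucket series these queries rely on exist only for timers that publish histogram buckets (in Micrometer, publishPercentileHistogram()); publishPercentiles alone emits precomputed quantiles instead. Whether the framework enables bucket publication for these timers isn't shown in the snippets here, so treat that as something to verify.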

Sagas

# Saga completion rate
rate(saga_completed_total[5m])

# Saga success rate (percentage)
100 * sum(saga_completed_total) /
  (sum(saga_completed_total) + sum(saga_compensated_total) +
   sum(saga_failed_total) + sum(saga_timedout_total))

# Saga duration P99
histogram_quantile(0.99, rate(saga_duration_seconds_bucket[5m]))

# Timeouts detected in the last 5 minutes
increase(saga_timeouts_detected_total[5m])

Business

# Cumulative revenue
order_revenue_total

# Order items per second
rate(order_items_count_total[5m])

# Current customer count
account_customers_total

JVM

# Heap memory usage
jvm_memory_used_bytes{area="heap"}

# Fraction of time spent in GC pauses (pause seconds per second)
rate(jvm_gc_pause_seconds_sum[5m])

Lessons Learned

Don’t just measure HTTP latency. In an event-sourced system, the interesting latency is inside the pipeline — from event submission to view update. HTTP latency includes that but hides where the time is spent.

Multi-dimensional tags (domain, eventType, sagaType, stage) are not optional. A P99 spike in “pipeline latency” is useless without knowing which domain and stage are affected.

Cache your Counter and Timer instances at high throughput. Registry lookups add up. ConcurrentHashMap.computeIfAbsent works well.

Provision everything as code. Don’t create dashboards by hand — provision them from JSON files. Your observability stack is version-controlled, reproducible, and deploys automatically. When someone clones the repo and runs docker-compose up, they get the same dashboards as everyone else.

Alert on business outcomes, not just infrastructure. “Service Down” is an infrastructure alert. “Saga Failure Rate” is a business outcome alert. Both matter, but the business alerts catch problems that don’t manifest as service crashes — like a payment gateway returning errors, causing saga compensations to spike while all four services stay green.

The framework provides roughly 70 metrics organized into a dozen categories — pipeline throughput and per-stage latency, saga lifecycle tracking, outbox delivery, dead letter queues, persistence latency, circuit breaker state, business KPIs, JVM health, HTTP request rates. Combined with auto-provisioned Grafana dashboards, pre-configured alerts, and distributed tracing via Jaeger, you get complete visibility into a system where a single API call can trigger asynchronous processing across four services.

Event sourcing makes observability both harder and more important. Events are asynchronous, distributed, and flow through multiple stages. Without good metrics and dashboards, you’re flying blind. The Metrics Reference Guide has the complete catalog.

Previous: Hazelcast Write-Behind MapStore: Durable Event Sourcing

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 21, 2026

Hazelcast Write-Behind MapStore: Durable Event Sourcing

Part 11 in the “Building Event-Driven Microservices with Hazelcast” series

In Part 9 and Part 10, we finished the reliability and coordination layer — dead letter queues, idempotency guards, two saga patterns. But there’s been a fundamental gap this whole time: every piece of data lives exclusively in Hazelcast IMaps. A full cluster restart erases everything. The event store, the materialized views, the saga state. Gone.

For a demo that runs 30 minutes, that’s fine. For a production system — or even a trade show booth running for hours — it’s not. Events are the source of truth in an event-sourced system. Losing them means losing business history.

This post covers how we added durable persistence to the framework, primarily through Hazelcast’s MapStore mechanism but with one notable exception, without changing a single line of service business logic.

The Problem

Our event sourcing pipeline writes to several types of IMaps: the event store (Customer_ES, Product_ES, etc.), materialized views (Customer_VIEW, Order_VIEW, etc.), and supporting maps for saga state, the outbox, and the DLQ. All in-memory. Hazelcast’s IMap is fast precisely because it avoids disk I/O.

But that creates two problems. First, data loss on restart — the event log is gone, you can’t rebuild views or replay events or audit what happened. Second, unbounded memory growth — during a long-running demo, events accumulate indefinitely, the JVM runs out of heap, and the pod gets OOMKilled. We saw this happen at about the 45-minute mark under sustained load.

We need to persist to a durable store (PostgreSQL) while keeping the in-memory performance characteristics intact.

Why MapStore?

Hazelcast’s MapStore interface is the natural integration point. It’s a callback mechanism — Hazelcast calls your code whenever entries are written to or read from an IMap:

Figure: write-behind MapStore data flow. The service calls IMap.put on the Hazelcast IMap and returns immediately with no database wait; the MapStore buffers writes and flushes them to PostgreSQL in asynchronous JDBC batches, while MapLoader reloads entries from PostgreSQL on a cache miss or cold start.

We use write-behind mode. Hazelcast buffers writes and flushes them asynchronously in batches. The IMap.put() call returns immediately — the service never waits for PostgreSQL. This matters because our Jet pipeline calls put() on every event, and we can’t afford database latency in the hot path.

There’s also write-through mode (writeDelaySeconds=0), where every put() synchronously writes to the database. We don’t use it. It would negate the entire point of in-memory processing.

The MapStore also implements MapLoader, which Hazelcast calls on cache misses and cold starts. This gives us automatic rehydration: if a service restarts, the views reload from PostgreSQL without any special recovery code. No replay, no rebuild — the data is just there.

Architecture

The persistence layer splits across two modules, with provider-agnostic interfaces in framework-core and database-specific implementations in framework-postgres. The full design rationale is in ADR 012.

Provider-Agnostic Interfaces (framework-core)

Four persistence interfaces, one per map type:

public interface EventStorePersistence {
    void persist(String mapName, PersistableEvent event);
    void persistBatch(String mapName, List<PersistableEvent> events);
    Optional<PersistableEvent> loadEvent(String mapName, String mapKey);
    Iterable<String> loadAllKeys(String mapName);
    void delete(String mapName, String mapKey);
    boolean isAvailable();
}

ViewStorePersistence follows the same shape but uses upsert semantics — newer entries replace older ones for the same key. OutboxStorePersistence adds loadNonDeliveredKeys() for recovering in-flight entries on restart. DlqStorePersistence adds loadPendingKeys() for the same reason.

These interfaces know nothing about Hazelcast, GenericRecord, or Compact serialization. They operate on portable record types — PersistableEvent, PersistableView, PersistableOutboxEntry, PersistableDeadLetterEntry — simple Java records containing strings and longs. Clean enough for any JDBC-compatible database.

MapStore Adapters (framework-core)

EventStoreMapStore, ViewStoreMapStore, and OutboxMapStore implement Hazelcast’s MapStore interface and delegate to the persistence interfaces above. They handle key serialization (converting PartitionedSequenceKey<String> to a string format like seq:12345|key:cust-001), GenericRecord-to-JSON conversion via GenericRecordJsonConverter, and metadata extraction from GenericRecord fields.
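As a rough illustration of the key serialization step, here is a sketch that encodes and decodes the seq:…|key:… string format mentioned above. The class MapKeyCodecSketch and its methods are hypothetical; the real adapters work with PartitionedSequenceKey<String> and may handle more cases.

```java
// Hypothetical codec for the "seq:<sequence>|key:<key>" format; illustration only.
class MapKeyCodecSketch {
    private static final String SEQ_PREFIX = "seq:";
    private static final String KEY_MARKER = "|key:";

    static String encode(long sequence, String key) {
        return SEQ_PREFIX + sequence + KEY_MARKER + key;
    }

    static long sequenceOf(String encoded) {
        int sep = separatorIndex(encoded);
        return Long.parseLong(encoded.substring(SEQ_PREFIX.length(), sep));
    }

    static String keyOf(String encoded) {
        return encoded.substring(separatorIndex(encoded) + KEY_MARKER.length());
    }

    // The sequence is numeric, so the first marker is always the real separator,
    // even if the key itself happens to contain "|key:".
    private static int separatorIndex(String encoded) {
        int sep = encoded.indexOf(KEY_MARKER);
        if (!encoded.startsWith(SEQ_PREFIX) || sep < 0) {
            throw new IllegalArgumentException("Not a sequence key: " + encoded);
        }
        return sep;
    }

    public static void main(String[] args) {
        String encoded = encode(12345L, "cust-001");
        System.out.println(encoded);
        System.out.println(sequenceOf(encoded) + " / " + keyOf(encoded));
    }
}
```

A string key like this is what ends up in the map_key column, which is why that column is a VARCHAR rather than a composite of typed fields.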

There’s no DLQ MapStore adapter. The DLQ can’t use MapStore at all — more on why below.

PostgreSQL Implementation (framework-postgres)

PostgresEventStorePersistence uses JPA for single-record operations and JdbcTemplate.batchUpdate() for batches. Events use ON CONFLICT DO NOTHING (append-only — if it’s already there, leave it alone). Views and outbox entries use ON CONFLICT DO UPDATE (upsert — latest state wins).

Flyway manages the schema:

CREATE TABLE domain_events (
    map_name       VARCHAR(255) NOT NULL,
    map_key        VARCHAR(512) NOT NULL,
    aggregate_id   VARCHAR(255) NOT NULL,
    sequence       BIGINT NOT NULL,
    event_type     VARCHAR(255) NOT NULL,
    event_data     JSONB NOT NULL,
    timestamp_millis BIGINT NOT NULL,
    correlation_id VARCHAR(255),
    created_at     TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (map_name, map_key)
);

In-Memory Fallback (framework-core)

Each of the four persistence interfaces has a ConcurrentHashMap-backed in-memory implementation: InMemoryEventStorePersistence, InMemoryViewStorePersistence, InMemoryOutboxStorePersistence, InMemoryDlqStorePersistence. When framework.persistence.enabled=true but no PostgreSQL driver is on the classpath, the auto-configuration falls back to these. The persistence pipeline runs — good for testing the wiring — without requiring an actual database.

How Write-Behind Works

The timeline of a single event being persisted:

Figure: timeline of a single event. The service receives the request, creates the event, calls IMap.put and responds to the client within milliseconds; roughly five seconds later the Hazelcast write-behind timer fires and the MapStore issues a batched JDBC INSERT into PostgreSQL.

The service responds in milliseconds. The database write happens about five seconds later (the default write-delay-seconds), batched with other events.

MapStore Behavior by Map Type

Aspect	Event Store (_ES)	View Store (_VIEW)	Outbox (framework_OUTBOX)
Write semantics	INSERT (append-only)	UPSERT (latest wins)	UPSERT (status transitions)
Coalescing	Disabled — each event is unique	Enabled — only latest state per key	Enabled — only latest status per entry
Initial load	LAZY — events loaded on demand	EAGER — all keys loaded on cold start	LAZY — non-delivered entries on demand

Coalescing is worth explaining. If a customer’s address changes three times during the five-second write-behind window, only the final state gets persisted. That’s correct for views — they represent current state, not history. Events are never coalesced because each one is a distinct historical fact. The outbox coalesces because entries transition through statuses (PENDING → CLAIMED → DELIVERED) and only the latest status matters for recovery.
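The difference is easiest to see side by side. This is a simplified model of a write-behind buffer, not Hazelcast's actual queue: with coalescing on, only the latest value per key survives until the flush; with it off, every write is kept in order. The class name WriteBehindBufferSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of a write-behind buffer; illustration only.
class WriteBehindBufferSketch {
    private final boolean coalescing;
    private final List<String> queued = new ArrayList<>();                  // every write, in order
    private final Map<String, String> latestByKey = new LinkedHashMap<>();  // latest write per key

    WriteBehindBufferSketch(boolean coalescing) {
        this.coalescing = coalescing;
    }

    void put(String key, String value) {
        if (coalescing) {
            latestByKey.put(key, value);     // overwrites any earlier pending write
        } else {
            queued.add(key + "=" + value);   // each write is a separate row
        }
    }

    /** Returns the writes one flush would send to the database and clears the buffer. */
    List<String> flush() {
        List<String> batch = new ArrayList<>(queued);
        for (Map.Entry<String, String> e : latestByKey.entrySet()) {
            batch.add(e.getKey() + "=" + e.getValue());
        }
        queued.clear();
        latestByKey.clear();
        return batch;
    }

    public static void main(String[] args) {
        for (boolean coalescing : new boolean[] {true, false}) {
            WriteBehindBufferSketch buffer = new WriteBehindBufferSketch(coalescing);
            buffer.put("cust-001", "address-1");
            buffer.put("cust-001", "address-2");
            buffer.put("cust-001", "address-3");
            System.out.println("coalescing=" + coalescing + " -> " + buffer.flush());
        }
    }
}
```

With coalescing the flush contains one row (address-3); without it, all three. For a view that is three times less database work; for an event store it would be silent data loss, which is why the event maps keep it disabled.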

Bounded Memory with Eviction

Persistence unlocks something else: IMap eviction. Without a backing store, evicting an entry means losing it permanently. With a MapStore behind the map, evicted entries can be reloaded on demand via MapLoader.load().

This turns IMaps into bounded hot caches:

framework:
  persistence:
    enabled: true
    event-store-eviction:
      enabled: true
      max-size: 10000        # per node
      eviction-policy: LRU
    view-store-eviction:
      enabled: true
      max-size: 10000
      eviction-policy: LRU
      max-idle-seconds: 3600  # evict views idle > 1 hour

When the map reaches 10,000 entries, the least recently used ones get evicted. If a subsequent get() hits an evicted key, Hazelcast calls MapLoader.load(), reads from PostgreSQL, and puts the entry back. The service code never knows the difference — it’s the same IMap.get() call either way.

Memory stays bounded. No OOMKill after hours of continuous load. Hot data stays in-memory at sub-millisecond latency. Cold data reloads transparently.
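Here is the evict-then-reload cycle in miniature. It is a sketch under stated assumptions: a single-threaded LRU cache built on LinkedHashMap, with a plain Function standing in for MapLoader.load() and a HashMap standing in for PostgreSQL. The class name BoundedHotCacheSketch is hypothetical, and Hazelcast's own eviction is partition-aware and more sophisticated than this.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Single-threaded illustration of a bounded hot cache with load-on-miss.
class BoundedHotCacheSketch {
    private final Function<String, String> loader;  // stands in for MapLoader.load()
    private final LinkedHashMap<String, String> hot;
    int loads;                                      // how many reads went to the backing store

    BoundedHotCacheSketch(final int maxSize, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.hot = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxSize;            // evict the LRU entry when over capacity
            }
        };
    }

    void put(String key, String value) {
        hot.put(key, value);
    }

    String get(String key) {
        String value = hot.get(key);
        if (value == null) {
            loads++;
            value = loader.apply(key);              // cache miss: reload from the store
            if (value != null) {
                hot.put(key, value);
            }
        }
        return value;
    }

    boolean isHot(String key) {
        return hot.containsKey(key);
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();  // stands in for PostgreSQL
        BoundedHotCacheSketch cache = new BoundedHotCacheSketch(2, database::get);
        for (String key : new String[] {"a", "b", "c"}) {
            database.put(key, key.toUpperCase());
            cache.put(key, key.toUpperCase());
        }
        System.out.println("a hot? " + cache.isHot("a"));
        System.out.println("get(a)=" + cache.get("a") + " loads=" + cache.loads);
    }
}
```

The caller of get() can't tell whether the value came from memory or from the store, which is the property that lets the service code stay unchanged.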

The DLQ Exception: Direct Writes Instead of MapStore

The dead letter queue is the one map that can’t use MapStore. The reason traces directly back to the dual-instance Hazelcast architecture.

The event store, view store, and outbox all live on the embedded Hazelcast instance — the standalone one that runs Jet pipelines inside each service. MapStore is a server-side configuration: you attach it to a map on a Hazelcast member, Hazelcast calls your code when entries change. Works great because the embedded instance is a full member that the service controls.

The DLQ lives on the shared cluster: the external 3-node Hazelcast cluster that services connect to as clients. Services write to the DLQ via hazelcastClient. A MapStore has to be configured on the members that own the map, and the shared cluster nodes don't have the service's persistence beans, so a service acting as a client has no way to attach one.

So the DLQ does direct persistence writes. When HazelcastDeadLetterQueue.add(), replay(), or discard() is called, it writes to the IMap and calls DlqStorePersistence.persist() in the same method:

public void add(DeadLetterEntry entry) {
    GenericRecord record = toGenericRecord(entry);
    dlqMap.set(entry.getId(), record);
    persistIfAvailable(entry);
}

Synchronous, not write-behind. The trade-off is fine: DLQ entries are rare — they represent failures — so a database write in the hot path is negligible. If persistence itself fails, the entry is still in the IMap. The failure gets logged, but it doesn’t block the DLQ operation.
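The "log it, don't block" behavior might look roughly like the sketch below. Everything here is an assumption for illustration: the Store interface stands in for DlqStorePersistence, a ConcurrentHashMap stands in for the IMap, and the body of persistIfAvailable is a guess at the shape rather than the framework's code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a best-effort synchronous persistence write.
class DlqWriteSketch {
    /** Stand-in for DlqStorePersistence; illustration only. */
    interface Store {
        void persist(String id, String payload);
    }

    private final Map<String, String> dlqMap = new ConcurrentHashMap<>();  // stands in for the IMap
    private final Store store;   // null when persistence is disabled
    int persistFailures;

    DlqWriteSketch(Store store) {
        this.store = store;
    }

    void add(String id, String payload) {
        dlqMap.put(id, payload);           // the in-memory write always happens
        persistIfAvailable(id, payload);   // the durable write is best-effort
    }

    private void persistIfAvailable(String id, String payload) {
        if (store == null) {
            return;                        // persistence not configured: nothing to do
        }
        try {
            store.persist(id, payload);
        } catch (RuntimeException e) {
            // Log and carry on: a database outage must not lose the DLQ entry itself.
            persistFailures++;
            System.err.println("DLQ persistence failed for " + id + ": " + e.getMessage());
        }
    }

    boolean contains(String id) {
        return dlqMap.containsKey(id);
    }

    public static void main(String[] args) {
        DlqWriteSketch dlq = new DlqWriteSketch((id, payload) -> {
            throw new IllegalStateException("database unavailable");
        });
        dlq.add("dlq-1", "{\"eventType\":\"OrderCreated\"}");
        System.out.println("in map: " + dlq.contains("dlq-1")
                + ", persist failures: " + dlq.persistFailures);
    }
}
```

The ordering matters: writing to the map first means a persistence failure degrades durability for that one entry but never hides the failed event from operators.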

On startup, loadFromPersistence() hydrates the IMap with PENDING entries from PostgreSQL. Terminal entries (REPLAYED, DISCARDED) aren’t recovered — they’ve already been handled.

Metrics and Observability

Every MapStore operation is instrumented with Micrometer, following the same ConcurrentHashMap-cached counter/timer pattern used by PipelineMetrics and SagaMetrics:

Metric	Type	Description
persistence.store.count	Counter	Single write operations
persistence.store.batch.count	Counter	Batch write operations
persistence.store.batch.entries	Counter	Total entries across all batches
persistence.load.count	Counter	Load operations (cache misses)
persistence.load.miss	Counter	Load misses (not in DB either)
persistence.delete.count	Counter	Delete operations
persistence.errors	Counter	Errors by operation
persistence.store.duration	Timer	Write latency (p50/p95/p99)
persistence.load.duration	Timer	Load latency (p50/p95/p99)

A pre-built Grafana dashboard (persistence-dashboard.json) auto-provisions alongside the existing ones and shows throughput by map, latency percentiles, batch sizes, and error rates.

Metrics are optional — MapStore constructors accept a nullable PersistenceMetrics parameter. No MeterRegistry in the context (unit tests, for instance), no metrics. Nothing breaks.

Zero Code Changes in Services

The design goal I cared about most: enabling persistence shouldn’t require touching business logic. The complete diff in a service’s application.yml:

# Add to any service to enable persistence
framework:
  persistence:
    enabled: true

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/ecommerce
    username: ecommerce
    password: ecommerce

And add framework-postgres as a Maven dependency. That’s it.

The auto-configuration chain handles everything else. PostgresPersistenceAutoConfiguration detects the PostgreSQL driver and creates persistence beans for all four map types. PersistenceAutoConfiguration creates MapStore adapters for event, view, and outbox maps and wires them to the persistence beans. Each service’s config class detects the adapters via @Autowired(required = false) and attaches them to the IMap configurations. DeadLetterQueueAutoConfiguration passes the optional DlqStorePersistence bean directly to HazelcastDeadLetterQueue. Hazelcast handles the write-behind scheduling, batching, and MapLoader callbacks for the MapStore-backed maps.

If framework-postgres isn’t on the classpath, the in-memory fallback kicks in. If framework.persistence.enabled is false (the default), nothing changes at all.

Custom Providers

Swapping PostgreSQL for another database means implementing four interfaces: EventStorePersistence, ViewStorePersistence, OutboxStorePersistence, and DlqStorePersistence. Create an @AutoConfiguration class with @AutoConfigureBefore(PersistenceAutoConfiguration.class), register the beans as @ConditionalOnMissingBean, and list the class in META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports.

The in-memory implementations are about 50 lines each — they serve as a decent reference.

What We Built

Component	Purpose
EventStorePersistence / ViewStorePersistence	Provider-agnostic interfaces (event store, views)
OutboxStorePersistence / DlqStorePersistence	Provider-agnostic interfaces (outbox, DLQ)
PersistableEvent / PersistableView	Portable records (decoupled from GenericRecord)
PersistableOutboxEntry / PersistableDeadLetterEntry	Portable records (outbox, DLQ)
EventStoreMapStore / ViewStoreMapStore / OutboxMapStore	Hazelcast MapStore adapters (write-behind)
GenericRecordJsonConverter	Compact GenericRecord to/from JSON
PostgresEventStorePersistence / PostgresViewStorePersistence	PostgreSQL implementation (events, views)
PostgresOutboxStorePersistence / PostgresDlqStorePersistence	PostgreSQL implementation (outbox, DLQ)
InMemory*Persistence (x4)	Development/test fallback for all map types
PersistenceProperties	Spring Boot configuration (write-delay, batch size, eviction)
PersistenceMetrics	Micrometer counters and timers
PersistenceAutoConfiguration	Auto-wiring with fallback chain

Configuration Reference

framework:
  persistence:
    enabled: true                  # Master switch (default: false)
    write-delay-seconds: 5         # Batch window (default: 5)
    write-batch-size: 100          # Max entries per batch (default: 100)
    write-coalescing: false        # Coalesce writes (default: false)
    initial-load-mode: LAZY        # LAZY or EAGER (default: LAZY)
    event-store-eviction:
      enabled: true                # Enable eviction (default: true)
      max-size: 10000              # Max entries per node
      max-size-policy: PER_NODE
      eviction-policy: LRU
      max-idle-seconds: 0          # 0 = no idle eviction
    view-store-eviction:
      enabled: true
      max-size: 10000
      max-size-policy: PER_NODE
      eviction-policy: LRU
      max-idle-seconds: 3600       # Evict idle views after 1 hour

The framework now has a complete data lifecycle: events created in-memory for speed, persisted to PostgreSQL for durability, evicted when memory is constrained, reloaded on demand. The in-memory event sourcing performance is unchanged — PostgreSQL is strictly write-behind, never in the hot path.

The Persistence Guide has the full reference including PostgreSQL setup, custom provider implementation, eviction tuning, and troubleshooting.

Previous: Saga Orchestration vs Choreography on Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 13, 2026

Saga Orchestration vs Choreography on Hazelcast

Part 10 in the “Building Event-Driven Microservices with Hazelcast” series

Back in Part 4, we built a choreographed saga for order fulfillment. Four services — Order, Inventory, Payment, and Account — coordinate through Hazelcast ITopic events. Each service reacts independently, no central coordinator. The flow is implicit, spread across three saga listeners, a compensation registry, and a timeout detector.

That works. It works well, actually, for loosely coupled flows where services don’t need to know about each other. But I kept running into the same questions: what if you need the whole saga visible in one place? What if you need per-step timeout and retry? What if the caller wants to wait for the saga to finish before responding?

So the framework now supports a second saga architectural pattern — the orchestrated saga. This post compares both, walks through the implementation, and shows how they run side by side in the same system.

When to Use Which

Neither pattern wins across the board. It depends on what you need:

Requirement	Choreography	Orchestration
Services should be fully decoupled	Best
Need the whole flow in one readable file		Best
Caller needs a synchronous response		Best
High throughput (thousands of sagas/sec)	Best
Per-step timeout and retry		Best
Services evolve independently	Best
Complex branching or conditional logic		Best
No single point of failure	Best

Choreography is the better fit when services publish events for many consumers, not just one saga. Adding a new saga consumer doesn’t require changing any existing service — you just stand up a new listener.

Orchestration is the better fit when the saga is a well-defined workflow with a clear owner and the caller (a REST endpoint, typically) wants to return the result directly. Order fulfillment is a textbook example.

Architecture Comparison

Choreographed Flow

Choreographed saga flow: the caller receives 202 Accepted immediately from the Order Service, then OrderCreated, StockReserved, and PaymentProcessed events propagate as asynchronous Hazelcast ITopic messages through Inventory, Payment, and Account services with no central coordinator

Every arrow is an asynchronous event on the shared Hazelcast cluster. The caller gets back a 202 Accepted immediately. The saga completes whenever it completes.

Orchestrated Flow

Orchestrated saga flow: the caller invokes the Saga Orchestrator and waits for the final result while the orchestrator makes synchronous HTTP calls for CreateOrder, ReserveStock, ProcessPayment, and ConfirmOrder to the Order, Inventory, Payment, and Account services

Every arrow is a synchronous HTTP call. The orchestrator waits for each step to finish before moving to the next. The caller gets the final result — success or failure — in the response.

The visual difference tells you most of what you need to know. Choreography is a chain. Orchestration is a hub with spokes.

Implementation Comparison

Choreography: Saga Listeners

In the choreographed pattern, each service has a listener subscribed to events on the shared Hazelcast cluster:

@Component
public class InventorySagaListener {

    public InventorySagaListener(
            @Qualifier("hazelcastClient") HazelcastInstance hazelcast,
            InventoryService inventoryService,
            SagaStateStore sagaStateStore) {

        ITopic<GenericRecord> topic = hazelcast.getTopic("OrderCreated");
        topic.addMessageListener(message -> {
            GenericRecord event = message.getMessageObject();
            String sagaId = event.getString("sagaId");

            // Guard: only process OrderFulfillment sagas
            if (!"OrderFulfillment".equals(event.getString("sagaType"))) return;

            // Perform local action
            inventoryService.reserveStockForSaga(productId, quantity, ...);

            // Update saga state and publish next event
            sagaStateStore.updateOrAddStep(sagaId, 1, StepStatus.COMPLETED);
            hazelcast.getTopic("StockReserved").publish(nextEvent);
        });
    }
}

Three listeners across three services, wired together only by event names. To understand the full flow, you read code in three different modules. Compensation is handled by a CompensationRegistry that maps forward events to their compensating counterparts:

registry.register("OrderCreated", "OrderCancelled", "order-service");
registry.register("StockReserved", "StockReleased", "inventory-service");
registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");

Orchestration: SagaDefinition Builder

The orchestrated version puts the entire saga in one place:

@Component
public class OrderFulfillmentSagaFactory {

    public SagaDefinition create() {
        return SagaDefinition.builder()
                .name("OrderFulfillmentOrchestrated")

                .step("CreateOrder")
                    .action(this::createOrderAction)
                    .compensation(this::createOrderCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ReserveStock")
                    .action(this::reserveStockAction)
                    .compensation(this::reserveStockCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ProcessPayment")
                    .action(this::processPaymentAction)
                    .compensation(this::processPaymentCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ConfirmOrder")
                    .action(this::confirmOrderAction)
                    .noCompensation()
                    .timeout(Duration.ofSeconds(10))
                    .build()

                .sagaTimeout(Duration.ofSeconds(60))
                .build();
    }
}

Four steps, forward actions, compensations, timeouts — all readable in one file. You pay for that readability: the Order Service now has direct knowledge of the Inventory and Payment services’ HTTP endpoints.

The Orchestrator State Machine

HazelcastSagaOrchestrator is the engine that executes a SagaDefinition. The execution flow:

HazelcastSagaOrchestrator execution flow: start records the saga and schedules a 60-second timeout, then each step runs its action under a 15-second timeout, branching to SUCCESS which merges context and executes the next step, FAILURE after retries which compensates in reverse order, or TIMEOUT which compensates in reverse order

Internally, a SagaExecution instance tracks the running state: current step index, completed step names, an AtomicBoolean for compensation (preventing a race between step failure and the saga-level timeout firing at the same moment), and step start timestamps for duration metrics.
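To make the AtomicBoolean guard concrete, here's a minimal sketch of the idea rather than the framework's actual SagaExecution. The class, field, and method names are hypothetical; the only real API is AtomicBoolean.compareAndSet, which lets exactly one of the two competing paths (step failure or saga-level timeout) start compensation.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; the framework's SagaExecution is not shown in this post.
final class CompensationGuardSketch {

    private final AtomicBoolean compensationStarted = new AtomicBoolean(false);
    private final AtomicInteger compensationRuns = new AtomicInteger();
    private volatile String winningTrigger;

    /**
     * Called from both the step-failure path and the saga-timeout path.
     * Returns true only for the caller that actually starts compensation.
     */
    boolean startCompensation(String trigger) {
        // compareAndSet flips false -> true for exactly one caller, even under a race
        if (!compensationStarted.compareAndSet(false, true)) {
            return false; // the other path already owns compensation
        }
        winningTrigger = trigger;
        compensationRuns.incrementAndGet(); // stand-in for the real rollback work
        return true;
    }

    int compensationRuns() {
        return compensationRuns.get();
    }

    String winningTrigger() {
        return winningTrigger;
    }
}
```

Whichever path loses the compareAndSet simply returns, so the rollback work runs once no matter how close together the failure and the timeout fire.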

Per-Step Retry

Each SagaStep can configure maxRetries and retryDelay. When a step fails, the orchestrator checks if retries remain. If so, it waits retryDelay milliseconds and re-executes. If not, compensation kicks in.
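The retry loop can be sketched in plain Java. This is an illustration under assumptions, not the orchestrator's code: StepRetrySketch, runWithRetry, and StepFailedException are hypothetical names, and retryDelay is modeled here as a Duration.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical helper; not part of HazelcastSagaOrchestrator.
final class StepRetrySketch {

    /** Signals that the first attempt and every retry failed, so compensation should start. */
    static final class StepFailedException extends RuntimeException {
        StepFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the action once, then up to maxRetries more times, pausing retryDelay between attempts. */
    static <T> T runWithRetry(Supplier<T> action, int maxRetries, Duration retryDelay) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (attempt > 0) {
                pause(retryDelay); // wait before re-executing the step
            }
            try {
                return action.get();
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and check whether retries remain
            }
        }
        throw new StepFailedException(
                "Step failed after " + (maxRetries + 1) + " attempts", lastFailure);
    }

    private static void pause(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new StepFailedException("Interrupted while waiting to retry", e);
        }
    }
}
```

In this sketch the caller treats StepFailedException as the signal to begin compensation. A fixed delay keeps the example short; backoff would be a straightforward variation.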

This is separate from the Resilience4j circuit breakers that the choreographed saga listeners use. Different communication styles, different retry mechanisms.

Why HTTP Instead of ITopic?

You might wonder why the orchestrator makes HTTP calls to the other services instead of publishing events on Hazelcast ITopic.

Two reasons.

First, request-response semantics. The orchestrator needs to know whether each step succeeded before proceeding to the next. ITopic is fire-and-forget — there’s no built-in way for a publisher to wait for a consumer’s response. HTTP gives you synchronous request-response for free.

Second, our dual-instance architecture. Each service runs an embedded Hazelcast instance for Jet pipelines and a client to the shared cluster for cross-service events. Jet pipeline lambdas reference service-specific classes that can’t serialize across services — that’s the whole reason for the dual-instance design (see Part 5 for where this architecture first bit us). HTTP sidesteps Hazelcast serialization entirely. Each service processes the request in its own JVM with full access to its own classes.

The SagaServiceClient wraps these calls:

public class SagaServiceClient implements SagaServiceClientOperations {

    public OrchestratedStepResponse reserveStock(
            String productId, int quantity, String orderId) {
        // POST /api/saga/inventory/reserve-stock
        return restTemplate.postForObject(
                inventoryServiceUrl + "/api/saga/inventory/reserve-stock",
                request, OrchestratedStepResponse.class);
    }

    public OrchestratedStepResponse processPayment(
            String orderId, String customerId,
            double amount, String currency, String method) {
        // POST /api/saga/payment/process
        return restTemplate.postForObject(
                paymentServiceUrl + "/api/saga/payment/process",
                request, OrchestratedStepResponse.class);
    }
}

Each remote service exposes dedicated saga endpoints (like /api/saga/inventory/reserve-stock) that return an OrchestratedStepResponse — a success/failure envelope. The SagaServiceClient implements the SagaServiceClientOperations interface, which exists so Mockito can mock it on Java 25. (Mockito’s inline mock maker can’t mock concrete classes there. We hit this in several places — extract an interface, move on.)

Compensation: Two Approaches

Choreography: Event-Based

When a step fails, the SagaCompensator looks up the CompensationRegistry and publishes the matching compensation events via ITopic.

Each service processes its own compensation event independently. The SagaTimeoutDetector can also trigger this if a saga exceeds its deadline.
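The registry lookup can be pictured as a simple map from forward event to compensating event. The sketch below is a stand-in written for this post, not the framework's CompensationRegistry: only the three-argument register call mirrors the snippet shown earlier, while the Compensation record, lookup, and compensationsFor are hypothetical. Publishing to ITopic is left out so the example runs on the JDK alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the framework's CompensationRegistry.
final class CompensationRegistrySketch {

    /** The compensating event type and the service expected to handle it. */
    record Compensation(String eventType, String service) {}

    private final Map<String, Compensation> byForwardEvent = new LinkedHashMap<>();

    void register(String forwardEvent, String compensatingEvent, String service) {
        byForwardEvent.put(forwardEvent, new Compensation(compensatingEvent, service));
    }

    Optional<Compensation> lookup(String forwardEvent) {
        return Optional.ofNullable(byForwardEvent.get(forwardEvent));
    }

    /** Compensating events for the forward events a saga has completed, in the order given. */
    List<Compensation> compensationsFor(List<String> completedForwardEvents) {
        List<Compensation> result = new ArrayList<>();
        for (String forwardEvent : completedForwardEvents) {
            lookup(forwardEvent).ifPresent(result::add); // unregistered events are skipped
        }
        return result;
    }
}
```

A compensator built on this would publish each returned event type to its topic and let the owning service undo its own work.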

Orchestration: Lambda-Based

When a step fails, the orchestrator walks completed steps in reverse and executes their compensation lambdas directly.

No events, no registry. The compensation logic sits right next to the forward action in the SagaDefinition. The final step (ConfirmOrder) uses .noCompensation() — there’s nothing to undo once you’ve confirmed.

If a compensation step itself fails, the orchestrator marks the saga as FAILED rather than COMPENSATED. That means manual intervention. It’s not a situation you want, but at least you know about it immediately rather than discovering it later in an audit.
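Here's a rough sketch of that reverse walk, with hypothetical names throughout (ReverseCompensationSketch, CompletedStep, Outcome). One assumption worth flagging: when a compensation throws, this sketch keeps going with the remaining steps and reports FAILED at the end. The post doesn't say whether the real orchestrator continues or stops at the first failure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of reverse-order compensation; not the framework's orchestrator.
final class ReverseCompensationSketch {

    enum Outcome { COMPENSATED, FAILED }

    /** A completed step and the lambda that undoes it; null models noCompensation(). */
    record CompletedStep(String name, Runnable compensation) {}

    private final Deque<CompletedStep> completed = new ArrayDeque<>();

    void stepCompleted(String name, Runnable compensation) {
        completed.push(new CompletedStep(name, compensation)); // most recent step on top
    }

    /** Undoes completed steps newest-first, appending each compensated step name to log. */
    Outcome compensate(List<String> log) {
        Outcome outcome = Outcome.COMPENSATED;
        while (!completed.isEmpty()) {
            CompletedStep step = completed.pop();
            if (step.compensation() == null) {
                continue; // nothing to undo for this step
            }
            try {
                step.compensation().run();
                log.add(step.name());
            } catch (RuntimeException e) {
                outcome = Outcome.FAILED; // assumption: keep going, flag for manual intervention
            }
        }
        return outcome;
    }
}
```

Using a stack means the reverse ordering falls out of the data structure: the last step to complete is the first one undone.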

Running Both Simultaneously

Both patterns coexist in the same system. No interference.

Pattern	Saga Type	REST Endpoint
Choreography	OrderFulfillment	POST /api/orders
Orchestration	OrderFulfillmentOrchestrated	POST /api/orders/orchestrated

The key to coexistence is the sagaType field. Choreographed saga listeners filter on it — when the orchestrated flow creates an OrderCreated event, the InventorySagaListener ignores it because the type is “OrderFulfillmentOrchestrated”, not “OrderFulfillment”.

Both patterns write to the same SagaStateStore (a Hazelcast IMap), so you can query across both or filter:

# All sagas
curl http://localhost:8083/api/sagas

# Only choreographed
curl "http://localhost:8083/api/sagas?type=OrderFulfillment"

# Only orchestrated
curl "http://localhost:8083/api/sagas?type=OrderFulfillmentOrchestrated"

The MCP list_sagas tool supports the same filter:

list_sagas(type="OrderFulfillmentOrchestrated", status="COMPLETED")

Observability

The orchestrator records a saga.step.duration timer for every step:

private void recordStepDuration(SagaExecution exec, String stepName) {
    if (sagaMetrics != null && exec.stepStartedAt != null) {
        Duration stepDuration = Duration.between(exec.stepStartedAt, Instant.now());
        sagaMetrics.recordStepDuration(
                exec.definition.getName(), stepName, stepDuration);
    }
}

Tagged with sagaType and stepName, so you can query individual steps:

# p95 duration for the ProcessPayment step
histogram_quantile(0.95,
  rate(saga_step_duration_seconds_bucket{
    sagaType="OrderFulfillmentOrchestrated",
    stepName="ProcessPayment"
  }[5m]))

Choreographed sagas track overall saga_duration_seconds but not per-step timing — the flow is distributed across services, so there’s no single place to measure each step. That’s a genuine observability trade-off between the two patterns.

The Grafana saga dashboard has a Choreography vs Orchestration row: p50/p95 duration comparison, success/failure rates per pattern, and an orchestrated step breakdown showing where time goes across CreateOrder, ReserveStock, ProcessPayment, and ConfirmOrder.

The MCP run_demo tool includes orchestrated scenarios:

Scenario	Pattern	Expected Outcome
happy_path	Choreographed	Order confirmed via events
payment_failure	Choreographed	Stock released via compensation events
orchestrated_happy_path	Orchestrated	Order confirmed via HTTP, sync response
orchestrated_payment_failure	Orchestrated	Stock released via reverse compensation, 409 response

The Summary Table

	Choreography	Orchestration
Communication	Hazelcast ITopic events	HTTP calls
Flow definition	Distributed across listeners	Centralized in SagaDefinition
Compensation	CompensationRegistry + event publishing	Reverse-order lambda execution
Timeout handling	SagaTimeoutDetector (scheduled)	Per-step + saga-level timeouts
Response model	Async (202 Accepted)	Sync (201 Created or 409 Conflict)
Retry	Resilience4j (circuit breaker + retry)	Built-in per-step retry with delay
Metrics	Saga-level duration	Saga-level + per-step duration
Saga type	OrderFulfillment	OrderFulfillmentOrchestrated

Choreography is still the right default for most event-sourced systems — it preserves service independence and scales naturally. Orchestration earns its place when you need the flow readable in one file, synchronous responses, and fine-grained per-step control.

Both patterns share the same SagaStateStore, the same Grafana dashboards, and the same MCP tools. Pick the right one for each saga, or run both and compare.

Previous: Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 6, 2026

Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Part 9 in the “Building Event-Driven Microservices with Hazelcast” series

Over the past two articles, we built resilience into both sides of our saga communication. Part 7 added circuit breakers and retry to protect saga listeners against transient failures during event consumption. Part 8 added the transactional outbox to guarantee event delivery from producer to shared cluster.

Two gaps remain.

First: what happens when an event fails processing permanently? The circuit breaker exhausts retries. NonRetryableException gets thrown. The event is gone — all that survives is a log message. There’s no way to inspect what failed, understand why, or retry it later when someone fixes the underlying problem.

Second: what happens when the outbox delivers an event twice? At-least-once delivery means duplicates are possible. Without protection, the Inventory Service might reserve stock twice for the same order. The Payment Service might charge the customer twice.

This article covers two complementary patterns that close these gaps. The dead letter queue captures events that fail consumer-side processing, giving operators a way to inspect, replay, and discard them. The idempotency guard ensures each event is processed exactly once, even if delivered multiple times.

Put them together with the outbox and you get effectively-once semantics — at-least-once delivery on the producer side, exactly-once processing on the consumer side. That’s the gold standard for event-driven systems.

Part 1: Dead Letter Queue

The Problem

Consider this failure sequence in the Inventory Service’s saga listener:

Permanent failure without a dead letter queue: an OrderCreated event arrives, the InventorySagaListener picks it up and calls executeWithResilience, a non-retryable InsufficientStockException is thrown, the circuit breaker records the failure, a ResilienceException propagates, the error is logged, and the event is gone with only a log line surviving

That log message? That’s all you’ve got. In production, recovering from this means searching logs for the event ID, reconstructing the event payload from other sources, manually fixing whatever went wrong, and then figuring out how to re-trigger the saga step.

A dead letter queue captures the failed event — payload, failure reason, saga context, source service, everything — in a durable store that you can actually query and act on.

The DeadLetterEntry

Each DLQ entry preserves the full failure context:

public class DeadLetterEntry {

    private String dlqEntryId;       // UUID — unique DLQ identifier
    private String originalEventId;  // The event that failed
    private String eventType;        // "OrderCreated", "StockReserved", etc.
    private String topicName;        // The ITopic where the event was published
    private GenericRecord eventRecord; // The complete event payload for replay
    private String failureReason;    // Why processing failed
    private Instant failureTimestamp; // When the failure occurred
    private String sourceService;    // Which service failed ("inventory-service")
    private String sagaId;           // Saga context for tracing
    private String correlationId;    // Correlation context for tracing
    private int replayCount;         // How many times this entry has been replayed
    private Status status;           // PENDING, REPLAYED, or DISCARDED

    public enum Status {
        PENDING,    // Awaiting review or replay
        REPLAYED,   // Re-published to original topic
        DISCARDED   // Manually discarded by administrator
    }
}

Construction at the failure site uses a builder:

DeadLetterEntry.builder()
    .originalEventId(eventId)
    .eventType(record.getString("eventType"))
    .topicName("OrderCreated")
    .eventRecord(record)
    .failureReason(error.getMessage())
    .sourceService("inventory-service")
    .sagaId(record.getString("sagaId"))
    .correlationId(record.getString("correlationId"))
    .build();

The eventRecord field is the important one — it holds the complete GenericRecord that was published to the ITopic. When you replay the entry, this exact record gets re-published to the original topic, picking the saga back up where it left off.

The DeadLetterQueueOperations Interface

Same interface-extraction pattern we used for ResilientOperations and ServiceClientOperations (Java 25 Mockito can’t mock concrete classes, so we keep extracting interfaces — it’s becoming a running theme):

public interface DeadLetterQueueOperations {

    void add(DeadLetterEntry entry);

    List<DeadLetterEntry> list(int limit);

    DeadLetterEntry getEntry(String dlqEntryId);

    void replay(String dlqEntryId);

    void discard(String dlqEntryId);

    long count();
}

HazelcastDeadLetterQueue: IMap-Backed Storage

The implementation stores DLQ entries as Compact-serialized GenericRecords in a Hazelcast IMap — same pattern as the HazelcastOutboxStore:

public class HazelcastDeadLetterQueue implements DeadLetterQueueOperations {

    private static final String SCHEMA_NAME = "DeadLetterEntry";
    private final HazelcastInstance hazelcast;
    private final IMap<String, GenericRecord> dlqMap;
    private final DeadLetterQueueProperties properties;
    private final MeterRegistry meterRegistry;

    public HazelcastDeadLetterQueue(HazelcastInstance hazelcast,
                                     DeadLetterQueueProperties properties,
                                     MeterRegistry meterRegistry) {
        this.hazelcast = hazelcast;
        this.dlqMap = hazelcast.getMap(properties.getMapName());
        // ...
    }
}

The DLQ map lives on the shared cluster (falling back to the embedded instance if there’s no shared cluster), so it’s accessible from any service’s admin endpoint. You don’t need to know which service failed — query the DLQ from anywhere and you’ll see everything.

POJO-to-GenericRecord Conversion

Like the outbox store, conversion happens at the boundary:

static GenericRecord toRecord(final DeadLetterEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("dlqEntryId", entry.getDlqEntryId())
            .setString("originalEventId", entry.getOriginalEventId())
            .setString("eventType", entry.getEventType())
            .setString("topicName", entry.getTopicName())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setString("failureReason", entry.getFailureReason())
            .setInt64("failureTimestamp", entry.getFailureTimestamp().toEpochMilli())
            .setString("sourceService", entry.getSourceService())
            .setString("sagaId", entry.getSagaId())
            .setString("correlationId", entry.getCorrelationId())
            .setInt32("replayCount", entry.getReplayCount())
            .setString("status", entry.getStatus().name())
            .build();
}

Note setGenericRecord(“eventRecord”, …) — Compact serialization handles nested GenericRecords natively. The full event payload comes along for the ride without any special serialization work on our part.

Replay

This is where the DLQ earns its keep. Once you’ve figured out what went wrong and fixed it — restocked inventory, restarted a flaky service, whatever — you replay the entry:

@Override
public void replay(final String dlqEntryId) {
    final GenericRecord record = dlqMap.get(dlqEntryId);
    if (record == null) {
        throw new IllegalArgumentException("DLQ entry not found: " + dlqEntryId);
    }

    final DeadLetterEntry entry = fromRecord(record);

    if (entry.getStatus() != DeadLetterEntry.Status.PENDING) {
        throw new IllegalStateException(
                "Cannot replay entry in status " + entry.getStatus());
    }
    if (entry.getReplayCount() >= properties.getMaxReplayAttempts()) {
        throw new IllegalStateException(
                "Max replay attempts (" + properties.getMaxReplayAttempts() + ") exceeded");
    }

    // Re-publish to the original topic
    final GenericRecord eventRecord = entry.getEventRecord();
    if (eventRecord != null && entry.getTopicName() != null) {
        final ITopic<GenericRecord> topic = hazelcast.getTopic(entry.getTopicName());
        topic.publish(eventRecord);
    }

    // Update entry status
    entry.setReplayCount(entry.getReplayCount() + 1);
    entry.setStatus(DeadLetterEntry.Status.REPLAYED);
    dlqMap.set(dlqEntryId, toRecord(entry));

    meterRegistry.counter("dlq.entries.replayed").increment();
}

A few safety guards here. Only PENDING entries can be replayed — you can’t accidentally replay something that was already replayed or discarded. There’s a configurable max replay count (default 3) to prevent infinite replay loops if the underlying issue isn’t actually fixed. And if the eventRecord is somehow null (shouldn’t happen, but defensive coding), the status updates without attempting a publish.

Monitoring Queue Depth

The count() method uses a Hazelcast predicate to count only PENDING entries:

@Override
public long count() {
    final Collection<GenericRecord> pending = dlqMap.values(
            Predicates.equal("status", DeadLetterEntry.Status.PENDING.name()));
    return pending.size();
}

A DLQ count above zero for more than a few minutes is a flag that something needs attention. Wire this to an alert and you’ll know about failed events before anyone files a ticket.

Admin REST Endpoints

The DeadLetterQueueController exposes the DLQ through REST:

@RestController
@RequestMapping("/api/admin/dlq")
@Tag(name = "Dead Letter Queue")
public class DeadLetterQueueController {

    @GetMapping
    public ResponseEntity<List<DeadLetterEntry>> list(
            @RequestParam(defaultValue = "20") int limit) {
        return ResponseEntity.ok(deadLetterQueue.list(limit));
    }

    @GetMapping("/count")
    public ResponseEntity<Map<String, Long>> count() {
        return ResponseEntity.ok(Map.of("count", deadLetterQueue.count()));
    }

    @GetMapping("/{id}")
    public ResponseEntity<DeadLetterEntry> getEntry(@PathVariable String id) { ... }

    @PostMapping("/{id}/replay")
    public ResponseEntity<Map<String, String>> replay(@PathVariable String id) { ... }

    @DeleteMapping("/{id}")
    public ResponseEntity<Map<String, String>> discard(@PathVariable String id) { ... }
}

A typical investigation looks like this:

# How many pending entries?
curl http://localhost:8082/api/admin/dlq/count
# {"count": 2}

# What are they?
curl http://localhost:8082/api/admin/dlq
# [{"dlqEntryId":"abc-123", "originalEventId":"evt-456",
#   "eventType":"OrderCreated", "failureReason":"Insufficient stock for product PROD-789",
#   "sourceService":"inventory-service", "status":"PENDING", ...}]

# Get the full details on one
curl http://localhost:8082/api/admin/dlq/abc-123

# Fix the problem (restock inventory), then replay
curl -X POST http://localhost:8082/api/admin/dlq/abc-123/replay
# {"status":"replayed", "dlqEntryId":"abc-123"}

# Or discard if the saga already timed out and compensation ran
curl -X DELETE http://localhost:8082/api/admin/dlq/abc-123
# {"status":"discarded", "dlqEntryId":"abc-123"}

Integration with Saga Listeners

Each saga listener injects the DLQ as an optional dependency:

@Autowired(required = false)
public void setDeadLetterQueue(DeadLetterQueueOperations deadLetterQueue) {
    this.deadLetterQueue = deadLetterQueue;
}

Failed events get routed to the DLQ in the error handler:

private void sendToDeadLetterQueue(GenericRecord record, String topicName, Throwable error) {
    String eventId = record.getString("eventId");
    if (deadLetterQueue != null) {
        try {
            deadLetterQueue.add(DeadLetterEntry.builder()
                    .originalEventId(eventId)
                    .eventType(record.getString("eventType"))
                    .topicName(topicName)
                    .eventRecord(record)
                    .failureReason(error.getMessage())
                    .sourceService("inventory-service")
                    .sagaId(record.getString("sagaId"))
                    .correlationId(record.getString("correlationId"))
                    .build());
            logger.warn("Event {} sent to DLQ after failure: {}", eventId, error.getMessage());
        } catch (Exception dlqError) {
            logger.error("Failed to send event {} to DLQ: {}", eventId, dlqError.getMessage());
        }
    } else {
        // Fallback: existing behavior (log only)
        if (error instanceof ResilienceException) {
            logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
        } else {
            logger.error("Failed to process event: {}", eventId, error);
        }
    }
}

The try/catch around deadLetterQueue.add() is defensive. If the DLQ itself fails — shared cluster unreachable, say — we fall back to logging. The DLQ is best-effort, not a hard requirement. Losing an event and failing to capture it in the DLQ would be truly unlucky, but it shouldn’t bring the service down.

Part 2: Idempotency Guard

The Problem

The transactional outbox gives us at-least-once delivery. Combined with ITopic’s own delivery behavior (listeners that reconnect after a brief disconnection may receive messages again), the same event can arrive at a consumer more than once:

Duplicate delivery sequence: the OutboxPublisher publishes OrderCreated to the shared cluster, which forwards it to the Inventory Listener; the markDelivered call times out, so on the next poll cycle the publisher re-publishes the same event, and without protection the Inventory Listener performs a double stock reservation

Without protection, inventory gets reserved twice. The customer gets charged twice. The order gets confirmed twice. Nobody wants that.

Atomic Check-and-Claim

The fix is Hazelcast’s putIfAbsent, an atomic, cluster-wide check-and-set that lets exactly one listener claim each event ID, so each event is processed only once:

public class HazelcastIdempotencyGuard implements IdempotencyGuard {

    private final IMap<String, Long> processedEventsMap;
    private final long ttlMillis;
    private final MeterRegistry meterRegistry;

    public HazelcastIdempotencyGuard(HazelcastInstance hazelcast,
                                      IdempotencyProperties properties,
                                      MeterRegistry meterRegistry) {
        this.processedEventsMap = hazelcast.getMap(properties.getMapName());
        this.ttlMillis = properties.getTtl().toMillis();
        this.meterRegistry = meterRegistry;
    }

    @Override
    public boolean tryProcess(final String eventId) {
        Long previous = processedEventsMap.putIfAbsent(
                eventId, System.currentTimeMillis(), ttlMillis, TimeUnit.MILLISECONDS);

        boolean firstTime = (previous == null);
        meterRegistry.counter("idempotency.checks",
                "result", firstTime ? "miss" : "hit").increment();

        if (!firstTime) {
            logger.debug("Duplicate event detected: eventId={}", eventId);
        }

        return firstTime;
    }
}

The interface is one method:

public interface IdempotencyGuard {
    boolean tryProcess(String eventId);
}

Returns true if this is the first time the event ID has been seen — go ahead and process it. Returns false if someone already claimed it — skip.

How putIfAbsent Works

IMap.putIfAbsent(key, value, ttl, timeUnit) is atomic. If the key doesn’t exist, it inserts the pair and returns null. If it does exist, it returns the existing value and does nothing. This atomicity holds across cluster members — two listeners on different nodes processing the same event simultaneously will never both get null. Exactly one wins, the other backs off.
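You can see the same claim semantics on a single JVM. The sketch below uses java.util.concurrent.ConcurrentHashMap as a local stand-in for the Hazelcast IMap (an assumption made so the example runs without a cluster); PutIfAbsentRaceSketch and raceForClaim are hypothetical names. Several threads race to claim one event ID, and only the thread that gets null back counts as the winner.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Local stand-in for the cluster-wide map; ConcurrentHashMap replaces IMap here.
final class PutIfAbsentRaceSketch {

    /** Returns how many racing threads saw null from putIfAbsent, i.e. won the claim. */
    static int raceForClaim(String eventId, int threads) {
        ConcurrentMap<String, Long> processed = new ConcurrentHashMap<>();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // line everyone up so the claims collide
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // null means the key was absent and this thread inserted it
                if (processed.putIfAbsent(eventId, System.currentTimeMillis()) == null) {
                    winners.incrementAndGet();
                }
            });
            workers[i].start();
        }

        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return winners.get();
    }
}
```

However many threads you throw at it, the winner count for a single event ID stays at one, which is the property the saga listeners rely on.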

TTL: Forgetting Old Events

The putIfAbsent includes a TTL (default: 1 hour). After that, the event ID is removed from the map, and the same event could theoretically be reprocessed if it somehow arrived again.

Why an hour? It’s a memory management decision. Without a TTL, the processed events map grows forever. With a 1-hour window, we hold at most an hour’s worth of event IDs, which is bounded and predictable. Since our outbox publisher has a 1-second poll interval with 5 retries, duplicates arrive within seconds — an hour of margin is more than sufficient.
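To show what the TTL window means for deduplication, here's a small model with an injectable clock. It's a sketch under assumptions, not the HazelcastIdempotencyGuard: TtlGuardSketch is a hypothetical class, ConcurrentHashMap stands in for the IMap, and expiry is checked lazily on access instead of by Hazelcast's own eviction. It models only when a repeated ID is accepted again, not the memory reclamation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

// Hypothetical model of the dedup window; expiry is simulated, not Hazelcast's TTL eviction.
final class TtlGuardSketch {

    private final ConcurrentMap<String, Long> firstSeenMillis = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;

    TtlGuardSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** True if the ID is new or its previous claim is older than the TTL. */
    boolean tryProcess(String eventId) {
        long now = clock.getAsLong();
        boolean[] claimed = {false};
        // compute runs atomically per key, so check and claim cannot interleave
        firstSeenMillis.compute(eventId, (id, seenAt) -> {
            if (seenAt == null || now - seenAt >= ttlMillis) {
                claimed[0] = true;
                return now; // first sighting, or the old claim has expired
            }
            return seenAt; // still inside the window: duplicate
        });
        return claimed[0];
    }
}
```

With a 1-hour TTL, a duplicate arriving five seconds later is rejected, while the same ID arriving an hour later is treated as new. That second case is the theoretical reprocessing risk the TTL trades for bounded memory.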

Integration with Saga Listeners

Each saga listener checks the guard at the top of its message handler:

class OrderCreatedListener implements MessageListener<GenericRecord> {

    @Override
    public void onMessage(Message<GenericRecord> message) {
        GenericRecord record = message.getMessageObject();

        String eventId = record.getString("eventId");
        if (idempotencyGuard != null && eventId != null
                && !idempotencyGuard.tryProcess(eventId)) {
            logger.debug("Duplicate event {} already processed, skipping", eventId);
            return;
        }

        // ... proceed with normal processing
    }
}

Three checks provide graceful degradation: if idempotency isn’t configured, process everything (no deduplication). If the event doesn’t have an ID, skip the check. If tryProcess() returns false, it’s a duplicate, so drop it with nothing more than a debug log line.

The Processed Events Map

The map lives on the shared cluster, so deduplication works across all service instances:

Key	Value	TTL
evt-abc-123	1738000000000 (timestamp)	1 hour
evt-def-456	1738000001000	1 hour
evt-ghi-789	1738000002000	1 hour

The value — a processing timestamp — is purely informational. Only the key’s presence or absence matters for deduplication. But the timestamp is handy for debugging: it tells you exactly when an event was first processed.

How the Three Patterns Work Together

The outbox, DLQ, and idempotency guard form a complete reliability pipeline:

The three patterns working together: on the producer side the EventSourcingController writes to the OUTBOX IMap, the OutboxPublisher polls and publishes to the shared ITopic, then marks the entry delivered; on the consumer side the saga listener checks the IdempotencyGuard (duplicates are skipped), runs executeWithResilience, and on failure routes the event to the framework_DLQ IMap for admin replay or discard

Let’s walk through what happens when things go wrong.

The OrderCreated event comes out of the Jet pipeline and gets written to the outbox. The OutboxPublisher picks it up, publishes to the shared cluster’s OrderCreated ITopic, and tries to mark it DELIVERED. But the markDelivered call times out. Next poll cycle, the publisher re-publishes the same event. Now it’s been delivered twice.

Over on the consumer side, the Inventory Service’s OrderCreatedListener receives both copies. The first call to idempotencyGuard.tryProcess("evt-123") returns true, so the listener processes it. The second call returns false: it’s a duplicate, so the listener skips it. Only one stock reservation happens.

But that first delivery hits a problem: the product is out of stock. InsufficientStockException is non-retryable. The circuit breaker records the failure, ResilienceException propagates up to whenComplete(), and sendToDeadLetterQueue() captures everything — the full event payload, the failure reason, the saga ID, the source service. It’s all sitting in the framework_DLQ IMap, waiting.

An operator (or an LLM, as we’ll see in a moment) checks the DLQ, sees the pending entry, restocks the product, and replays the event. The OrderCreated record gets re-published to the ITopic, the saga picks up, and the order completes.

One wrinkle: the replayed event carries the same eventId as the original. If the 1-hour idempotency TTL hasn’t expired yet, the guard will block it as a duplicate. In practice this isn’t an issue — by the time you’ve investigated the failure, diagnosed the root cause, and fixed it, an hour has usually passed. It’s a deliberate trade-off: short-window deduplication versus immediate replay. We chose deduplication.

Configuration Reference

Dead Letter Queue: framework.dlq.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_DLQ	IMap name on shared cluster
max-replay-attempts	3	Maximum replays before permanent block
entry-ttl	168h	7-day retention for DLQ entries

Idempotency Guard: framework.idempotency.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_PROCESSED_EVENTS	IMap name on shared cluster
ttl	1h	How long to remember processed event IDs

Metrics

Metric	Type	Description
dlq.entries.added	Counter	Events added to the DLQ
dlq.entries.replayed	Counter	Events replayed from the DLQ
dlq.entries.discarded	Counter	Events discarded from the DLQ
idempotency.checks	Counter (tagged: result=hit|miss)	Deduplication checks

The Complete Resilience Stack

Across Parts 7, 8, and 9, we’ve built five interlocking patterns:

Pattern	Layer	Purpose	Protects Against
Circuit Breaker	Consumer	Automatic service isolation	Cascade failures
Retry + Backoff	Consumer	Transient failure recovery	Network blips, brief outages
Transactional Outbox	Producer	Guaranteed delivery	Shared cluster unavailability
Dead Letter Queue	Consumer	Failure capture and replay	Permanent processing failures
Idempotency Guard	Consumer	Exactly-once processing	Duplicate delivery

They’re all optional — enabled by default, disabled with a single property toggle. They’re all auto-configured by Spring Boot. They all expose Micrometer metrics. And when disabled, the framework falls back to its previous behavior without breaking anything.

Three articles ago, we had a fire-and-forget event pipeline where a network blip could lose an event forever. Now we have guaranteed delivery, deduplication, failure capture, and replay. Same pipeline, five patterns later.

Try It Yourself

The demo script includes a complete DLQ investigation scenario — fault injection, failure capture, investigation, and replay — in 11 guided steps:

./scripts/demo-scenarios.sh 7

That’s the curl-based version. No LLM required.

The AI-Powered Version

This is more fun. Connect the MCP server from Part 6 to your LLM client — Claude Desktop, Claude Code, ChatGPT, whatever you’ve got — and try this prompt:

“Run the DLQ investigation demo — inject a failure, place an order, and show me what’s in the dead letter queue.”

The LLM calls runDemo to set up the scenario, then listDlqEntries and inspectDlqEntry to investigate. It tells you what happened — which event failed, at which service, and why — and suggests a fix. You say “replay it.” It calls replayDlqEntry, the saga completes, and you’ve just done incident response through a conversation.

No curl commands. No JSON parsing. No copy-pasting UUIDs. The LLM handles the plumbing while you make the decisions.

If the LLM already has context from earlier in the session, a shorter version works:

“Run the dlq_investigation demo scenario and tell me what you find.”

Next up: Choreography vs Orchestration: Two Saga Patterns

Previous: Hazelcast Transactional Outbox: Guaranteed Delivery

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 29, 2026

Hazelcast Transactional Outbox: Guaranteed Delivery

Part 8 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

In Part 7, we added circuit breakers and retry to protect saga listeners from transient failures on the consumer side. That covers what happens when a service receives an event and can’t process it. But we haven’t talked about what happens when the event never leaves the building.

Quick refresher on our dual-instance architecture: each service runs an embedded Hazelcast instance for local Jet pipeline processing and a client connected to the shared cluster for cross-service ITopic communication. After the pipeline processes an event, the EventSourcingController republishes it to the shared cluster so saga listeners in other services can react.

That republish step? It was a fire-and-forget call:

// The old approach — fragile
try {
    ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
    topic.publish(pending.eventRecord);
} catch (Exception e) {
    logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
    // Event is permanently lost!
}

If the shared cluster is unreachable — network partition, cluster restart, someone tripping over the power cable — the event vanishes. The saga never progresses. Eventually the saga timeout detector marks it as failed, but by then the original event data is gone and there’s nothing to retry.

The Transactional Outbox Pattern fixes this. Instead of publishing directly to the shared cluster, the controller writes the event to a local outbox — an IMap on the embedded Hazelcast instance — and a separate publisher component picks it up and delivers it. If delivery fails, the entry stays in the outbox and gets retried.

Why Direct Publishing Fails

The problem is fundamental. Publishing to an external system (the shared cluster) and completing a local operation (the Jet pipeline) are two separate operations that can’t be made atomic.

Failure timeline for direct publishing — the Jet pipeline updates the local event store and materialized view, but the publish to the shared cluster ITopic fails on a network partition and the event is lost with nothing left to retry

The event is safely stored in the local event store and materialized view, but the cross-service notification is lost. You could retry in place, but that blocks the Jet pipeline for all events. You could schedule an async retry, but if the process restarts, that retry state is gone too.

The outbox pattern trades immediate delivery for guaranteed delivery. Write to a durable local store, deliver asynchronously, retry until it works. It’s the standard solution in event-driven architectures for good reason.

Architecture

The outbox IMap lives on the embedded Hazelcast instance — the same instance that hosts the event store and materialized views. Writing to it is a local operation. If the embedded instance is up (and it must be, since the pipeline just ran), the outbox write succeeds.

The OutboxEntry

Each outbox entry captures everything needed to deliver the event later:

public class OutboxEntry {

    private String eventId;          // Matches the domain event's eventId
    private String eventType;        // ITopic name (e.g., "OrderCreated")
    private GenericRecord eventRecord; // The serialized event to publish
    private int retryCount;          // Delivery attempts so far
    private Status status;           // PENDING, DELIVERED, or FAILED
    private Instant createdAt;       // When the entry was created
    private Instant lastAttemptAt;   // When the last delivery attempt occurred
    private String failureReason;    // Most recent failure message

    public enum Status {
        PENDING,    // Awaiting delivery
        DELIVERED,  // Successfully published to shared cluster
        FAILED      // Permanently failed after max retries
    }
}

The eventRecord field is the full GenericRecord that needs to go to the shared cluster’s ITopic — same record the Jet pipeline produces, complete with saga metadata like sagaId and correlationId.

OutboxStore: The Interface

Six methods covering the full lifecycle:

public interface OutboxStore {

    void write(OutboxEntry entry);

    List<OutboxEntry> pollPending(int maxBatchSize);

    void markDelivered(String eventId);

    void markFailed(String eventId, String reason);

    void incrementRetryCount(String eventId, String failureReason);

    long pendingCount();
}

Provider-agnostic. The Hazelcast implementation uses an IMap, but the interface could just as easily sit in front of a database table.

HazelcastOutboxStore

The Hazelcast implementation stores entries as Compact-serialized GenericRecord values in an IMap:

public class HazelcastOutboxStore implements OutboxStore {

    private static final String SCHEMA_NAME = "OutboxEntry";
    private final IMap<String, GenericRecord> outboxMap;

    public HazelcastOutboxStore(HazelcastInstance hazelcast, MeterRegistry meterRegistry) {
        // Excerpt: the DEFAULT_MAP_NAME constant and the meterRegistry wiring are not shown here
        this.outboxMap = hazelcast.getMap(DEFAULT_MAP_NAME);
    }
}

You might wonder why we’re using GenericRecord instead of storing OutboxEntry Java objects directly. The problem is that OutboxEntry has two Instant fields and a nested GenericRecord, and Hazelcast’s zero-config Compact serialization can handle neither type. We’d need a custom CompactSerializer registered on every Hazelcast instance configuration. Instead, we convert at the boundary:

static GenericRecord toRecord(final OutboxEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("eventId", entry.getEventId())
            .setString("eventType", entry.getEventType())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setInt32("retryCount", entry.getRetryCount())
            .setString("status", entry.getStatus().name())
            .setInt64("createdAt", entry.getCreatedAt().toEpochMilli())
            .setNullableInt64("lastAttemptAt",
                    entry.getLastAttemptAt() != null
                            ? entry.getLastAttemptAt().toEpochMilli() : null)
            .setString("failureReason", entry.getFailureReason())
            .build();
}

A few things going on here. Instant becomes int64 epoch millis — compact, sortable, unambiguous. lastAttemptAt uses setNullableInt64 because it’s null until the first delivery attempt. The nested eventRecord uses setGenericRecord, which Compact handles natively. And status is stored as the enum name string, which makes it readable in Management Center and queryable with Predicates.equal().
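One consequence of the epoch-millis mapping is worth spelling out: the round trip is lossy below one millisecond. The helper below is a hypothetical sketch (InstantMillisCodec is not a framework class) showing both the nullable handling and the truncation.

```java
import java.time.Instant;

/** Hypothetical helper mirroring the Instant <-> int64 conversion done at the record boundary. */
final class InstantMillisCodec {

    /** Null-safe encode, matching the setNullableInt64 usage for lastAttemptAt. */
    static Long encodeNullable(Instant instant) {
        return instant == null ? null : instant.toEpochMilli();
    }

    /** Null-safe decode; anything finer than a millisecond was already dropped by encode. */
    static Instant decodeNullable(Long epochMillis) {
        return epochMillis == null ? null : Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        Instant precise = Instant.ofEpochSecond(1_738_000_000L, 123_456_789L);
        Instant roundTripped = decodeNullable(encodeNullable(precise));
        System.out.println(precise + " -> " + roundTripped);
    }
}
```

For outbox ordering this precision is normally enough, but two entries created within the same millisecond compare as equal on createdAt after the conversion.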

Polling uses a Hazelcast predicate to filter by status, sorted by creation time so the oldest entries are delivered first:

@Override
public List<OutboxEntry> pollPending(final int maxBatchSize) {
    final Collection<GenericRecord> pending = outboxMap.values(
            Predicates.equal("status", OutboxEntry.Status.PENDING.name()));

    return pending.stream()
            .map(HazelcastOutboxStore::fromRecord)
            .sorted(Comparator.comparing(OutboxEntry::getCreatedAt))
            .limit(maxBatchSize)
            .collect(Collectors.toList());
}

The OutboxPublisher

The publisher bridges the outbox and the shared cluster. The obvious approach is to poll on a fixed interval — once per second, say — but that adds latency we don’t need. We know exactly when a new entry arrives.

Event-Driven Wake-Up

The publisher uses a Semaphore to sleep until someone signals it:

public class OutboxPublisher {

    private final Semaphore wakeUp = new Semaphore(0);

    public void notifyNewEntry() {
        // Release at most 1 permit — avoids unbounded accumulation
        if (wakeUp.availablePermits() == 0) {
            wakeUp.release();
        }
    }

    public boolean waitForWork() {
        try {
            return wakeUp.tryAcquire(
                    properties.getPollInterval().toMillis(),
                    TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

When the EventSourcingController writes an outbox entry, it calls notifyNewEntry() right after. The publisher wakes up, claims all pending entries, delivers them. Under normal conditions, the time from event creation to shared-cluster delivery is sub-millisecond.

The poll interval (default 1 second) is the safety net. If a signal gets missed — maybe the publisher was busy with a previous batch — the timeout ensures nothing sits around for too long.

This is a JVM-local semaphore, not a distributed one. That’s fine. When the service scales to multiple replicas with per-service clustering (ADR 013), each replica has its own publisher. The semaphore wakes the local publisher instantly for locally-written events. Events written by other replicas get picked up within the poll interval. The actual coordination — preventing two replicas from delivering the same event — happens in claimPending() via an atomic ClaimEntryProcessor on the IMap.
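The wake-up behavior described above (coalesced signals plus a timeout fallback) can be exercised in isolation. This is a stripped-down sketch: WakeUpSignal is a hypothetical name, and the timeout is a method parameter here instead of coming from the framework’s properties object.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical extraction of the publisher's signalling logic, with the timeout passed in. */
final class WakeUpSignal {
    private final Semaphore wakeUp = new Semaphore(0);

    /** Signals the waiting publisher; repeated signals collapse into a single permit. */
    void notifyNewEntry() {
        // Check-then-act is not atomic: two concurrent notifiers can both release.
        // The cost is one spare permit, i.e. one extra empty poll, which is harmless.
        if (wakeUp.availablePermits() == 0) {
            wakeUp.release();
        }
    }

    /** Returns true if signalled, false if the timeout elapsed or the thread was interrupted. */
    boolean waitForWork(long timeoutMillis) {
        try {
            return wakeUp.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling notifyNewEntry() three times in a row and then waitForWork(50) twice yields true and then false: the burst is coalesced into one wake-up, and the second wait falls through on the timeout. The timeout path is the same one that picks up entries written by other replicas.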

The Publish Loop

public void publishPendingEntries() {
    if (sharedHazelcast == null) {
        if (!noSharedClusterWarningLogged) {
            logger.warn("No shared Hazelcast instance — outbox delivery skipped");
            noSharedClusterWarningLogged = true;
        }
        return;
    }

    List<OutboxEntry> claimed = outboxStore.claimPending(
            properties.getMaxBatchSize(), memberUuid);

    if (claimed.isEmpty()) {
        return;
    }

    for (OutboxEntry entry : claimed) {
        try {
            ITopic<GenericRecord> topic = sharedHazelcast.getTopic(entry.getEventType());
            topic.publish(entry.getEventRecord());
            outboxStore.markDelivered(entry.getEventId());
        } catch (Exception e) {
            if (entry.getRetryCount() + 1 >= properties.getMaxRetries()) {
                outboxStore.markFailed(entry.getEventId(),
                        "Max retries exceeded: " + e.getMessage());
            } else {
                outboxStore.incrementRetryCount(entry.getEventId(), e.getMessage());
            }
        }
    }
}

Note claimPending rather than pollPending. Neither claimPending nor the CLAIMED status appears in the simplified OutboxStore and OutboxEntry listings above. The claiming mechanism uses an EntryProcessor to atomically transition entries from PENDING to CLAIMED, tagging them with the claiming member’s UUID. This prevents two publisher instances from delivering the same event, which matters once you’re running multiple replicas.
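The essence of the claim step is a per-key compare-and-set. The sketch below models it in a single JVM and rests on several assumptions: ConcurrentHashMap.replace stands in for the EntryProcessor, entries are reduced to a status string, ordering by createdAt is left out, and ClaimSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** Single-JVM model of atomic claiming; not the framework's ClaimEntryProcessor. */
final class ClaimSketch {
    private static final String PENDING = "PENDING";

    // eventId -> "PENDING" or "CLAIMED:<memberUuid>"
    private final ConcurrentHashMap<String, String> statusByEventId = new ConcurrentHashMap<>();

    void write(String eventId) {
        statusByEventId.put(eventId, PENDING);
    }

    /** Claims up to maxBatchSize PENDING entries for memberUuid and returns their IDs. */
    List<String> claimPending(int maxBatchSize, String memberUuid) {
        List<String> claimed = new ArrayList<>();
        for (String eventId : statusByEventId.keySet()) {
            if (claimed.size() >= maxBatchSize) {
                break;
            }
            // Atomic per key: succeeds only if the entry is still PENDING at this instant
            if (statusByEventId.replace(eventId, PENDING, "CLAIMED:" + memberUuid)) {
                claimed.add(eventId);
            }
        }
        return claimed;
    }
}
```

If member A claims a batch and member B polls right afterwards, B only receives entries that A did not reach. No event ID can end up in both lists, because the PENDING-to-CLAIMED transition succeeds at most once per key.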

When no shared cluster is configured (single-node dev mode), the publisher logs one warning and stops trying. Events pile up as PENDING in the outbox. They’ll drain as soon as a shared cluster appears.

Retry escalation is per-entry:

Attempt 1: fails → incrementRetryCount (retryCount=1)
Attempt 2: fails → incrementRetryCount (retryCount=2)
...
Attempt 5: fails → markFailed (retryCount + 1 = 5 >= maxRetries = 5)

Once marked FAILED, the entry stops showing up in claim results. The failure reason is preserved for debugging.
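The escalation rule is just the condition from the catch block in the publish loop. Pulled out into a small function (RetryEscalation and Outcome are hypothetical names used only for this illustration), it looks like this:

```java
/** Hypothetical extraction of the publish loop's failure handling decision. */
final class RetryEscalation {
    enum Outcome { RETRY_LATER, FAILED }

    /**
     * Mirrors the check in the catch block: the attempt that just failed counts
     * toward the limit, hence retryCountBeforeAttempt + 1.
     */
    static Outcome onFailure(int retryCountBeforeAttempt, int maxRetries) {
        return (retryCountBeforeAttempt + 1 >= maxRetries)
                ? Outcome.FAILED
                : Outcome.RETRY_LATER;
    }

    public static void main(String[] args) {
        int maxRetries = 5;
        for (int retryCount = 0; retryCount < maxRetries; retryCount++) {
            System.out.println("attempt " + (retryCount + 1) + " fails -> "
                    + onFailure(retryCount, maxRetries));
        }
    }
}
```

With maxRetries set to 5, the first four failures return RETRY_LATER and the fifth returns FAILED, so max-retries bounds the total number of delivery attempts rather than the number of retries after the first attempt.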

Scheduling

OutboxAutoConfiguration hooks the publisher into Spring’s task scheduler:

@EnableScheduling
public class OutboxAutoConfiguration implements SchedulingConfigurer {

    @Override
    public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
        taskRegistrar.addFixedDelayTask(() -> {
            outboxPublisher.waitForWork();       // blocks until signaled or timeout
            outboxPublisher.publishPendingEntries();
        }, 1);  // 1ms loop delay — actual timing controlled by semaphore
    }
}

The 1ms fixed delay means the loop restarts almost immediately after each cycle, but waitForWork() controls the actual pacing. The thread blocks on the semaphore until either a permit is released or the poll interval elapses. Near-instant delivery under normal load, guaranteed pickup if a signal is missed.

Integration with EventSourcingController

The controller’s republishToSharedCluster now checks for an outbox store first:

private void republishToSharedCluster(PendingCompletion<K> pending) {
    if (sharedHazelcast == null || pending.eventRecord == null || pending.eventType == null) {
        return;
    }
    if (outboxStore != null) {
        OutboxEntry entry = new OutboxEntry(
                pending.completionInfo.getEventId(),
                pending.eventType,
                pending.eventRecord
        );
        outboxStore.write(entry);
        if (outboxPublisher != null) {
            outboxPublisher.notifyNewEntry();
        }
    } else {
        // Legacy direct publish (when outbox is disabled)
        try {
            ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
            topic.publish(pending.eventRecord);
        } catch (Exception e) {
            logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
        }
    }
}

Fully backward compatible. When outboxStore is injected, events go through the durable path. When it’s null, you get the old fire-and-forget behavior. The OutboxStore is wired through each service’s config as an optional dependency:

@Bean
public EventSourcingController<Order, String, DomainEvent<Order, String>> orderController(
        HazelcastInstance hazelcastInstance,
        @Qualifier("hazelcastClient") HazelcastInstance hazelcastClient,
        @Autowired(required = false) OutboxStore outboxStore,
        ...) {
    return EventSourcingController.builder()
            .hazelcast(hazelcastInstance)
            .sharedHazelcast(hazelcastClient)
            .outboxStore(outboxStore)
            .build();
}

Delivery Guarantees

The outbox provides at-least-once delivery. If the publisher crashes after publishing to the ITopic but before calling markDelivered(), the next cycle picks up the same entry and delivers it again. Events are never lost as long as the embedded Hazelcast instance’s IMap data is intact.

At-least-once means consumers may see duplicates. That’s where the Idempotency Guard from Part 9 comes in — it deduplicates on the consumer side, complementing the outbox’s guaranteed delivery.

As for ordering: events for the same aggregate are written to the outbox in sequence order (the Jet pipeline processes them sequentially), and claimPending sorts by createdAt. But if two events are pending simultaneously and the first one fails while the second succeeds, they’ll arrive out of order. For our saga use case that’s acceptable — each step is identified by sagaId and eventType, and the saga state machine handles duplicates and out-of-order delivery.

Configuration

framework.outbox.*

Property	Default	Description
enabled	true	Master toggle for the outbox pattern
poll-interval	1000 (ms)	Fallback interval if signal is missed
max-batch-size	50	Maximum entries per poll cycle
max-retries	5	Delivery attempts before permanent failure
entry-ttl	24h	How long DELIVERED entries survive in the map

Metrics

Metric	Type	Description
outbox.entries.written	Counter	Events written to the outbox
outbox.entries.delivered	Counter	Events delivered to shared cluster
outbox.entries.failed	Counter	Events permanently failed
outbox.publish.duration	Timer	Time per publish cycle

To disable the outbox and use direct publishing:

framework:
  outbox:
    enabled: false

What’s Next

The outbox guarantees events reach the shared cluster. But what happens when they get there and the consumer can’t process them? The consumer might crash, the business logic might throw, the circuit breaker might be open.

In Part 9, we add two patterns that work together: a Dead Letter Queue that captures events that fail consumer-side processing, and an Idempotency Guard that prevents duplicate processing — the natural flip side of at-least-once delivery.

Next up: Dead Letter Queues and Idempotency

Previous: Circuit Breakers and Retry: Resilient Hazelcast Sagas

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 22, 2026

Circuit Breakers and Retry: Resilient Hazelcast Sagas

Part 7 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

A commercial airliner doesn’t fall out of the sky when an engine fails. It keeps flying. The remaining engine provides enough thrust to reach the nearest airport, the crew follows a well-rehearsed procedure, and the passengers, ideally, never know how close things got. Aviation engineers figured this out decades ago: you can’t prevent every failure, so you build the system to keep working when parts of it stop. (There’s even a great acronym for it: ETOPS, which officially stands for Extended-range Twin-engine Operations Performance Standards, but which pilots will tell you really means “Engines Turn Or Passengers Swim.”)

Microservices need the same philosophy. Not because individual services fail as dramatically as a jet engine, but because they fail far more often. A garbage collection pause. A network blip. A downstream provider having a bad day. A deployment rolling through the cluster at 2 AM. In a monolith, these are minor hiccups — the kind of thing you might not even notice in the logs. In a distributed system where five services coordinate through asynchronous events, a hiccup in one service can propagate to all five in the time it takes to brew a cup of coffee.

And the ways things go wrong are… creative. The catalog of distributed system failure modes is large enough to fill a textbook. Several textbooks, actually — and people have. Too many for a single pattern or a single blog post.

So we’re spending the next three posts on resilience. This one covers circuit breakers and retry — protecting saga listeners when downstream services misbehave. Part 8 tackles the transactional outbox pattern, which guarantees events aren’t lost between producer and consumer. And Part 9 adds dead letter queues and idempotency guards — the safety nets for events that fail permanently or arrive more than once. Three different failure modes, three different mechanisms.

Back in Part 4, we built a choreographed saga for order fulfillment. Three services — Inventory, Payment, and Order — coordinate through Hazelcast ITopic events published on a shared cluster. The happy path works beautifully. Without resilience patterns, though, a single struggling service can drag the whole saga down with it. A slow Payment Service fills up the Inventory Service’s thread pool with blocked calls. A transient network error permanently loses an event. A burst of failures overwhelms everything simultaneously.

That’s what we’re fixing.

The Problem: Cascading Failures

Here’s the order fulfillment saga on a good day:

Order fulfillment saga happy path — Inventory, Payment, and Order services exchanging OrderCreated, StockReserved, PaymentProcessed, and OrderConfirmed events over Hazelcast ITopic

Each step is an ITopic message on the shared Hazelcast cluster. Each listener calls a local service method — IMap operations, Jet pipeline processing, further ITopic publishing. Events flow, state updates, everyone’s happy.

Now imagine the Payment Service is having a rough morning. Some downstream payment provider is dragging, and every StockReserved event that arrives takes 30 seconds to process instead of the normal 50 milliseconds. Without any resilience mechanism, here’s what unfolds:

Inventory keeps publishing StockReserved events at the normal rate
Payment’s listener thread pool fills up with slow calls
New events queue behind the blocked threads
ITopic backpressure eventually slows the shared cluster itself
Other listeners on the same cluster — including Inventory and Order — start seeing delays
The entire saga grinds to a halt

One service had a problem. Now every service has a problem. This is a cascade failure, and it’s the defining hazard of distributed architectures. The shared communication fabric that makes coordination possible is the same fabric that propagates failure.

Enter Resilience4j

The patterns we need — circuit breakers, retry with backoff, bulkheads, rate limiters — have been well understood for years. Netflix popularized them in the Java world with Hystrix, which became the standard library for microservice resilience through most of the 2010s. But Netflix put Hystrix into maintenance mode in 2018 and eventually stopped development entirely.

The successor that emerged is Resilience4j. It’s a lightweight fault tolerance library built around functional composition (the 1.x line targets Java 8+, while 2.x requires Java 17): you wrap a Supplier or Runnable with decorators, and the decorators handle the resilience logic. It’s not just a circuit breaker library, though that’s what most people know it for. It actually provides six core modules: circuit breaker, retry, bulkhead (resource isolation), rate limiter, time limiter, and cache. Each is standalone. You pick what you need and leave the rest on the shelf.

There are other options — Failsafe is a solid zero-dependency alternative, and Alibaba’s Sentinel targets high-traffic rate limiting scenarios. But Resilience4j has become the de facto choice for Spring Boot microservices. The Spring integration is mature, Micrometer metrics work out of the box, and @ConfigurationProperties binding means your resilience settings live in the same YAML as everything else. For our framework, we’re using two of the six modules: CircuitBreaker and Retry.

Circuit Breakers: Automatic Service Isolation

A circuit breaker does what it sounds like. It monitors the failure rate of an operation and automatically stops calling it when failures exceed a threshold — the same idea as the breaker panel in your house. Too much current flows through the circuit, the breaker trips, the wiring doesn’t catch fire. In our case, “too much current” means too many failed calls, and “the wiring” is every other service sharing that communication path.

Three States

Circuit breaker state machine — CLOSED trips to OPEN when the failure rate crosses the threshold, OPEN moves to HALF-OPEN after the wait duration, and HALF-OPEN returns to CLOSED on success or back to OPEN on failure

CLOSED is normal operation. All calls pass through, and the circuit breaker quietly records outcomes in a sliding window. OPEN means the breaker has tripped — all calls are immediately rejected with a CallNotPermittedException, and no load reaches the downstream service at all. HALF-OPEN is the recovery probe: a limited number of test calls pass through. If they succeed, the breaker returns to CLOSED. If they fail, back to OPEN. Rinse and repeat until the downstream service gets its act together.
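The three states fit in a few dozen lines. The sketch below is deliberately simpler than Resilience4j and should not be read as its implementation: it trips on consecutive failures instead of a failure rate over a sliding window, it does not limit the number of HALF_OPEN probes, and the clock is passed in so transitions are deterministic. TinyCircuitBreaker is a hypothetical name.

```java
/** Simplified state machine for illustration; Resilience4j's breaker is rate-based and far richer. */
final class TinyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long waitMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    TinyCircuitBreaker(int failureThreshold, long waitMillis) {
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
    }

    /** True if a call may proceed; moves OPEN to HALF_OPEN once the wait duration has elapsed. */
    synchronized boolean allowCall(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitMillis) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    /** A success closes the breaker (from HALF_OPEN) or keeps it closed. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed probe, or too many consecutive failures, opens the breaker. */
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

With a threshold of 3 and a 1-second wait, three failures at t=0 open the breaker, a call at t=500 is rejected, and a call at t=1000 is let through as a probe. If that probe fails, the breaker reopens for another second. If it succeeds, the breaker closes.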

The Framework’s ResilientServiceInvoker

Rather than sprinkling Resilience4j decorators at every call site, we centralized everything into ResilientServiceInvoker:

public class ResilientServiceInvoker implements ResilientOperations {

    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RetryRegistry retryRegistry;
    private final ResilienceProperties properties;

    public <T> T execute(final String name, final Supplier<T> operation) {
        if (!properties.isEnabled()) {
            return operation.get();
        }

        final CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(name);
        final Retry retry = retryRegistry.retry(name);

        final Supplier<T> decoratedSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
                Retry.decorateSupplier(retry, operation));

        try {
            return decoratedSupplier.get();
        } catch (CallNotPermittedException e) {
            logger.warn("Circuit breaker '{}' is OPEN — rejecting call", name);
            throw new ResilienceException(
                    "Circuit breaker '" + name + "' is open, call rejected", name, e);
        } catch (Exception e) {
            logger.error("Operation '{}' failed after retries: {}", name, e.getMessage());
            throw new ResilienceException(
                    "Operation '" + name + "' failed after retries", name, e);
        }
    }
}

A few things to notice here. Each call to execute("inventory-stock-reservation", …) creates or retrieves a circuit breaker and retry instance with that name. This means each saga step gets its own independent circuit breaker, so a payment failure won’t trip the inventory breaker.

The decoration order matters: retry wraps the operation first, then the circuit breaker wraps the retry. So the circuit breaker sees the final outcome after all retries are exhausted. A transient failure that succeeds on the second attempt counts as a success for the circuit breaker. If you stacked them the other way around, every individual failed attempt would register as a circuit breaker failure, and you’d trip the breaker much faster than you intended.

And there’s a kill switch. When framework.resilience.enabled=false, the execute method just calls the operation directly. Zero overhead. This matters for testing and for environments where resilience is handled at a different layer — a service mesh, maybe, or a cloud provider’s load balancer.
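To see why the decoration order matters, you can reproduce the effect with plain Supplier decorators. Nothing below uses Resilience4j: withRetry and countingFailures are hypothetical stand-ins, and the failure counter plays the role of the breaker’s bookkeeping.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Plain-Java illustration of decorator ordering; stand-ins, not Resilience4j classes. */
final class DecorationOrder {

    /** Calls inner up to maxAttempts times and rethrows the last failure if all attempts fail. */
    static <T> Supplier<T> withRetry(int maxAttempts, Supplier<T> inner) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Stand-in for a breaker's view: counts each failure it observes, then rethrows it. */
    static <T> Supplier<T> countingFailures(AtomicInteger failuresSeen, Supplier<T> inner) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                failuresSeen.incrementAndGet();
                throw e;
            }
        };
    }

    /** An operation that fails on its first call and succeeds afterwards. */
    static Supplier<String> flakyOnce() {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        };
    }

    /** Breaker wraps retry (the framework's order): only the final outcome is visible. */
    static int failuresSeenBreakerOutside() {
        AtomicInteger seen = new AtomicInteger();
        countingFailures(seen, withRetry(3, flakyOnce())).get();
        return seen.get();
    }

    /** Retry wraps breaker: every failed attempt is visible to the breaker. */
    static int failuresSeenBreakerInside() {
        AtomicInteger seen = new AtomicInteger();
        withRetry(3, countingFailures(seen, flakyOnce())).get();
        return seen.get();
    }
}
```

For an operation that fails once and then succeeds, failuresSeenBreakerOutside() returns 0 and failuresSeenBreakerInside() returns 1. Scale that up to a burst of transient errors and the second ordering trips the breaker on failures the retry would have absorbed.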

The ResilientOperations Interface

We extract an interface from the concrete class:

public interface ResilientOperations {
    <T> T execute(String name, Supplier<T> operation);
    void executeRunnable(String name, Runnable operation);
    <T> CompletableFuture<T> executeAsync(String name, Supplier<CompletableFuture<T>> operation);
}

This is the same workaround we used for ServiceClientOperations in Part 6. Mockito’s inline mock maker on Java 25 can’t mock concrete classes in certain JVM configurations, so you extract an interface and mock that instead. Not the most glamorous reason to create an abstraction, but it works.
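One side benefit of the interface is that tests can use a hand-written double where a mocking library is unavailable. The sketch below is an assumption about how such a double could look, not code from the framework: the interface is redeclared so the example compiles on its own, and RecordingResilientOperations is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Copied from the article so the sketch compiles on its own. */
interface ResilientOperations {
    <T> T execute(String name, Supplier<T> operation);
    void executeRunnable(String name, Runnable operation);
    <T> CompletableFuture<T> executeAsync(String name, Supplier<CompletableFuture<T>> operation);
}

/** Hypothetical test double: records operation names and runs each operation unwrapped. */
final class RecordingResilientOperations implements ResilientOperations {

    private final List<String> names = new ArrayList<>();

    @Override
    public <T> T execute(String name, Supplier<T> operation) {
        names.add(name);
        return operation.get();
    }

    @Override
    public void executeRunnable(String name, Runnable operation) {
        names.add(name);
        operation.run();
    }

    @Override
    public <T> CompletableFuture<T> executeAsync(String name,
                                                 Supplier<CompletableFuture<T>> operation) {
        names.add(name);
        return operation.get();
    }

    /** Names passed in so far, in call order. */
    List<String> recordedNames() {
        return List.copyOf(names);
    }
}
```

A listener test can then assert on recordedNames() to check that the expected circuit breaker name was used, without involving retries or breaker state.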

Three Flavors

The invoker supports three calling patterns:

// Synchronous — returns a value
String result = invoker.execute("orderSaga", () -> processEvent(event));

// Fire-and-forget — void operation
invoker.executeRunnable("paymentListener", () -> publishToTopic(event));

// Async — returns CompletableFuture
CompletableFuture<Product> future = invoker.executeAsync("inventory-stock-reservation",
        () -> inventoryService.reserveStockForSaga(productId, quantity, ...));

The async variant is the one our saga listeners actually use — inventory, payment, and order service calls all return CompletableFuture.

Wiring into the Saga Listeners

The saga listeners from Part 4 now inject ResilientOperations as an optional dependency:

@Component
public class InventorySagaListener {

    private final ProductService inventoryService;
    private final HazelcastInstance hazelcast;
    private ResilientOperations resilientServiceInvoker;

    @Autowired(required = false)
    public void setResilientOperations(ResilientOperations resilientServiceInvoker) {
        this.resilientServiceInvoker = resilientServiceInvoker;
    }

That @Autowired(required = false) is doing important work. If resilience is disabled — or if the Resilience4j dependency isn’t even on the classpath — the listener still functions. It just calls the service directly, no wrapping. The saga worked before we added resilience; it should keep working without it.

Each listener has a helper that handles the null check:

private <T> CompletableFuture<T> executeWithResilience(
        final String name, final Supplier<CompletableFuture<T>> operation) {
    if (resilientServiceInvoker != null) {
        return resilientServiceInvoker.executeAsync(name, operation);
    }
    return operation.get();
}

And the actual saga step looks like this:

executeWithResilience("inventory-stock-reservation",
        () -> inventoryService.reserveStockForSaga(
                productId, quantity, orderId, sagaId, correlationId,
                customerId, total, currency, "CREDIT_CARD"
        )
).whenComplete((product, error) -> {
    if (error != null) {
        sendToDeadLetterQueue(record, "OrderCreated", error);
    } else {
        logger.info("Stock reserved for saga: productId={}, quantity={}, orderId={}, sagaId={}",
                productId, quantity, orderId, sagaId);
    }
});

The circuit breaker name inventory-stock-reservation is specific to this saga step. Each step across the three services gets its own name and its own circuit breaker:

Circuit Breaker Name	Saga Step	Service
inventory-stock-reservation	Reserve stock on OrderCreated	Inventory
inventory-stock-release	Release stock on compensation	Inventory
payment-processing	Process payment on StockReserved	Payment
payment-refund	Refund payment on compensation	Payment
order-confirmation	Confirm order on PaymentProcessed	Order
order-cancellation	Cancel order on compensation	Order

Six independent circuit breakers. If payment processing is struggling, the inventory breakers stay closed and keep doing their job.

Retry with Exponential Backoff

Transient failures — network blips, temporary overload, brief GC pauses — are the most common failure mode in distributed systems. Most of them resolve on their own within seconds. Retry is the first line of defense.

The Thundering Herd

But naive retry — retry immediately, same interval, keep hammering — can make things actively worse. Picture this: a service buckles under load, and 100 clients all get errors simultaneously. They all retry at 500ms. The service sees a spike of 100 simultaneous requests. It fails again. They all retry at 1000ms. Another spike. Same result.

This is the thundering herd problem. Everyone backs off at the same fixed interval, and everyone comes stampeding back at the same moment. The retry mechanism that was supposed to help is the thing keeping the service down.

Exponential backoff breaks the herd apart:

Attempt 1: immediate
Attempt 2: wait 500ms
Attempt 3: wait 1000ms  (500ms × 2.0)
Attempt 4: wait 2000ms  (1000ms × 2.0)

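The growing intervals give the struggling service breathing room. And because different callers started their retry sequences at slightly different moments, the growing gaps tend to spread the waves out, so each one arrives smaller than the last. Callers that failed at exactly the same instant still retry in lockstep, though. If that matters in your environment, Resilience4j’s IntervalFunction.ofExponentialRandomBackoff adds random jitter to each wait.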
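The schedule above is simple enough to compute by hand, but a small helper makes the arithmetic explicit. This is a sketch in plain Java, not the Resilience4j IntervalFunction. It assumes, as the configuration reference in this article states, that max-attempts counts the initial call, so there is one wait fewer than there are attempts. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class BackoffScheduleSketch {

    /**
     * Waits between attempts. maxAttempts includes the initial call,
     * so the result holds maxAttempts - 1 entries.
     */
    static List<Duration> waits(Duration base, double multiplier, int maxAttempts) {
        List<Duration> result = new ArrayList<>();
        double millis = base.toMillis();
        for (int attempt = 2; attempt <= maxAttempts; attempt++) {
            result.add(Duration.ofMillis(Math.round(millis)));
            millis *= multiplier; // each wait is the previous one times the multiplier
        }
        return result;
    }

    /** Sum of all waits: the time spent backing off if every attempt fails. */
    static Duration total(List<Duration> waits) {
        return waits.stream().reduce(Duration.ZERO, Duration::plus);
    }
}
```

For example, waits(Duration.ofMillis(500), 2.0, 4) yields 500ms, 1000ms, and 2000ms, which is 3.5 seconds of waiting in total before the fourth attempt.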

Configuration

The framework exposes all of this through ResilienceProperties:

framework:
  resilience:
    enabled: true
    retry:
      max-attempts: 3
      wait-duration: 500ms
      enable-exponential-backoff: true
      exponential-backoff-multiplier: 2.0

The auto-configuration translates these into a Resilience4j RetryConfig:

@Bean
@ConditionalOnMissingBean
public RetryRegistry retryRegistry(final ResilienceProperties properties) {
    final ResilienceProperties.RetryProperties retryProps = properties.getRetry();

    final RetryConfig.Builder<?> builder = RetryConfig.custom()
            .maxAttempts(retryProps.getMaxAttempts())
            .retryOnException(e -> !(e instanceof NonRetryableException));

    if (retryProps.isEnableExponentialBackoff()) {
        builder.intervalFunction(IntervalFunction
                .ofExponentialBackoff(
                        retryProps.getWaitDuration(),
                        retryProps.getExponentialBackoffMultiplier()));
    } else {
        builder.waitDuration(retryProps.getWaitDuration());
    }

    return RetryRegistry.of(builder.build());
}

Two things to note. The retryOnException predicate excludes NonRetryableException — we’ll get to that in a moment. And when enable-exponential-backoff is false, it falls back to a fixed interval between attempts.

NonRetryableException: When to Stop Trying

Not every failure is transient. “Payment declined” will never succeed on retry — the credit card is invalid. “Insufficient stock” is deterministic — the warehouse genuinely doesn’t have the product. Retrying these wastes time, wastes resources, and — if the circuit breaker is counting — burns through your failure budget for no reason.

The framework defines a marker interface:

public interface NonRetryableException {
    // Marker interface — business exceptions implement this to skip retry
}

Service exceptions opt in:

public class InsufficientStockException extends RuntimeException
        implements NonRetryableException {
    public InsufficientStockException(String message) {
        super(message);
    }
}

public class PaymentDeclinedException extends RuntimeException
        implements NonRetryableException {
    public PaymentDeclinedException(String message) {
        super(message);
    }
}

Why a marker interface instead of a base class? Because these exceptions already extend RuntimeException. A Java class can extend only one superclass, but it can implement any number of interfaces. The marker lets any exception opt out of retry without changing its class hierarchy.

The retry configuration’s predicate is one line:

.retryOnException(e -> !(e instanceof NonRetryableException))

When retry encounters one of these, it fails immediately. No backoff, no additional attempts. But the circuit breaker still records it as a failure — it still counts toward the failure rate threshold. This is the right behavior. If a service is returning “payment declined” for every single request, something is systematically wrong, and the circuit breaker should trip.
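To see the predicate's effect in isolation, here is a minimal retry loop in plain Java that honors a marker interface. It is a sketch of the idea, not the Resilience4j implementation: NonRetryable and DeclinedException are local stand-ins for the framework's types, and there is no backoff between attempts.

```java
import java.util.function.Supplier;

final class MarkerRetrySketch {

    /** Local stand-in for the framework's marker interface. */
    interface NonRetryable { }

    /** Hypothetical business exception that opts out of retry. */
    static final class DeclinedException extends RuntimeException implements NonRetryable {
        DeclinedException(String message) {
            super(message);
        }
    }

    /** Retries transient failures; rethrows NonRetryable failures after the first attempt. */
    static <T> T callWithRetry(int maxAttempts, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                if (e instanceof NonRetryable) {
                    throw e; // deterministic failure: another attempt cannot help
                }
                last = e;
            }
        }
        throw last;
    }

    /** Counts how many times an always-failing operation gets invoked. */
    static int attemptsUntilGivingUp(int maxAttempts, RuntimeException failure) {
        int[] calls = {0};
        try {
            callWithRetry(maxAttempts, () -> {
                calls[0]++;
                throw failure;
            });
        } catch (RuntimeException expected) {
            // the failure is expected; only the call count matters here
        }
        return calls[0];
    }
}
```

An operation that always throws DeclinedException is invoked once, while one that always throws IllegalStateException is invoked maxAttempts times.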

Retry Observability

Resilience4j publishes events for every retry attempt, and the framework hooks into them for structured logging and a custom metric:

public class RetryEventListener {

    private final MeterRegistry meterRegistry;

    public RetryEventListener(final RetryRegistry retryRegistry,
                              final MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        retryRegistry.getAllRetries().forEach(this::registerListeners);
        retryRegistry.getEventPublisher().onEntryAdded(
                event -> registerListeners(event.getAddedEntry()));
    }

    private void registerListeners(final Retry retry) {
        final var eventPublisher = retry.getEventPublisher();
        eventPublisher.onRetry(this::onRetry);
        eventPublisher.onSuccess(this::onSuccess);
        eventPublisher.onError(this::onError);
        eventPublisher.onIgnoredError(this::onIgnoredError);
    }
}

Four event types give you the full picture:

Event	Log Level	What happened
onRetry	WARN	An attempt failed, trying again
onSuccess	INFO	Eventually succeeded
onError	ERROR	All retries exhausted
onIgnoredError	INFO	Non-retryable, skipped retry

That last one — onIgnoredError — needed a custom Micrometer counter because Resilience4j’s built-in TaggedRetryMetrics doesn’t track ignored errors:

private void onIgnoredError(final RetryOnIgnoredErrorEvent event) {
    logger.info("Non-retryable exception for '{}', skipping retry: {}",
            event.getName(), event.getLastThrowable().getMessage());

    Counter.builder("framework.resilience.retry.ignored")
            .description("Count of non-retryable exceptions that skipped retry")
            .tag("name", event.getName())
            .register(meterRegistry)
            .increment();
}

In practice, the logs tell you a clear story. A transient failure that recovers:

WARN  RetryEventListener - Retry attempt #1 for 'payment-processing': Connection refused
WARN  RetryEventListener - Retry attempt #2 for 'payment-processing': Connection refused
INFO  RetryEventListener - 'payment-processing' succeeded after 2 attempt(s)

A business exception that gets kicked straight to the dead letter queue:

INFO  RetryEventListener - Non-retryable exception for 'payment-processing',
      skipping retry: Insufficient funds for amount 15000.00

The ResilienceException Wrapper

When an operation exhausts all retries or gets rejected by an open circuit breaker, the framework wraps the failure in a ResilienceException:

public class ResilienceException extends RuntimeException {

    private final String operationName;

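    public ResilienceException(String message, String operationName, Throwable cause) {
        super(message, cause);
        this.operationName = operationName;
    }

    // Accessor so downstream handlers can read which operation failed
    public String getOperationName() {
        return operationName;
    }
}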

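The operationName field tells downstream handlers which named operation, and therefore which circuit breaker, failed. Keep in mind that execute throws ResilienceException both when the breaker is open and when retries are exhausted; the cause (a CallNotPermittedException in the first case) tells the two apart. The dead letter queue integration (Part 9) checks the exception type to classify failures: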

if (error instanceof ResilienceException) {
    logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
} else {
    logger.error("Failed to process event: {}", eventId, error);
}
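If a handler needs to distinguish an open breaker from exhausted retries, it can look at the cause of the wrapper. The following is a sketch under stated assumptions: Wrapped and CallRejected are hypothetical stand-ins for ResilienceException and Resilience4j's CallNotPermittedException, so the example compiles without either library. It also unwraps CompletionException, since a dependent CompletableFuture stage may deliver the failure inside one.

```java
import java.util.concurrent.CompletionException;

final class FailureClassifierSketch {

    enum Kind { BREAKER_OPEN, RETRIES_EXHAUSTED, OTHER }

    /** Hypothetical stand-in for Resilience4j's CallNotPermittedException. */
    static final class CallRejected extends RuntimeException {
        CallRejected() {
            super("call not permitted");
        }
    }

    /** Hypothetical stand-in for the framework's ResilienceException. */
    static final class Wrapped extends RuntimeException {
        Wrapped(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static Kind classify(Throwable error) {
        Throwable t = error;
        // Peel off CompletionException layers added by CompletableFuture stages.
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        if (t instanceof Wrapped) {
            // The wrapper's cause records why the call was given up on.
            return t.getCause() instanceof CallRejected
                    ? Kind.BREAKER_OPEN
                    : Kind.RETRIES_EXHAUSTED;
        }
        return Kind.OTHER;
    }
}
```

A handler could then log the BREAKER_OPEN case at WARN as a deferred step and treat RETRIES_EXHAUSTED as a genuine failure.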

Auto-Configuration

The whole resilience stack is wired through a single auto-configuration class:

@Configuration
@ConditionalOnClass(CircuitBreakerRegistry.class)
@ConditionalOnProperty(name = "framework.resilience.enabled", matchIfMissing = true)
@EnableConfigurationProperties(ResilienceProperties.class)
public class ResilienceAutoConfiguration {

    @Bean @ConditionalOnMissingBean
    public CircuitBreakerRegistry circuitBreakerRegistry(ResilienceProperties properties) { ... }

    @Bean @ConditionalOnMissingBean
    public RetryRegistry retryRegistry(ResilienceProperties properties) { ... }

    @Bean @ConditionalOnMissingBean
    public ResilientServiceInvoker resilientServiceInvoker(...) { ... }

    @Bean @ConditionalOnMissingBean(TaggedCircuitBreakerMetrics.class)
    public TaggedCircuitBreakerMetrics taggedCircuitBreakerMetrics(...) { ... }

    @Bean @ConditionalOnMissingBean(TaggedRetryMetrics.class)
    public TaggedRetryMetrics taggedRetryMetrics(...) { ... }

    @Bean @ConditionalOnMissingBean
    public RetryEventListener retryEventListener(...) { ... }
}

Three conditionals control activation. @ConditionalOnClass(CircuitBreakerRegistry.class) means the whole thing only activates when Resilience4j is on the classpath — services that don’t include the dependency don’t get any resilience beans. @ConditionalOnProperty(…, matchIfMissing = true) means it’s enabled by default; set framework.resilience.enabled=false to turn it off. And every individual bean is @ConditionalOnMissingBean, so the application can override any piece by defining its own bean.

Six beans total:

CircuitBreakerRegistry — circuit breaker instances, configured from properties
RetryRegistry — retry instances with optional exponential backoff
ResilientServiceInvoker — the decorator that wraps operations
TaggedCircuitBreakerMetrics — binds circuit breaker metrics to Micrometer
TaggedRetryMetrics — binds retry metrics to Micrometer
RetryEventListener — structured logging and the custom ignored-error counter

Per-Instance Tuning

Different saga steps have different tolerance for failure. Stock reservation should be fast and reliable — if it’s failing, something is seriously wrong, and we want the circuit to trip quickly. Payment processing, on the other hand… payment providers are notoriously flaky. You’d rather tolerate a higher failure rate and give the provider more time to sort itself out before you start rejecting everything.

The framework supports per-instance overrides in each service’s application.yml:

framework:
  resilience:
    enabled: true
    circuit-breaker:
      failure-rate-threshold: 50
      wait-duration-in-open-state: 10s
      sliding-window-size: 10
      minimum-number-of-calls: 5
      permitted-number-of-calls-in-half-open-state: 3
    retry:
      max-attempts: 3
      wait-duration: 500ms
      enable-exponential-backoff: true
      exponential-backoff-multiplier: 2.0
    instances:
      inventory-stock-reservation:
        circuit-breaker:
          failure-rate-threshold: 40
          wait-duration-in-open-state: 5s
        retry:
          max-attempts: 2
      payment-processing:
        circuit-breaker:
          failure-rate-threshold: 60
          wait-duration-in-open-state: 15s
        retry:
          max-attempts: 5
          wait-duration: 1s

The instances map lets any named circuit breaker override the defaults:

public CircuitBreakerProperties getCircuitBreakerForInstance(final String name) {
    final InstanceProperties instance = instances.get(name);
    if (instance != null && instance.getCircuitBreaker() != null) {
        return instance.getCircuitBreaker();
    }
    return circuitBreaker; // Fall back to defaults
}

So in this configuration, inventory-stock-reservation trips at a 40% failure rate with a 5-second open state and only 2 attempts (the initial call plus one retry); stock checks are idempotent and fast, so there is no point dragging things out. payment-processing tolerates a 60% failure rate with a 15-second open state and 5 attempts, with waits starting at 1 second. With exponential backoff at the default 2.0 multiplier, the waits are 1s, 2s, 4s, and 8s, so the final attempt starts about 15 seconds after the first failure. Payment providers get the patience they’ve trained us to give them.

Metrics and Monitoring

The auto-configuration binds circuit breaker and retry metrics to Micrometer, which exports to Prometheus for Grafana dashboards:

Circuit Breaker Metrics

Metric	Type	Description
resilience4j_circuitbreaker_state	Gauge	One series per state, distinguished by a state tag (closed, open, half_open, and so on); the value is 1 for the current state and 0 otherwise
resilience4j_circuitbreaker_calls_total	Counter	Total calls by outcome (successful, failed, not_permitted)
resilience4j_circuitbreaker_failure_rate	Gauge	Current failure rate percentage
resilience4j_circuitbreaker_buffered_calls	Gauge	Calls in sliding window

Retry Metrics

Metric	Type	Description
resilience4j_retry_calls_total	Counter	Total calls by outcome (successful_without_retry, successful_with_retry, failed_with_retry, failed_without_retry)
framework.resilience.retry.ignored	Counter	Non-retryable exceptions (tagged by name); exported to Prometheus as framework_resilience_retry_ignored_total

These feed into Grafana panels for saga health — circuit breaker state timeline showing when breakers trip and recover, retry rate over time where a spike tells you something transient is happening, failure rate broken out by saga step so you can see which one is misbehaving, and the non-retryable exception count that separates business logic failures from infrastructure problems.

Configuration Reference

framework.resilience.*

Property	Default	Description
enabled	true	Master toggle for all resilience features
circuit-breaker.failure-rate-threshold	50	Failure rate (%) to trip the breaker
circuit-breaker.wait-duration-in-open-state	10s	How long to stay open before testing
circuit-breaker.sliding-window-size	10	Number of calls in the measurement window
circuit-breaker.sliding-window-type	COUNT_BASED	COUNT_BASED or TIME_BASED
circuit-breaker.minimum-number-of-calls	5	Minimum calls before evaluating failure rate
circuit-breaker.permitted-number-of-calls-in-half-open-state	3	Test calls in half-open state
retry.max-attempts	3	Maximum number of attempts, including the initial call
retry.wait-duration	500ms	Base wait between retries
retry.enable-exponential-backoff	true	Use exponential backoff
retry.exponential-backoff-multiplier	2.0	Backoff multiplier
instances.<name>.circuit-breaker.*	(defaults)	Per-instance circuit breaker overrides
instances.<name>.retry.*	(defaults)	Per-instance retry overrides

What’s Next

Circuit breakers and retry handle one category of failure: transient problems during event consumption. The saga listener tries, the call fails, the retry policy kicks in, the circuit breaker keeps the damage from spreading. That covers the consumer side.

But what about the producer side? When EventSourcingController needs to republish an event to the shared cluster and the cluster is temporarily unreachable, the event just… vanishes. No retry. No circuit breaker. Gone.

That’s a different failure mode, and it needs a different mechanism. In Part 8, we add the transactional outbox pattern — a durable buffer between event production and cross-cluster delivery that guarantees no events are lost, even when the shared cluster is down. Then Part 9 closes the loop with dead letter queues and idempotency guards for events that exhaust all retries or arrive more than once.

Next up: The Transactional Outbox Pattern with Hazelcast

Previous: MCP Server for Microservices: AI-Powered Debugging

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 15, 2026

MCP Server for Microservices: AI-Powered Debugging

Part 6 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

Over the first five articles, we built an event sourcing framework, a Jet pipeline, materialized views, a choreographed saga pattern, and vector similarity search. That’s a lot of infrastructure. It also means that investigating a problem — say, a failed saga — involves chaining together five or six curl commands across four different services, reading JSON output with your eyes, extracting IDs by hand, and constructing the next request.

Which is fine. It’s what we’ve always done. But there’s a better option now.

The Model Context Protocol (MCP) is an open standard that lets AI assistants — Claude, ChatGPT, Copilot, whoever — call tools exposed by external servers. Instead of the assistant guessing at curl commands or asking you to copy-paste output, it directly queries your materialized views, submits events, inspects saga state, and runs demo scenarios.

In this article, we build an MCP server that bridges AI assistants to our eCommerce microservices. And yes, there is something a little meta about using Claude to build a framework and then building a bridge so Claude can operate the framework. We’re going with it.

Why Give an AI Access to Your Microservices?

Consider a typical debugging session. A saga has failed, and you want to know why:

# Step 1: Find failed sagas
curl http://localhost:8083/api/sagas?status=FAILED

# Step 2: Copy a saga ID from the JSON output
curl http://localhost:8083/api/sagas/saga-a7f3e2

# Step 3: Check the order that triggered it
curl http://localhost:8083/api/orders/ord-12345

# Step 4: Check the event history
curl http://localhost:8083/api/orders/ord-12345/events

# Step 5: Check if stock was released as part of compensation
curl http://localhost:8082/api/products/prod-67890

Five commands. Each one requires reading JSON output, finding the right ID, and constructing the next request. You’re doing the orchestration in your head, and — let’s be honest — that’s exactly the kind of tedious mechanical chaining that humans are bad at and computers are good at.

With MCP, the same investigation is a single sentence:

“Why did the most recent saga fail?”

The AI calls list_sagas(status="FAILED"), then inspect_saga(sagaId="saga-a7f3e2"), then get_event_history(aggregateId="ord-12345", aggregateType="Order"), interprets all the responses, and gives you a summary:

“Saga saga-a7f3e2 failed at the payment step. Order ORD-12345 had a total of $15,000 which exceeded the $10,000 payment limit. Compensation ran successfully — stock for product PROD-67890 was released.”

Three tool calls, zero curl commands, and a root-cause analysis. From one question.

What Is MCP?

MCP (Model Context Protocol) is an open specification by Anthropic that defines a standard interface between AI assistants and external tools. Think of it as a contract:

MCP protocol sequence: the AI assistant sends tools/list and tools/call to the MCP server, which returns tool definitions and JSON results over JSON-RPC

The protocol uses JSON-RPC 2.0 over one of two transports:

Transport	How It Works	Best For
stdio	AI assistant launches the server as a subprocess; communicates via stdin/stdout	Local development with Claude Code or Claude Desktop
SSE (HTTP)	Server runs as a web service; AI connects over HTTP with Server-Sent Events	Docker, remote deployment, multi-user

The AI assistant doesn’t need to know anything about Hazelcast, Jet pipelines, or event sourcing. It sees ten tools with descriptions and parameters. The MCP server handles the translation between “query the customer view” and “GET http://account-service:8081/api/customers.”
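On the wire, a tool invocation is an ordinary JSON-RPC 2.0 request whose method is tools/call, with the tool name and arguments carried in params. The sketch below builds one such envelope as a string. The request id and the single-argument shape are simplifications chosen for illustration, ToolCallRequestSketch is a hypothetical name, and a real client would use a JSON library rather than string formatting.

```java
final class ToolCallRequestSketch {

    /**
     * Builds a JSON-RPC 2.0 envelope for an MCP tools/call request with one
     * string argument. Values are assumed to need no JSON escaping.
     */
    static String toolsCall(int id, String toolName, String argName, String argValue) {
        return String.format(
                "{\"jsonrpc\":\"2.0\",\"id\":%d,\"method\":\"tools/call\","
                        + "\"params\":{\"name\":\"%s\",\"arguments\":{\"%s\":\"%s\"}}}",
                id, toolName, argName, argValue);
    }
}
```

Calling toolsCall(1, "list_sagas", "status", "FAILED") produces the request an assistant would send to list failed sagas. The server's reply comes back as a JSON-RPC response carrying the same id.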

Designing Tools Around Event Sourcing

The hardest part of building an MCP server isn’t the protocol — it’s deciding what tools to expose. Too many and the AI gets confused about which one to use. Too few and it can’t do useful work. We went back and forth on this and started with seven, organized around the three concerns of an event-sourced system. Three more got added later for dead letter queue recovery, which we’ll get to in a moment.

Queries (Read Current State)

Tool	What It Does
query_view	Read materialized views — current state of customers, products, orders, payments
get_event_history	Read the event log — how an entity reached its current state

These map to the read side of CQRS. Views give you the “what,” event history gives you the “why.”

Commands (Produce New Events)

Tool	What It Does
submit_event	Create customers, products, orders; cancel orders; reserve stock; process payments; refund payments
run_demo	Execute multi-step scenarios (happy path, payment failure, saga timeout, sample data)

Each command produces domain events that flow through the Jet pipeline. run_demo chains multiple commands together to set up investigation targets — a failed payment saga, a timeout scenario, a happy path to compare against.

Observability (Inspect the System)

Tool	What It Does
inspect_saga	View a saga’s status, steps completed, timing, and failure reason
list_sagas	Browse sagas filtered by status
get_metrics	Aggregated system metrics — saga counts, event throughput, active gauges

Dead Letter Queue (Investigate and Replay Failures)

Tool	What It Does
list_dlq_entries	List failed events that landed in the dead letter queue, with a pending-count summary for quick triage
inspect_dlq_entry	View a single DLQ entry: event data, failure reason, saga context, replay count
replay_dlq_entry	Republish a DLQ entry’s event for reprocessing — after the cause is fixed

We hadn’t built the DLQ machinery yet when the MCP server first shipped, so these three were added later. The investigation workflow — list, inspect, then decide to replay or not — turned out to map cleanly onto how a human operator works through a queue of failed events. Asking the AI to walk that with you, one entry at a time, is dramatically less tedious than the curl version.

Ten tools, four categories, no overlap. The AI handles any reasonable question about the system, and tool selection stays reliable — you’d never call get_metrics when you meant query_view, or list_dlq_entries when you meant list_sagas. The shape of the tool decides which question it answers.

Architecture: A Pure REST Proxy

The MCP server sits between the AI assistant and the microservices:

MCP server architecture: an AI assistant connects via the MCP protocol to a Spring Boot MCP server on port 8085, which proxies REST calls to the Account, Inventory, Order, and Payment services

We made a deliberate choice here: the MCP server has no Hazelcast dependency. It doesn’t join any cluster, doesn’t read IMaps, doesn’t run Jet jobs. It’s a thin REST proxy that translates MCP tool calls into HTTP requests against the existing service APIs.

Why go to the trouble of keeping them separate? Because coupling the MCP server to Hazelcast would mean classpath conflicts with the services, a dependency on the data layer that makes testing painful, and another component that needs Hazelcast configuration. As a pure proxy, the server needs maybe 128-256 MB of heap, has no classpath conflicts, and you can test every tool by mocking REST responses without running a single service.

Implementation

The ServiceClient

All HTTP communication goes through one class:

@Component
public class ServiceClient implements ServiceClientOperations {

    private final McpServerProperties properties;
    private final RestClient restClient;

    public Map<String, Object> getEntity(String viewName, String id) {
        String url = resolveUrl(viewName) + "/" + id;
        String json = restClient.get().uri(url).retrieve().body(String.class);
        return parseMap(json);
    }

    String resolveUrl(String viewName) {
        return switch (viewName.toLowerCase()) {
            case "customer" -> properties.getAccountUrl() + "/api/customers";
            case "product"  -> properties.getInventoryUrl() + "/api/products";
            case "order"    -> properties.getOrderUrl() + "/api/orders";
            case "payment"  -> properties.getPaymentUrl() + "/api/payments";
            default -> throw new IllegalArgumentException("Unknown view: " + viewName);
        };
    }
}

That resolveUrl switch is the only place that knows which service owns which view. Every tool delegates to ServiceClient rather than making HTTP calls directly.

The ServiceClientOperations interface exists because Mockito’s inline mock maker on Java 25 cannot mock concrete classes. We hit this wall across the framework — the solution every time was to extract an interface so tests can mock it. It’s a slightly annoying pattern, but it works.

A Tool Implementation

Each tool is a Spring @Service with a @Tool-annotated method. Here’s QueryViewTool:

@Service
public class QueryViewTool {

    private final ServiceClientOperations serviceClient;

    @Tool(description = "Query a materialized view. "
            + "Available views: customer, product, order, payment. "
            + "Provide a key to get a specific entity, or omit to list entities.")
    public String queryView(
            @ToolParam(description = "View to query: customer, product, order, or payment")
            String viewName,
            @ToolParam(description = "Optional: specific entity ID", required = false)
            String key,
            @ToolParam(description = "Max results when listing (default: 10)", required = false)
            Integer limit) {

        if (key != null && !key.isBlank()) {
            return toJson(serviceClient.getEntity(viewName, key));
        } else {
            int effectiveLimit = (limit != null && limit > 0) ? limit : 10;
            List<Map<String, Object>> results = serviceClient.listEntities(viewName, effectiveLimit);
            return toJson(Map.of(
                    "view", viewName,
                    "count", results.size(),
                    "entities", results
            ));
        }
    }
}

That @Tool description is doing real work. The AI reads it to decide which tool to call and what parameters to provide. If you’re vague — “query data” instead of “Query a materialized view. Available views: customer, product, order, payment” — the AI picks the wrong tool or provides wrong parameters. We learned this the hard way. Be specific. Name the available views. Explain what happens with versus without a key.

The optional parameters with defaults matter too. When the AI omits key, the tool lists entities. When it omits limit, you get 10. This lets a single tool handle “show me all customers” and “look up customer cust-123” without the AI having to supply every parameter on every call.

Tool Registration

All ten tools get registered in one place:

@Configuration
public class McpToolConfig {

    @Bean
    public ToolCallbackProvider mcpTools(QueryViewTool queryView,
                                         SubmitEventTool submitEvent,
                                         GetEventHistoryTool getEventHistory,
                                         InspectSagaTool inspectSaga,
                                         ListSagasTool listSagas,
                                         GetMetricsTool getMetrics,
                                         RunDemoTool runDemo,
                                         ListDlqEntriesTool listDlqEntries,
                                         InspectDlqEntryTool inspectDlqEntry,
                                         ReplayDlqEntryTool replayDlqEntry) {
        return MethodToolCallbackProvider.builder()
                .toolObjects(queryView, submitEvent, getEventHistory,
                        inspectSaga, listSagas, getMetrics, runDemo,
                        listDlqEntries, inspectDlqEntry, replayDlqEntry)
                .build();
    }
}

Spring AI’s MethodToolCallbackProvider scans each object for @Tool methods and registers them with the MCP server. When the AI calls tools/list, it gets back all ten tool definitions with their descriptions and parameter schemas.

The Event Dispatch Pattern

SubmitEventTool deserves a closer look because it maps a single tool to seven different service endpoints:

Map<String, Object> dispatch(String eventType, Map<String, Object> payload) {
    return switch (eventType) {
        case "CreateCustomer"  -> serviceClient.createEntity("customer", payload);
        case "CreateProduct"   -> serviceClient.createEntity("product", payload);
        case "CreateOrder"     -> serviceClient.createEntity("order", payload);
        case "CancelOrder"     -> {
            String orderId = requireField(payload, "orderId");
            yield serviceClient.performAction("order", orderId, "cancel", payload, true);
        }
        case "ReserveStock"    -> {
            String productId = requireField(payload, "productId");
            yield serviceClient.performAction("product", productId, "stock/reserve", payload, false);
        }
        case "ProcessPayment"  -> serviceClient.createEntity("payment", payload);
        case "RefundPayment"   -> {
            String paymentId = requireField(payload, "paymentId");
            yield serviceClient.performAction("payment", paymentId, "refund", payload, false);
        }
        default -> throw new IllegalArgumentException("Unknown event type: " + eventType);
    };
}

The alternative would be seven separate tools (create_customer, create_product, and so on). We went with a single submit_event tool with an eventType discriminator for three reasons: it mirrors the event sourcing model (the system is event-driven, so the tool should feel event-driven), it keeps the total tool count at ten instead of sixteen, and the AI handles the dispatch naturally. When you say “create a customer named Alice,” it maps that to eventType="CreateCustomer" without difficulty.
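The dispatch code calls a requireField helper that is not shown. A plausible shape for it, offered as an assumption and not as the framework's actual implementation, is a lookup that fails fast with a message naming the missing field, so the AI gets an actionable error back:

```java
import java.util.Map;

final class RequireFieldSketch {

    /** Returns the named field as a non-blank string, or fails with a message naming it. */
    static String requireField(Map<String, Object> payload, String field) {
        Object value = payload.get(field);
        if (value == null || value.toString().isBlank()) {
            throw new IllegalArgumentException("Missing required field: " + field);
        }
        return value.toString();
    }
}
```

Naming the field in the exception message matters more than usual here, because the caller is a model that will read the message and try again with corrected arguments.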

The Demo Tool

RunDemoTool is the most complex tool because each scenario chains multiple service calls:

private Map<String, Object> runHappyPath() {
    // Step 1: Create customer
    Map<String, Object> customer = serviceClient.createEntity("customer", Map.of(
            "name", "Demo Customer",
            "email", "demo-" + shortId() + "@example.com",
            "address", "123 Demo Street"
    ));

    // Step 2: Create product
    Map<String, Object> product = serviceClient.createEntity("product", Map.of(
            "sku", "DEMO-" + shortId(),
            "name", "Demo Widget",
            "price", "29.99",
            "quantityOnHand", 100
    ));

    // Step 3: Create order (uses IDs from previous steps)
    String customerId = extractId(customer, "customerId");
    String productId = extractId(product, "productId");
    Map<String, Object> order = serviceClient.createEntity("order", Map.of(
            "customerId", customerId,
            "customerName", "Demo Customer",
            "lineItems", List.of(Map.of(
                    "productId", productId,
                    "productName", "Demo Widget",
                    "quantity", 2,
                    "unitPrice", 29.99
            ))
    ));

    return Map.of("scenario", "happy_path", "steps", List.of(...));
}

Each scenario uses shortId() — a UUID fragment — so you can run the same scenario multiple times without naming collisions. The payment_failure scenario creates a $16,500 order that exceeds the $10,000 payment limit, triggering saga compensation. The saga_timeout scenario creates an order with minimal stock, designed to hit the deadline. These are pre-built investigation targets — the AI equivalent of a test fixture.
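For reference, a UUID-fragment helper along these lines can be very small. This is a sketch under the assumption that shortId() takes the first eight hex characters of a random UUID; the real method may slice it differently:

```java
import java.util.UUID;

public class ShortIdSketch {

    // Assumed implementation: the first 8 hex characters of a random UUID.
    // A UUID string starts with an 8-character hex group, so no dash is included.
    public static String shortId() {
        return UUID.randomUUID().toString().substring(0, 8);
    }

    public static void main(String[] args) {
        System.out.println("demo-" + shortId() + "@example.com");
        System.out.println("DEMO-" + shortId());
    }
}
```

Eight hex characters give about four billion possible values, which is plenty to keep repeated demo runs from colliding, though it is not a substitute for a real unique key.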

Stdio vs. SSE: Two Transport Modes

Default: stdio (Local Development)

# application.properties
spring.main.web-application-type=none
spring.ai.mcp.server.name=ecommerce-mcp-server

The AI assistant launches the server as a subprocess and communicates via stdin/stdout using JSON-RPC:

stdio transport: Claude Code spawns the MCP server as a java -jar subprocess and communicates over stdin and stdout using JSON-RPC 2.0

No network port needed. This is the default for local development with Claude Code or Claude Desktop.

Docker: SSE/HTTP (Networked Deployment)

# application-docker.properties
spring.main.web-application-type=servlet
spring.ai.mcp.server.stdio=false
server.port=8085

In Docker, the MCP server runs as a web service with Server-Sent Events on port 8085:

mcp-server:
  build: ../mcp-server
  ports:
    - "8085:8085"
  environment:
    - SPRING_PROFILES_ACTIVE=docker
    - MCP_SERVICES_ACCOUNT_URL=http://account-service:8081
    - MCP_SERVICES_INVENTORY_URL=http://inventory-service:8082
    - MCP_SERVICES_ORDER_URL=http://order-service:8083
    - MCP_SERVICES_PAYMENT_URL=http://payment-service:8084

The profile switch is the only difference between the two modes. Same tool code, same behavior.

Testing

Each tool has unit tests that mock ServiceClientOperations:

@ExtendWith(MockitoExtension.class)
class QueryViewToolTest {

    @Mock
    private ServiceClientOperations serviceClient;

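    private QueryViewTool queryViewTool;

    // Used below to parse the tool's JSON response
    private final ObjectMapper objectMapper = new ObjectMapper();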

    @BeforeEach
    void setUp() {
        queryViewTool = new QueryViewTool(serviceClient);
    }

    @Test
    void shouldQueryByKey() throws JsonProcessingException {
        when(serviceClient.getEntity("customer", "c1"))
                .thenReturn(Map.of("customerId", "c1", "name", "Alice"));

        String result = queryViewTool.queryView("customer", "c1", null);

        verify(serviceClient).getEntity("customer", "c1");
        Map<String, Object> parsed = objectMapper.readValue(result, new TypeReference<>() {});
        assertNotNull(parsed.get("customerId"));
    }
}

Eleven test classes cover all ten tools plus the ServiceClient. Add another six for the security layer (more on that below) and one integration suite, and the mcp-server module sits at 143 tests total.

Integration tests use Spring’s ApplicationContextRunner to verify bean wiring without starting the MCP stdio transport (which would block in a test environment):

@DisplayName("MCP Tool Integration")
class McpToolIntegrationTest {

    private final ApplicationContextRunner contextRunner = new ApplicationContextRunner()
            .withConfiguration(AutoConfigurations.of(McpToolConfig.class))
            .withUserConfiguration(TestServiceClientConfig.class)
            .withBean(McpServerProperties.class);

    @Test
    void shouldCreateAllToolBeans() {
        contextRunner.run(context -> {
            assertThat(context).hasSingleBean(QueryViewTool.class);
            assertThat(context).hasSingleBean(SubmitEventTool.class);
            // ... all 10 tools
        });
    }

    @Test
    void shouldRegisterToolCallbackProvider() {
        contextRunner.run(context -> {
            ToolCallbackProvider provider = context.getBean(ToolCallbackProvider.class);
            assertThat(provider.getToolCallbacks()).hasSize(10);
        });
    }
}

Configuration

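Apart from the transport and security settings covered elsewhere in this post, the MCP server needs only four configuration properties, one URL per backend service: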

mcp.services.account-url=http://localhost:8081
mcp.services.inventory-url=http://localhost:8082
mcp.services.order-url=http://localhost:8083
mcp.services.payment-url=http://localhost:8084

In Docker, these are overridden by environment variables pointing to container hostnames. That’s it. No Hazelcast configuration, no cluster membership, no pipeline setup.

Setting Up the AI Assistant

Claude Code

Add to ~/.claude/claude_code_config.json:

{
  "mcpServers": {
    "hazelcast-ecommerce": {
      "command": "java",
      "args": ["-jar", "mcp-server/target/mcp-server-1.0.0-SNAPSHOT.jar"]
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "hazelcast-ecommerce": {
      "command": "java",
      "args": ["-jar", "/absolute/path/to/mcp-server-1.0.0-SNAPSHOT.jar"]
    }
  }
}

Docker (SSE)

Point the AI assistant to the SSE endpoint: http://localhost:8085/mcp

In all three cases, the AI discovers the ten tools via tools/list and starts using them immediately.

A Real Conversation

Here’s what a debugging session looks like with MCP connected:

You: “Something’s wrong with recent orders. Can you check?”

Assistant: calls list_sagas(status="FAILED")

calls get_metrics()

“I found 2 failed sagas out of 142 total. Your overall success rate is 94.4%. Let me investigate the failures.”

calls inspect_saga(sagaId="saga-fail-1")

calls inspect_saga(sagaId="saga-fail-2")

“Both failures are payment rejections:

saga-fail-1 (Order ORD-088): Total $15,000 exceeded the $10,000 payment limit. Stock was released via compensation.
saga-fail-2 (Order ORD-091): Total $16,500, same payment limit issue. Stock also released.

These aren’t bugs — the payment service is correctly rejecting orders above the limit. You might want to either raise the limit or add client-side validation to prevent orders over $10,000 from being submitted.”

Four tool calls, zero curl commands, and a root-cause analysis with a recommendation. From one question. I’ll be honest: the first time I watched the AI chain together the right sequence of calls and arrive at a correct diagnosis, it felt a little eerie. Like watching someone drive your car better than you do.

Authentication and Tool Authorization

The first version of this server had no authentication, which is fine for local development and obviously not fine for anything else. So we’ve added API key authentication and role-based tool access — disabled by default to preserve backward compatibility, and enabled with a single property when you need it.

mcp:
  security:
    enabled: true
    api-keys:
      viewer-key-12345: VIEWER
      operator-key-67890: OPERATOR
      admin-key-99999: ADMIN

In HTTP/SSE mode the key arrives in the X-API-Key request header. In stdio mode it’s read from the MCP_API_KEY environment variable. Either way, the server resolves the key to a role, and a ToolAuthorizer checks whether the role is permitted to invoke the tool the AI just asked for.

Three roles are defined:

VIEWER — Read-only. Can call query_view, get_event_history, inspect_saga, list_sagas, get_metrics, list_dlq_entries, and inspect_dlq_entry. Cannot modify state.
OPERATOR — Read plus write. Adds submit_event, run_demo, and replay_dlq_entry.
ADMIN — Same as OPERATOR today, reserved for future admin-only tools.

run_demo is a good example of why the role split matters — it’s the kind of tool you absolutely do not want firing in production, and the default VIEWER key keeps that off the table. The viewer can do everything an SRE wants to do during an incident — query, inspect, look at metrics — but it can’t accidentally place an order.
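To make the role split concrete, here is a minimal sketch of the key-to-role lookup and the per-tool check. The class name, method names, and internal layout are our own assumptions for illustration; only the role names, tool names, and sample keys come from the configuration above:

```java
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class ToolAuthorizerSketch {

    public enum Role { VIEWER, OPERATOR, ADMIN }

    private static final Set<String> READ_TOOLS = Set.of(
            "query_view", "get_event_history", "inspect_saga", "list_sagas",
            "get_metrics", "list_dlq_entries", "inspect_dlq_entry");

    private static final Set<String> WRITE_TOOLS = Set.of(
            "submit_event", "run_demo", "replay_dlq_entry");

    private final Map<String, Role> apiKeys;
    private final Map<Role, Set<String>> permitted = new EnumMap<>(Role.class);

    public ToolAuthorizerSketch(Map<String, Role> apiKeys) {
        this.apiKeys = Map.copyOf(apiKeys);
        Set<String> readWrite = new HashSet<>(READ_TOOLS);
        readWrite.addAll(WRITE_TOOLS);
        permitted.put(Role.VIEWER, READ_TOOLS);
        permitted.put(Role.OPERATOR, readWrite);
        permitted.put(Role.ADMIN, readWrite); // same as OPERATOR today
    }

    // Unknown or missing keys resolve to no role at all.
    public Optional<Role> resolve(String apiKey) {
        if (apiKey == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(apiKeys.get(apiKey));
    }

    // Deny by default: no key, unknown key, or unlisted tool all return false.
    public boolean isAllowed(String apiKey, String toolName) {
        if (toolName == null) {
            return false;
        }
        return resolve(apiKey)
                .map(role -> permitted.get(role).contains(toolName))
                .orElse(false);
    }

    public static void main(String[] args) {
        ToolAuthorizerSketch auth = new ToolAuthorizerSketch(Map.of(
                "viewer-key-12345", Role.VIEWER,
                "operator-key-67890", Role.OPERATOR));
        System.out.println(auth.isAllowed("viewer-key-12345", "get_metrics"));
        System.out.println(auth.isAllowed("viewer-key-12345", "run_demo"));
        System.out.println(auth.isAllowed("operator-key-67890", "run_demo"));
    }
}
```

The deny-by-default shape is the part worth copying: a tool added later stays locked until someone deliberately places it in a role's set.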

One layer is still missing: the MCP server authenticates its callers, but it doesn’t forward caller identity to the downstream microservices. For a real production deployment you’d want both. We’ll come back to that.

Where This Goes Next

A few directions we haven’t explored yet.

MCP supports streaming responses, which we’d want for large result sets — listing thousands of events as a single JSON blob isn’t great. MCP also has resources, read-only data endpoints that the AI can reference as context without explicitly calling a tool. The materialized views are a natural fit for that.

OAuth forwarding is the gap mentioned above — the MCP server’s caller identity needs to propagate down to the backend services if we want end-to-end auth in production. The plumbing exists in Spring Security; we just haven’t wired it up.

And with the MCP server as a foundation, you could build specialized AI agents — an operations agent that monitors sagas and flags anomalies, a demo agent that walks users through the system, a testing agent that creates targeted test data and verifies compensation paths. We haven’t built any of these yet, but the tool layer is there.

The MCP server adds a natural-language interface to everything we’ve built so far. Ten tools, a thin REST proxy, two transport modes, role-based authorization, 143 tests. It doesn’t add new capabilities to the data layer — it makes the existing capabilities accessible through conversation. And that turns out to matter more than it sounds like it should. The investigation that took five curl commands now takes one sentence. The demo that required a script and documentation now requires “show me the happy path.” The system that was only inspectable by people who knew the API endpoints is now inspectable by anyone who can ask a question.

That’s where we’ll leave things for today.

Next up: Circuit Breakers and Retry for Saga Resilience

Previous: Vector Similarity Search with Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 8, 2026

On the Vector Store I Didn’t Ask For

A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series

AI has been instrumental in bringing this project to fruition — I’m not making any secret of that. The first three posts in this series describe work that was largely pre-existing demo code: domain objects, the Jet pipeline, the materialized view machinery. Claude polished what was already there and helped me write about it. Honest work, but mostly cleanup.

The saga post (post 4) marked a shift — that’s where the demo’s functionality moved into genuinely new territory. And because Hazelcast had recently added a VectorCollection data structure and vector search capability — still in beta at the time — I was eager to incorporate it. So I asked Claude to design and implement something. I should have kept a close eye at every stage; instead I took more of an “I’ll review everything when you’re done” approach.

I was in for a surprise.

What came back was a working vector search implementation. What did not come back was anything built on Hazelcast’s VectorCollection. Claude had built one from scratch — an IMap<String, float[]> for the embeddings, brute-force cosine similarity at query time. No HNSW indexing, no clever data structure, just compute the distance to every vector and sort the results. It worked. The “similar products” endpoint returned plausibly similar products.

This is exactly the thing creating so much fear and doomsaying around AI in the industry. If a coding assistant can reproduce the functionality of an Enterprise software feature — Enterprise edition, additional license cost — in a few hours, is all enterprise software an endangered species?

Not quite. Brute-force cosine similarity is O(n) per query — fine for a demo catalog, fine for a small product line, but not the same animal as Hazelcast’s Enterprise VectorCollection, which uses HNSW indexing to stay sub-millisecond at millions of vectors. That’s real engineering, and it took the Hazelcast team a lot longer than a few hours.

What’s more interesting is that I ended up with both. The accidental implementation became the Community Edition fallback in the framework. The Enterprise implementation took over once I corrected course and built what I’d originally asked for. So the framework now has a VectorStoreService interface with two backends — Enterprise gets HNSW, Community gets brute force, and both work. The Community story is no longer “vector search doesn’t work without a license”; it’s “vector search works fine for modest workloads without a license, and scales seriously if you upgrade.”

I’m not sure I’d have ended up there if Claude had built what I asked for the first time.

June 1, 2026

Vector Similarity Search on Hazelcast with HNSW

Part 5 in the “Building Event-Driven Microservices with Hazelcast” series

So far we’ve built an event sourcing framework, a Jet pipeline, materialized views, and a saga pattern for distributed transactions. All of that gives us a solid eCommerce backend where orders flow through services, stock gets reserved, payments get processed, and everything recovers gracefully when something goes wrong.

Now we’re going to add something different: “Find me products similar to this one.”

You’ve seen this everywhere. Netflix’s “Because you watched…” Spotify’s discovery playlists. Amazon’s “Customers who bought this also bought…” These features run on vector embeddings — numerical representations of items positioned in high-dimensional space so that similar items cluster together. It sounds exotic, but by the end of this post we’ll have it working in our framework with a real embedding model running locally, no API keys required.

Why Not Just Use Full-Text Search?

If you’ve been doing this for a while, your reflex for “find similar products” is probably full-text search. Elasticsearch, Solr, maybe Postgres full-text indexes. And those are genuinely good tools for what they do — if someone types “gaming laptop,” full-text search finds documents containing the words “gaming” and “laptop.”

But try searching for “portable computer for games.” Or “high-performance notebook for esports.” Semantically identical. Zero shared keywords. Full-text search won’t connect them because it’s matching tokens, not meaning.

Embedding-based similarity works at a different level entirely. A trained model — we’re using all-MiniLM-L6-v2 — has learned from millions of text pairs that “gaming laptop” and “portable computer for games” mean the same thing. It places them near each other in vector space regardless of whether the words overlap. The model doesn’t care about your vocabulary. It cares about your intent.

In production you’d probably combine both approaches: full-text for keyword lookups and structured queries, vector similarity for the “more like this” discovery path. But for recommendations and product discovery, embeddings are the right tool.

What Are Vector Embeddings?

A vector embedding is a fixed-size array of floats that captures an item’s semantic characteristics. Items with similar meaning end up with vectors pointing in similar directions:

A 2D projection of a 384-dimension semantic vector space: Gaming Laptop, Gaming Desktop, and the phrase portable computer for games cluster together with high cosine similarity, while Running Shoes sits far away with low similarity

“Gaming Laptop” and “Gaming Desktop” are nearby in 384-dimensional space. “Running Shoes” is off in a different neighborhood. You measure similarity by computing the cosine of the angle between two vectors — vectors pointing the same direction score close to 1.0, perpendicular vectors score 0, opposite vectors score -1.0.
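The cosine calculation itself is only a few lines. The sketch below is a plain-Java illustration of the formula (dot product divided by the product of the two magnitudes); the class name is ours, and returning 0 for a zero-length vector is a convention we chose for the example:

```java
public class CosineSketch {

    // cos(theta) = (a . b) / (|a| * |b|)
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumed convention: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] east = {1f, 0f};
        float[] north = {0f, 1f};
        float[] west = {-1f, 0f};
        System.out.println(cosine(east, east));
        System.out.println(cosine(east, north));
        System.out.println(cosine(east, west));
    }
}
```

Because the result depends only on direction, a long product description and a short one can still score close to 1.0 if they mean the same thing.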

Making Embeddings Pluggable

We don’t want the framework married to a specific embedding model. Maybe you’re fine with the default local model. Maybe you need OpenAI’s embeddings for higher accuracy on your domain. Maybe you’ve trained your own. So embedding generation is behind an interface:

public interface EmbeddingProvider {
    float[] embed(String text);
    int getDimension();
    String getModelName();
}

embed() takes text, returns a vector. getDimension() is there so callers can verify compatibility with the vector store’s configured dimension: if you swap models and forget to update the config, you want a clear error, not silent data corruption. getModelName() is just for logging.
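A startup check built on getDimension() might look like the sketch below. The verifyDimension method and the fixedProvider test double are hypothetical names we made up for the example, and the interface is repeated only so the snippet compiles on its own:

```java
public class DimensionCheckSketch {

    // Mirrors the interface shown above so the sketch is self-contained.
    public interface EmbeddingProvider {
        float[] embed(String text);
        int getDimension();
        String getModelName();
    }

    // Hypothetical guard: fail loudly at startup instead of at the first search.
    public static void verifyDimension(EmbeddingProvider provider, int configuredDimension) {
        if (provider.getDimension() != configuredDimension) {
            throw new IllegalStateException(String.format(
                    "Model %s produces %d-dimension vectors, but the vector store is configured for %d",
                    provider.getModelName(), provider.getDimension(), configuredDimension));
        }
    }

    // Test double: returns zero vectors of a fixed size.
    public static EmbeddingProvider fixedProvider(String name, int dimension) {
        return new EmbeddingProvider() {
            @Override public float[] embed(String text) { return new float[dimension]; }
            @Override public int getDimension() { return dimension; }
            @Override public String getModelName() { return name; }
        };
    }

    public static void main(String[] args) {
        verifyDimension(fixedProvider("all-MiniLM-L6-v2", 384), 384); // matches, no error
        try {
            verifyDimension(fixedProvider("text-embedding-3-small", 1536), 384);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```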

The Default Model

Out of the box, the framework uses LangChain4j’s all-MiniLM-L6-v2. It’s a sentence transformer that runs locally via ONNX Runtime — no API key, no external service, no per-call cost. It produces 384-dimension vectors and captures genuine semantic similarity.

public class LangChain4jEmbeddingProvider implements EmbeddingProvider {

    private final AllMiniLmL6V2EmbeddingModel model;

    public LangChain4jEmbeddingProvider() {
        this.model = new AllMiniLmL6V2EmbeddingModel();
    }

    @Override
    public float[] embed(final String text) {
        return model.embed(text).content().vector();
    }

    @Override
    public int getDimension() {
        return 384;
    }

    @Override
    public String getModelName() {
        return "all-MiniLM-L6-v2";
    }
}

The ONNX runtime takes about a second to load on first call, then it’s thread-safe and fast. The auto-configuration creates it as a @ConditionalOnMissingBean — define your own EmbeddingProvider bean and the default steps aside.

I want to be clear: this is a real model, not a demo placeholder. “Gaming Laptop” and “Portable Computer for Games” genuinely show up as similar even though they share almost no words.

HNSW: Searching Vectors Without Scanning Everything

Once you have vectors, you need to search them. The brute-force approach compares your query vector against every stored vector — O(n) per query. Fine for hundreds of products. Not fine for a million.

HNSW (Hierarchical Navigable Small World) is the standard answer. It builds a multi-layer graph over your vector space — think of it like a skip list but for geometric proximity. The top layers have sparse, long-range connections for coarse navigation. The bottom layers have dense, short-range connections for precision. You search by starting at the top, greedily navigating toward the query vector, then dropping down to finer layers. The result is O(log n) search with high recall.

There are three tuning knobs:

Parameter	Controls	Default
maxDegree (M)	Max edges per graph node	16
efConstruction	Beam width during index build — higher means better recall, slower build	200
metric	Distance function: COSINE, DOT, or EUCLIDEAN	COSINE

For a product catalog, the defaults are fine. You’d tune these if you were indexing millions of items and needed to trade off recall against memory or build time.

Storing and Searching Vectors

The vector store is exposed through VectorStoreService:

public interface VectorStoreService {

    void storeEmbedding(String id, float[] embedding, Map<String, Object> metadata);

    List<SimilarityResult> findSimilar(float[] queryVector, int limit);

    List<SimilarityResult> findSimilarById(String id, int limit);

    boolean isAvailable();

    String getImplementationType();
}

Notice it takes float[]. The vector store doesn’t know or care how the embeddings were generated — the EmbeddingProvider produces vectors, the VectorStoreService stores and searches them. Two concerns, cleanly separated.

SimilarityResult is just a record: (String id, float score, Map<String, Object> metadata).

The Enterprise Path: VectorCollection

Hazelcast Enterprise has a native VectorCollection data structure with built-in HNSW indexing. Our HazelcastVectorStoreService wraps it:

public HazelcastVectorStoreService(HazelcastInstance hazelcast,
                                   VectorStoreProperties properties) {
    this.indexName = properties.getIndexName();

    Metric metric = Metric.valueOf(properties.getMetric().toUpperCase());

    VectorIndexConfig indexConfig = new VectorIndexConfig()
            .setName(indexName)
            .setDimension(properties.getDimension())
            .setMetric(metric)
            .setMaxDegree(properties.getMaxConnections())
            .setEfConstruction(properties.getEfConstruction());

    VectorCollectionConfig collectionConfig =
            new VectorCollectionConfig(properties.getCollectionName())
                    .addVectorIndexConfig(indexConfig);

    hazelcast.getConfig().addVectorCollectionConfig(collectionConfig);
    this.collection = VectorCollection.getCollection(
            hazelcast, properties.getCollectionName());
}

Storing an embedding wraps the metadata and vector into a VectorDocument:

@Override
public void storeEmbedding(String id, float[] embedding,
                           Map<String, Object> metadata) {
    String jsonMetadata = metadataToJson(metadata);
    VectorDocument<String> document = VectorDocument.of(
            jsonMetadata,
            VectorValues.of(indexName, embedding)
    );
    collection.putAsync(id, document).toCompletableFuture().join();
}

And searching is a single async call to the HNSW index:

@Override
public List<SimilarityResult> findSimilar(float[] queryVector, int limit) {
    SearchResults<String, String> searchResults = collection.searchAsync(
            VectorValues.of(indexName, queryVector),
            SearchOptions.builder()
                    .limit(limit)
                    .includeValue()
                    .build()
    ).toCompletableFuture().join();

    List<SimilarityResult> results = new ArrayList<>();
    for (SearchResult<String, String> hit : searchResults) {
        Map<String, Object> metadata = jsonToMetadata(hit.getValue());
        results.add(new SimilarityResult(hit.getKey(), hit.getScore(), metadata));
    }
    return results;
}

For comparison — brute-force IMap scans are O(n) per query; HNSW is O(log n). For 1,000 products the difference is negligible. For 1,000,000, it’s the difference between usable and not. That same O(n) IMap scan is exactly what the Community fallback uses, which we’ll look at next.

The Community Path: Brute Force, but It Works

Not everyone has Hazelcast Enterprise. The Community fallback is a SimpleVectorStoreService that stores embeddings in an ordinary Hazelcast IMap and answers queries with a brute-force O(n) cosine scan over every stored vector. The EmbeddingProvider still runs — it’s local ONNX, no license needed — and now the vectors it produces actually get stored and searched. It reports isAvailable() = true, so the similarity endpoint returns real results. It’s less scalable than the Enterprise HNSW path — that O(n) scan grows linearly with the catalog — but for development, testing, and small-to-moderate catalogs it’s fully operational.
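The scan is simple enough to sketch in full. The version below uses a plain java.util.Map as a stand-in for the Hazelcast IMap and returns id/score pairs; it illustrates the O(n) approach and is not the actual SimpleVectorStoreService code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BruteForceScanSketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Scores every stored vector against the query, then keeps the best `limit`.
    public static List<Map.Entry<String, Double>> findSimilar(
            Map<String, float[]> store, float[] query, int limit) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : store.entrySet()) {
            scored.add(Map.entry(e.getKey(), cosine(query, e.getValue())));
        }
        // Highest similarity first
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return new ArrayList<>(scored.subList(0, Math.min(limit, scored.size())));
    }

    public static void main(String[] args) {
        Map<String, float[]> store = new LinkedHashMap<>(); // stand-in for the IMap
        store.put("gaming-laptop", new float[]{1.0f, 0.0f});
        store.put("gaming-desktop", new float[]{0.9f, 0.1f});
        store.put("running-shoes", new float[]{0.0f, 1.0f});

        for (Map.Entry<String, Double> hit : findSimilar(store, new float[]{1.0f, 0.0f}, 2)) {
            System.out.println(hit.getKey() + " " + hit.getValue());
        }
    }
}
```

Every query touches every vector, which is the whole trade-off: exact results, no index to build or maintain, and a cost that grows with the catalog.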

How It All Wires Together

The module layout keeps the Enterprise dependency isolated:

framework-core/                          (always built)
  └── EmbeddingProvider (interface)
  └── LangChain4jEmbeddingProvider (default, all-MiniLM-L6-v2)
  └── VectorStoreService (interface)
  └── SimpleVectorStoreService (Community fallback, IMap + O(n) cosine scan)
  └── VectorStoreAutoConfiguration

framework-enterprise/                    (only with -Penterprise)
  └── HazelcastVectorStoreService (VectorCollection + HNSW)
  └── EnterpriseVectorStoreAutoConfiguration

The Enterprise auto-configuration uses @AutoConfigureBefore so it registers first. If it creates a HazelcastVectorStoreService bean, the core @ConditionalOnMissingBean sees it and skips the Community fallback. If the Enterprise module isn’t on the classpath — which is the default — the SimpleVectorStoreService takes over.

@Configuration
@EnableConfigurationProperties(VectorStoreProperties.class)
@ConditionalOnBean(HazelcastInstance.class)
public class VectorStoreAutoConfiguration {

    @Bean
    @ConditionalOnMissingBean(EmbeddingProvider.class)
    public EmbeddingProvider embeddingProvider() {
        return new LangChain4jEmbeddingProvider();
    }

    @Bean
    @ConditionalOnMissingBean(VectorStoreService.class)
    public VectorStoreService vectorStoreService(HazelcastInstance hazelcast,
                                                 VectorStoreProperties properties) {
        return new SimpleVectorStoreService(hazelcast, properties);
    }
}

Build with mvn clean install for Community (the default). Build with mvn clean install -Penterprise to include the Enterprise module. The runtime figures out the rest.

This is the same edition-aware pattern we use for other Enterprise-only features — CP Subsystem, HD Memory, TLS. Define the interface in framework-core, put the Enterprise implementation in framework-enterprise, let Spring’s auto-configuration ordering handle the selection. Community Edition always works. Enterprise features activate when you add the module and license.

From Product Creation to Searchable Embedding

When a product is created, the InventoryService builds a text representation from its name, description, and category, then asks the EmbeddingProvider to turn that into a vector:

private void storeProductEmbedding(final Product product) {
    StringBuilder text = new StringBuilder();
    text.append(product.getName());
    if (product.getDescription() != null) {
        text.append(" ").append(product.getDescription());
    }
    if (product.getCategory() != null) {
        text.append(" ").append(product.getCategory());
    }

    float[] embedding = embeddingProvider.embed(text.toString());

    Map<String, Object> metadata = new HashMap<>();
    metadata.put("name", product.getName());
    metadata.put("category", product.getCategory());

    vectorStoreService.storeEmbedding(product.getProductId(), embedding, metadata);
}

The whole thing is wrapped in a try/catch as a best-effort operation. On either edition the embedding gets stored — in the Enterprise VectorCollection or the Community IMap — and if the vector store somehow isn’t available, the failure is swallowed so product creation still succeeds. The similarity feature is additive, not load-bearing.
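The listing above omits that wrapper, so here is a rough sketch of the best-effort shape. The bestEffort helper, its boolean return value, and the use of java.util.logging are assumptions for illustration; the framework's own try/catch may log and report differently:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class BestEffortSketch {

    private static final Logger LOG = Logger.getLogger(BestEffortSketch.class.getName());

    // Runs the action and reports whether it succeeded.
    // A runtime failure is logged and swallowed so the caller can carry on.
    public static boolean bestEffort(String description, Runnable action) {
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            LOG.log(Level.WARNING, "Skipping " + description + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for storeProductEmbedding(product) when the vector store is down
        boolean stored = bestEffort("embedding for product p-1", () -> {
            throw new IllegalStateException("vector store unavailable");
        });
        System.out.println("embedding stored: " + stored);
        System.out.println("product creation continues either way");
    }
}
```

Logging the failure matters here: a swallowed exception with no trace is how a product ends up permanently missing from similarity results without anyone noticing.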

The REST Endpoint

The Inventory Service exposes a similarity search endpoint:

@GetMapping("/{productId}/similar")
public ResponseEntity<SimilarProductsResponse> findSimilarProducts(
        @PathVariable String productId,
        @RequestParam(defaultValue = "5") int limit) {

    if (!productService.productExists(productId)) {
        return ResponseEntity.notFound().build();
    }

    if (!vectorStoreService.isAvailable()) {
        return ResponseEntity.ok(new SimilarProductsResponse(
            productId, false,
            vectorStoreService.getImplementationType(),
            "Vector similarity search is not available.",
            List.of()
        ));
    }

    List<SimilarityResult> results = vectorStoreService.findSimilarById(productId, limit);

    List<ProductDTO> similarProducts = new ArrayList<>();
    for (SimilarityResult result : results) {
        productService.getProduct(result.id())
            .ifPresent(product -> similarProducts.add(product.toDTO()));
    }

    return ResponseEntity.ok(new SimilarProductsResponse(
        productId, true,
        vectorStoreService.getImplementationType(),
        "Found " + similarProducts.size() + " similar products",
        similarProducts
    ));
}

The response shape is the same regardless of edition: the client gets vectorStoreAvailable: true/false and the getImplementationType() string so it knows which backend answered. Both editions return real similar products with similarity scores; the difference is underneath. Enterprise serves them from an HNSW index in O(log n), while Community runs the brute-force O(n) IMap scan. HNSW is an approximate index and the brute-force scan is exact, so the two can occasionally disagree on a borderline match, but on a small catalog you should expect the same results with different scaling characteristics.

Configuration

All the vector store parameters live under framework.vectorstore in your application YAML:

framework:
  vectorstore:
    collection-name: product-vectors
    dimension: 384               # Must match EmbeddingProvider.getDimension()
    max-connections: 16          # HNSW maxDegree (M)
    ef-construction: 200         # HNSW build beam width
    metric: COSINE               # COSINE, DOT, or EUCLIDEAN
    index-name: default          # HNSW index name

The dimension has to match whatever your EmbeddingProvider produces. The default (384) matches all-MiniLM-L6-v2. If you swap in OpenAI’s text-embedding-3-small (1536 dimensions), update this property or you’ll get index errors that are confusing to debug.

Trying It Out

With the Docker stack running:

# Load sample products (creates embeddings automatically)
./scripts/load-sample-data.sh

# Find products similar to a known product
curl "http://localhost:8082/api/products/<product-id>/similar?limit=5"

# Or run the demo scenario
./scripts/demo-scenarios.sh 4

Demo scenario 4 detects the edition, looks up a product, calls the similarity endpoint, and displays results with appropriate messaging for either edition.

What’s Next

Bring your own model. The EmbeddingProvider interface makes swapping models easy. Define a bean, return vectors, done:

@Bean
public EmbeddingProvider embeddingProvider() {
    return new EmbeddingProvider() {
        private final OpenAiEmbeddingModel model = /* your config */;

        @Override
        public float[] embed(String text) {
            return model.embed(text).content().vector();
        }

        @Override
        public int getDimension() { return 1536; }

        @Override
        public String getModelName() { return "text-embedding-3-small"; }
    };
}

OpenAI, Cohere, any Sentence-Transformer via LangChain4j’s ONNX integration — just update framework.vectorstore.dimension to match.

Hybrid search is the obvious next step: “find products similar to this laptop, but only in Electronics and under $1000.” That combines a VectorCollection similarity search with IMap predicate filtering. It’s also exactly the kind of natural-language query that works well with AI-driven orchestration — an LLM can decompose that request into a similarity lookup plus attribute filters, call the right APIs, and merge the results. We’ll get into that in an upcoming post on the Model Context Protocol (MCP), which gives AI models structured access to our microservices.
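One simple way to approximate hybrid search, before doing anything clever inside the index, is to over-fetch similarity results and then filter them by attribute. The sketch below assumes the candidates arrive already ranked by similarity and are represented as plain metadata maps; the class, method, and field names are ours, and a real implementation would more likely push the filter down into the store:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class HybridFilterSketch {

    // Walks the similarity-ranked candidates in order and keeps the first
    // `limit` that also satisfy the attribute filter.
    public static List<Map<String, Object>> filterCandidates(
            List<Map<String, Object>> rankedCandidates,
            Predicate<Map<String, Object>> attributeFilter,
            int limit) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> candidate : rankedCandidates) {
            if (kept.size() >= limit) {
                break;
            }
            if (attributeFilter.test(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Pretend these came back from a similarity search, best match first
        List<Map<String, Object>> ranked = List.of(
                Map.of("name", "Gaming Laptop Pro", "category", "Electronics", "price", 1499.0),
                Map.of("name", "Budget Laptop", "category", "Electronics", "price", 599.0),
                Map.of("name", "Laptop Backpack", "category", "Accessories", "price", 49.0));

        Predicate<Map<String, Object>> electronicsUnder1000 = p ->
                "Electronics".equals(p.get("category"))
                        && ((Number) p.get("price")).doubleValue() < 1000;

        System.out.println(filterCandidates(ranked, electronicsUnder1000, 5));
    }
}
```

The weakness of post-filtering is that a strict filter can leave you with fewer results than requested, which is why you would fetch more candidates than the final limit.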

Multi-modal search is possible too. Hazelcast’s VectorCollection supports multiple named indexes on a single collection — one for text embeddings, another for image embeddings. Same data structure, different similarity dimensions.

Next up: AI-Powered Microservices with the Model Context Protocol

Previous: Saga Pattern: Distributed Transactions Without 2PC

June 1, 2026

On Debugging by Assumption

A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” — Mark Twain

Measurement is better than guessing. Who knew?

When the saga implementation was finished and we first ran through the test scenarios, saga timeouts piled up whenever a run lasted more than a few minutes. (5 minutes was great; 30 minutes was ugly.)

I didn’t ask Claude to investigate, or do any analysis of my own, because I had a pretty good suspicion what was going on. Everything was running on a single laptop — 16GB of memory split across a 3-node Hazelcast cluster, 4 services each running an embedded Hazelcast node, Docker itself, and I really hadn’t bothered to shut off my normal desktop workload. Web browser, email, whatever else. I figured I’d maxed the poor thing out and probably wouldn’t get a clean timeout-free run until I deployed to a multi-node cluster in the cloud.

That didn’t happen for some time. I’m cheap, and I wasn’t going to pay for cloud resources until I had a full-blown demo ready to go. When the day came (much later in the story, if I were telling it chronologically), I saw the same pattern of timeouts. Turns out, it was never thread starvation or lack of resources. It was a combination of things, and none of them was what I’d assumed.

When faced with this reality, I asked Claude to troubleshoot the issue, and this is one of the times I was most impressed with how Claude approached a problem compared to how I would have.

In most debugging scenarios, I look only until I find the first reasonable suspect. Why keep looking if you’ve already found what you’re looking for? Fix, rebuild, retest, and on a good day, that’s the end of it. On a bad day, you’re still looking at the same issue, so you start hunting for suspect #2. Lather, rinse, repeat.

Claude came back with four identified problems. The main one was subtle: we generated product data up front and gave each item a reasonable starting stock. As the demo ran, orders depleted the stock, and eventually the inventory service started throwing InsufficientStockException — correct behavior, you can’t sell what you don’t have. But the circuit breaker we’d added for resilience was treating that business error the same as an infrastructure failure. Enough “failures” in the sliding window and the circuit breaker tripped open, rejecting all orders — including ones for products that still had stock. Sagas piled up with nowhere to go, the timeout detector found hundreds of them every cycle, and the system drowned in compensation events. At the peak: 64,000 timeouts from 53,000 sagas started.

The other three fixes addressed related gaps. Business failures like out-of-stock now trigger immediate saga compensation instead of waiting for the timeout detector to notice. A NonRetryableException marker interface tells the circuit breaker not to count deterministic business errors against the failure rate. And an automatic stock replenishment monitor keeps the demo in a steady state where orders can actually succeed for hours instead of wedging after the first few minutes.
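The marker-interface idea can be sketched with a small count-based sliding window. This is an illustration under assumptions, not the framework's circuit breaker: NonRetryableException and InsufficientStockException are re-declared locally with the names used above, and FailureWindow, recordError, and shouldOpen are hypothetical. In this sketch, errors carrying the marker are skipped entirely, so they never move the failure rate.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Marker interface named in the post; the framework's definition may differ.
interface NonRetryableException {}

// Deterministic business error: retrying the same order will not help.
class InsufficientStockException extends RuntimeException
        implements NonRetryableException {
    InsufficientStockException(String message) { super(message); }
}

// Hypothetical count-based sliding window over the most recent call outcomes.
final class FailureWindow {
    private final int size;
    private final double failureRateThreshold;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private int failures;

    FailureWindow(int size, double failureRateThreshold) {
        this.size = size;
        this.failureRateThreshold = failureRateThreshold;
    }

    void recordSuccess() { record(false); }

    void recordError(Throwable error) {
        // Business errors mean the downstream service answered correctly,
        // so they are not recorded as infrastructure failures.
        if (error instanceof NonRetryableException) {
            return;
        }
        record(true);
    }

    private void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        // Evict the oldest outcome once the window is over capacity.
        if (outcomes.size() > size && outcomes.removeFirst()) {
            failures--;
        }
    }

    // Open only when the window is full and the failure rate reaches the threshold.
    boolean shouldOpen() {
        return outcomes.size() == size
            && (double) failures / size >= failureRateThreshold;
    }
}
```

With this rule, a long run of out-of-stock responses leaves the window untouched, while genuine infrastructure errors still trip the breaker once they fill enough of the window.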

I should have investigated the saga timeouts when they first appeared, rather than assuming the problem would magically go away with more hardware. And when I did get around to investigating, Claude’s approach of identifying all the contributing problems at once was considerably more effective than my usual one-suspect-at-a-time strategy.


June 1, 2026