Author: Mike

Event Sourcing Observability: Prometheus, Grafana, Jaeger

Part 12 in the “Building Event-Driven Microservices with Hazelcast” series

Over the past eleven posts, we’ve built an event sourcing framework, a Jet pipeline, materialized views, sagas, circuit breakers, an outbox, dead letter queues, and durable persistence. That’s a lot of moving parts.

Now: how do you observe what’s happening inside all of them?

Event sourcing changes the observability game. Traditional request-response applications have easy metrics — request rate, error rate, latency. In an event-sourced system, a single API call triggers an asynchronous pipeline that writes to an event store, updates a materialized view, publishes to subscribers, and potentially kicks off a multi-service saga. A latency spike could be hiding in any of those stages. You need to see into all of them.

This post builds a complete observability stack: Prometheus + Micrometer for metrics, Grafana for dashboards and alerting, Jaeger for distributed tracing.

How It Fits Together

Observability stack architecture: four Spring Boot services on ports 8081 through 8084 expose /actuator/prometheus endpoints that Prometheus scrapes every 15 seconds on port 9090, and export OTLP spans to Jaeger on port 16686; Grafana on port 3000 reads from Prometheus and renders five auto-provisioned dashboards with six alerts

Each service exposes /actuator/prometheus. Prometheus scrapes all four every 15 seconds. Grafana reads from Prometheus and renders dashboards. Jaeger collects distributed traces via OTLP.

Instrumenting the Framework

Metrics Architecture

An event-sourced system has a lot of moving parts, and each one needs its own instrumentation. The framework provides roughly 70 metrics across a dozen categories. They’re organized around two core subsystems — the event pipeline and the saga layer — plus several supporting categories for everything else.

Pipeline Metrics

PipelineMetrics tracks every event through the 6-stage pipeline:

public class PipelineMetrics {

    private final MeterRegistry registry;
    private final String domainName;

    // Events entering and leaving the pipeline
    Counter eventsReceived;   // "eventsourcing.pipeline.events.received"
    Counter eventsProcessed;  // "eventsourcing.pipeline.events.processed"
    Counter eventsFailed;     // "eventsourcing.pipeline.events.failed"

    // End-to-end latency histogram with percentiles
    Timer endToEndLatency;    // "eventsourcing.pipeline.latency.end_to_end"
    Timer queueWaitLatency;   // "eventsourcing.pipeline.latency.queue_wait"

    // Per-stage timing
    Timer stageDuration;      // "eventsourcing.pipeline.stage.duration"
                              // Tagged with stage: persist, update_view, publish
}

Every metric is tagged with domain (e.g., “Customer”, “Order”) and eventType (e.g., “CustomerCreated”), so you can filter as narrowly as you need:

Counter.builder("eventsourcing.pipeline.events.processed")
    .tag("domain", domainName)
    .tag("eventType", eventType)
    .register(registry);

The per-stage timer is the one I find most useful for debugging. If P99 spikes, you can see which stage is the bottleneck — is it the event store write, the view update, or the publication step?

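public enum PipelineStage {
    SOURCE("source"),
    ENRICH("enrich"),
    PERSIST("persist"),
    UPDATE_VIEW("update_view"),
    PUBLISH("publish"),
    COMPLETE("complete");

    private final String label;   // value used for the "stage" tag

    PipelineStage(String label) { this.label = label; }

    public String getLabel() { return label; }
}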

public void recordStageTiming(PipelineStage stage, Instant start) {
    Timer.builder("eventsourcing.pipeline.stage.duration")
        .tag("domain", domainName)
        .tag("stage", stage.getLabel())
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(registry)
        .record(Duration.between(start, Instant.now()));
}

Saga Metrics

SagaMetrics tracks the lifecycle of distributed sagas:

public class SagaMetrics {

    private static final String PREFIX = "saga";

    // Lifecycle counters (tagged by sagaType)
    //   saga.started                Sagas initiated
    //   saga.completed              Sagas completed successfully
    //   saga.compensated            Sagas that required compensation
    //   saga.failed                 Sagas that failed
    //   saga.timedout               Sagas that exceeded their deadline

    // Step-level counters
    //   saga.steps.completed        Individual steps completed
    //   saga.steps.failed           Individual steps failed
    //   saga.compensation.started   Compensation processes initiated
    //   saga.compensation.steps     Compensation steps executed

    // Duration timers (p50, p95, p99)
    //   saga.duration               End-to-end saga duration
    //   saga.compensation.duration  Compensation duration
}

Beyond the Core

Pipeline and saga metrics tell you how events flow and how transactions coordinate. But a production system has more to watch. The framework instruments several additional subsystems:

The outbox pattern guarantees at-least-once delivery to the shared cluster. Metrics track entries written, claimed, delivered, and failed. If outbox.entries.written is climbing faster than outbox.entries.delivered, your delivery pipeline is falling behind.

Events that fail delivery repeatedly land in the DLQ. Three counters — dlq.entries.added, dlq.entries.replayed, dlq.entries.discarded — tell you whether poison messages are accumulating or getting resolved.

When PostgreSQL persistence is enabled, write/read latency, batch sizes, and error counts are tracked per map. High persistence.store.duration points to database bottlenecks.

The idempotency guard counts every duplicate check, tagged hit (duplicate blocked) or miss (new event). A steady stream of hits under normal operation is actually good news: it means at-least-once delivery is redelivering as designed and the guard is catching the duplicates.
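To make the hit/miss distinction concrete, here is a minimal sketch of a guard that records which event IDs it has seen and counts both outcomes. It is not the framework's implementation: the class and method names are hypothetical, LongAdder stands in for the tagged Micrometer counter, and the seen-ID set is unbounded, which a real guard would need to cap or expire.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative idempotency guard; all names here are hypothetical. */
final class IdempotencyGuardSketch {
    private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();
    private final LongAdder hits = new LongAdder();    // duplicates blocked
    private final LongAdder misses = new LongAdder();  // first-time events

    /** Returns true if the event is new and should be processed. */
    boolean firstDelivery(String eventId) {
        boolean isNew = seenEventIds.add(eventId);     // atomic check-and-record
        (isNew ? misses : hits).increment();
        return isNew;
    }

    long hitCount() { return hits.sum(); }

    long missCount() { return misses.sum(); }

    /** Fraction of checks that were duplicates, or 0.0 before any check. */
    double hitRatio() {
        long total = hits.sum() + misses.sum();
        return total == 0 ? 0.0 : (double) hits.sum() / total;
    }
}
```

Delivering "evt-1" twice and "evt-2" once gives one hit and two misses, so the hit ratio is one third.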

Circuit breaker state, failure rates, and retry outcomes from Resilience4j are exposed automatically. When resilience4j_circuitbreaker_state flips to OPEN, a downstream service is in trouble.

And then there are business metrics — revenue, order item counts, customer totals, inventory replenishment. These are the ones business stakeholders actually care about. They bridge the gap between “pipeline is fast” and “orders are generating revenue.”

The complete catalog of every metric name, tag, Prometheus mapping, and troubleshooting guide is in the Metrics Reference Guide.

Caching Metric Instances

At 100,000+ events per second, looking up a Counter in Micrometer’s registry on every event is measurable overhead. SagaMetrics caches metric instances in a ConcurrentHashMap:

private final ConcurrentMap<String, Counter> counterCache;
private final ConcurrentMap<String, Timer> timerCache;

private Counter getCounter(String name, String sagaType) {
    String key = name + ":" + sagaType;
    return counterCache.computeIfAbsent(key, k ->
            Counter.builder(PREFIX + "." + name)
                    .tag("sagaType", sagaType)
                    .register(meterRegistry)
    );
}

Small detail, but at high throughput every nanosecond in the hot path matters.
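The effect of the cache is easy to see without Micrometer on the classpath. The sketch below is a stand-in, not framework code: LongAdder plays the role of a Counter, and the registrations field (my addition, purely for illustration) counts how often the builder lambda actually runs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

/** Stand-in for the caching pattern; LongAdder plays the role of a Micrometer Counter. */
final class MeterCacheSketch {
    private final ConcurrentMap<String, LongAdder> counterCache = new ConcurrentHashMap<>();
    private final AtomicInteger registrations = new AtomicInteger(); // times the "builder" ran

    LongAdder counter(String name, String sagaType) {
        String key = name + ":" + sagaType;
        return counterCache.computeIfAbsent(key, k -> {
            registrations.incrementAndGet();  // the registry call would happen here, once per key
            return new LongAdder();
        });
    }

    int registrations() { return registrations.get(); }
}
```

A thousand increments against the same name and sagaType trigger a single registration; every later call is one map lookup that returns the same instance.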

JVM and System Metrics

Beyond application metrics, the framework auto-registers JVM metrics via MetricsConfig:

@Configuration
public class MetricsConfig {

    @Bean public JvmMemoryMetrics jvmMemoryMetrics()   { return new JvmMemoryMetrics(); }
    @Bean public JvmGcMetrics jvmGcMetrics()           { return new JvmGcMetrics(); }
    @Bean public JvmThreadMetrics jvmThreadMetrics()    { return new JvmThreadMetrics(); }
    @Bean public ClassLoaderMetrics classLoaderMetrics() { return new ClassLoaderMetrics(); }
    @Bean public ProcessorMetrics processorMetrics()    { return new ProcessorMetrics(); }
}

Common tags applied to every metric enable cross-service filtering:

registry.config().commonTags(Arrays.asList(
    Tag.of("application", applicationName),
    Tag.of("version", applicationVersion)
));

Every metric — JVM heap usage, saga completion rates, business revenue — can be filtered by service name in Grafana.

Configuring Prometheus

Service-Side: Spring Boot Actuator

Each service exposes metrics via Actuator with Prometheus export:

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics,info
  endpoint:
    health:
      show-details: always
  prometheus:
    metrics:
      export:
        enabled: true   # Spring Boot 3.x path; 2.x used management.metrics.export.prometheus.enabled
  metrics:
    tags:
      application: ${spring.application.name}

This creates the /actuator/prometheus endpoint that Prometheus scrapes.

Prometheus-Side: Scrape Configuration

Prometheus scrapes all four services and the Hazelcast cluster:

# docker/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'ecommerce-demo'

scrape_configs:
  - job_name: 'account-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['account-service:8081']

  - job_name: 'inventory-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['inventory-service:8082']

  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['order-service:8083']

  - job_name: 'payment-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['payment-service:8084']

  - job_name: 'hazelcast'
    metrics_path: '/hazelcast/rest/cluster'
    static_configs:
      - targets:
          - 'hazelcast-1:5701'
          - 'hazelcast-2:5701'
          - 'hazelcast-3:5701'

Each service gets its own job, so Prometheus automatically labels its metrics with job="account-service", job="inventory-service", and so on.

Grafana Dashboards

Dashboard Strategy

Five pre-provisioned dashboards, each focused on a different operational concern:

Dashboard	Focus	Key Question It Answers
System Overview	Service health and throughput	“Is everything running? How much traffic?”
Event Flow	Pipeline performance	“How fast are events processing? Where are bottlenecks?”
Materialized Views	View update performance	“Are views keeping up with events?”
Saga Dashboard	Distributed transaction health	“Are sagas completing? Any failures or timeouts?”
Business Overview	Revenue, orders, customers	“Is the business healthy? Are orders generating revenue?”

System Overview is the home dashboard — first thing you see when you open Grafana.

Auto-Provisioning

Dashboards, datasources, and alerts are all provisioned automatically. When Grafana starts, it reads configuration from mounted volumes:

docker/grafana/
├── dashboards/                    # Dashboard JSON files
│   ├── system-overview.json       # Auto-loaded as home dashboard
│   ├── event-flow.json
│   ├── materialized-views.json
│   ├── saga-dashboard.json
│   └── business-overview.json
└── provisioning/
    ├── datasources/
    │   └── datasources.yml        # Points to Prometheus
    ├── dashboards/
    │   └── dashboards.yml         # Tells Grafana where to find JSONs
    └── alerting/
        ├── alerts.yml             # Alert rule definitions
        ├── contactpoints.yml      # Notification channels
        └── policies.yml           # Routing policies

No manual setup. Run docker-compose up and the dashboards are ready.

Key Dashboard Panels

System Overview

At-a-glance health for the whole system: service health indicators (green/red per service based on up{job="…"}), event throughput by service, HTTP request rates, pipeline P95 latency, and a saga summary showing started, completed, failed, and timed out counts.

Saga Dashboard

Deep visibility into distributed transactions. Active saga count and compensating count — how many are in flight right now. Throughput charts for start, complete, and compensate rates, filterable by sagaType. Duration percentiles at P50, P95, P99. Success rate as a percentage. Timeout detection rate. Compensation breakdown — are any compensation steps failing?

The dashboard supports a $sagaType variable, so you can filter to just “OrderFulfillment” or “OrderFulfillmentOrchestrated” or view everything at once.

Event Flow

The pipeline performance dashboard: events published per second by service, end-to-end latency percentiles, queue wait latency (are events sitting around before processing starts?), a stacked stage duration breakdown at P95 for persist, update_view, and publish, and failed events by stage and type.

Business Overview

This one bridges technical and business concerns. Cumulative revenue over time, order rate with item counts, customer growth, saga success rate (what percentage of orders complete without compensation?), and end-to-end saga duration.

Alerting

Pre-Configured Alerts

Six alerts that cover the most common failure modes:

Saga Alerts

Alert	Severity	Condition	For
High Saga Failure Rate	Critical	increase(saga_failed_total[5m]) > 0	2 min
Saga Timeouts Detected	Warning	increase(saga_timeouts_detected_total[5m]) > 0	2 min
Saga Compensation Failures	Critical	increase(saga_compensations_failed_total[5m]) > 0	1 min
Low Saga Success Rate	Warning	Success rate < 90% over 10 minutes	5 min

Service Health Alerts

Alert	Severity	Condition	For
Service Down	Critical	up < 1 for any service	1 min
High Event Processing Error Rate	Warning	Error rate > 5% over 5 minutes	3 min

The “For” duration prevents flapping — a brief network blip won’t page you at 3am. Compensation failures fire fastest (1 minute) because a failed compensation means money or inventory is in an inconsistent state.
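The debouncing behind the "For" column can be sketched in a few lines. This is only an illustration of the idea, not Grafana's evaluator, and the class and method names are hypothetical. A breach starts a clock, any healthy evaluation resets it, and the alert fires only once the breach has lasted the whole window.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch of "for"-style alert debouncing; not Grafana's actual implementation. */
final class PendingAlertSketch {
    private final Duration forDuration;
    private Instant breachedSince;  // null while the condition is healthy

    PendingAlertSketch(Duration forDuration) {
        this.forDuration = forDuration;
    }

    /** Feed one evaluation; returns true only when the breach has lasted the full window. */
    boolean evaluate(boolean conditionBreached, Instant now) {
        if (!conditionBreached) {
            breachedSince = null;   // any healthy evaluation resets the clock
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now;    // first breached evaluation: pending, not firing
        }
        return Duration.between(breachedSince, now).compareTo(forDuration) >= 0;
    }
}
```

With a one-minute window, a breach that clears after 45 seconds never fires, and the next breach has to last a full minute on its own.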

Distributed Tracing with Jaeger

Metrics tell you that something is slow. Tracing tells you why.

In our system, a single order placement can touch four services: Order creates the order, Inventory reserves stock, Payment processes the charge, Order confirms. With metrics alone, you see “P99 saga duration increased.” With tracing, you see “Payment Service is taking 2 seconds to respond to StockReserved events.” That’s the difference between knowing there’s a problem and knowing where it is.

Configuration

Tracing is enabled via Spring Boot’s OpenTelemetry integration:

management:
  tracing:
    enabled: true
    sampling:
      probability: 1.0   # Sample 100% of requests (reduce in production)
  otlp:
    tracing:
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}

Jaeger runs as an all-in-one container in the Docker stack, receiving traces via OTLP on port 4317. In the Jaeger UI (http://localhost:16686), you pick a service, find traces for a time window, click into a trace to see the span waterfall across all four services, and identify which operation is contributing to latency.

The most valuable traces in an event-sourced system: the full path from API request to pipeline completion, saga flows (OrderCreated → StockReserved → PaymentProcessed → OrderConfirmed), and compensation flows — where did the failure occur, and how long did compensation take?

Useful PromQL Queries

Queries you can run in Prometheus or use in custom Grafana panels:

Service Health

# Are all services up?
up{job=~".*-service"}

# HTTP request rate by service
sum by (application) (rate(http_server_requests_seconds_count[5m]))

Event Pipeline

# Event throughput
rate(eventsourcing_pipeline_events_processed_total[5m])

# End-to-end P99 latency
histogram_quantile(0.99, rate(eventsourcing_pipeline_latency_end_to_end_seconds_bucket[5m]))

# Queue wait time (events waiting to be processed)
histogram_quantile(0.95, rate(eventsourcing_pipeline_latency_queue_wait_seconds_bucket[5m]))
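It helps to know what histogram_quantile does with those _bucket series, because the result is an estimate rather than an exact percentile. The sketch below reproduces the linear interpolation for classic histograms as I understand it. It assumes the timers publish histogram buckets at all, and the class name, bucket bounds, and counts are made up for illustration.

```java
/** Sketch of the linear interpolation behind histogram_quantile for classic histograms. */
final class HistogramQuantileSketch {

    /**
     * @param q           quantile in (0, 1]
     * @param upperBounds bucket upper bounds ("le" values), ascending; the last is +Infinity
     * @param cumulative  cumulative observation counts per bucket, in the same order
     */
    static double quantile(double q, double[] upperBounds, double[] cumulative) {
        if (upperBounds.length < 2) return Double.NaN;  // need a finite bucket plus +Inf
        int last = upperBounds.length - 1;
        double total = cumulative[last];
        if (total == 0) return Double.NaN;              // no observations in the window
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < last && cumulative[b] < rank) b++;

        // A rank that falls in the +Inf bucket is capped at the highest finite bound.
        if (b == last) return upperBounds[last - 1];

        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulative[b - 1];
        double inBucket = cumulative[b] - below;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * ((rank - below) / inBucket);
    }
}
```

With bounds 0.1, 0.5, 1.0, +Inf and cumulative counts 50, 90, 100, 100, the 0.95 quantile lands at 0.75 seconds: halfway through the 0.5 to 1.0 bucket. Bucket boundaries therefore limit how precise a reported P95 or P99 can be.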

Sagas

# Saga completion rate
rate(saga_completed_total[5m])

# Saga success rate (percentage)
100 * sum(saga_completed_total) /
  (sum(saga_completed_total) + sum(saga_compensated_total) +
   sum(saga_failed_total) + sum(saga_timedout_total))

# Saga duration P99
histogram_quantile(0.99, rate(saga_duration_seconds_bucket[5m]))

# Timeouts detected in the last 5 minutes
increase(saga_timeouts_detected_total[5m])
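The success-rate query is a ratio of one counter to the sum of four terminal outcomes. As a worked example with made-up counter values (the class and method names are hypothetical):

```java
/** Worked arithmetic for the saga success rate; the inputs are illustrative counter values. */
final class SagaSuccessRateSketch {

    /** Returns the success rate as a percentage, or NaN when no saga has finished yet. */
    static double successRatePercent(long completed, long compensated, long failed, long timedOut) {
        long finished = completed + compensated + failed + timedOut;
        return finished == 0 ? Double.NaN : 100.0 * completed / finished;
    }
}
```

With 940 completed, 40 compensated, 15 failed, and 5 timed out, the rate is 94%, which stays above the 90% threshold of the Low Saga Success Rate alert.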

Business

# Cumulative revenue
order_revenue_total

# Order items per second
rate(order_items_count_total[5m])

# Current customer count
account_customers_total

JVM

# Heap memory usage
jvm_memory_used_bytes{area="heap"}

# GC pause time (seconds spent paused per second, averaged over 5 minutes)
rate(jvm_gc_pause_seconds_sum[5m])

Lessons Learned

Don’t just measure HTTP latency. In an event-sourced system, the interesting latency is inside the pipeline — from event submission to view update. HTTP latency includes that but hides where the time is spent.

Multi-dimensional tags (domain, eventType, sagaType, stage) are not optional. A P99 spike in “pipeline latency” is useless without knowing which domain and stage are affected.

Cache your Counter and Timer instances at high throughput. Registry lookups add up. ConcurrentHashMap.computeIfAbsent works well.

Provision everything as code. Don’t create dashboards by hand — provision them from JSON files. Your observability stack is version-controlled, reproducible, and deploys automatically. When someone clones the repo and runs docker-compose up, they get the same dashboards as everyone else.

Alert on business outcomes, not just infrastructure. “Service Down” is an infrastructure alert. “Saga Failure Rate” is a business outcome alert. Both matter, but the business alerts catch problems that don’t manifest as service crashes — like a payment gateway returning errors, causing saga compensations to spike while all four services stay green.

The framework provides roughly 70 metrics organized into a dozen categories — pipeline throughput and per-stage latency, saga lifecycle tracking, outbox delivery, dead letter queues, persistence latency, circuit breaker state, business KPIs, JVM health, HTTP request rates. Combined with auto-provisioned Grafana dashboards, pre-configured alerts, and distributed tracing via Jaeger, you get complete visibility into a system where a single API call can trigger asynchronous processing across four services.

Event sourcing makes observability both harder and more important. Events are asynchronous, distributed, and flow through multiple stages. Without good metrics and dashboards, you’re flying blind. The Metrics Reference Guide has the complete catalog.

Previous: Hazelcast Write-Behind MapStore: Durable Event Sourcing

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 21, 2026

Hazelcast Write-Behind MapStore: Durable Event Sourcing

Part 11 in the “Building Event-Driven Microservices with Hazelcast” series

In Part 9 and Part 10, we finished the reliability and coordination layer — dead letter queues, idempotency guards, two saga patterns. But there’s been a fundamental gap this whole time: every piece of data lives exclusively in Hazelcast IMaps. A full cluster restart erases everything. The event store, the materialized views, the saga state. Gone.

For a demo that runs 30 minutes, that’s fine. For a production system — or even a trade show booth running for hours — it’s not. Events are the source of truth in an event-sourced system. Losing them means losing business history.

This post covers how we added durable persistence to the framework, primarily through Hazelcast’s MapStore mechanism but with one notable exception, without changing a single line of service business logic.

The Problem

Our event sourcing pipeline writes to several types of IMaps: the event store (Customer_ES, Product_ES, etc.), materialized views (Customer_VIEW, Order_VIEW, etc.), and supporting maps for saga state, the outbox, and the DLQ. All in-memory. Hazelcast’s IMap is fast precisely because it avoids disk I/O.

But that creates two problems. First, data loss on restart — the event log is gone, you can’t rebuild views or replay events or audit what happened. Second, unbounded memory growth — during a long-running demo, events accumulate indefinitely, the JVM runs out of heap, and the pod gets OOMKilled. We saw this happen at about the 45-minute mark under sustained load.

We need to persist to a durable store (PostgreSQL) while keeping the in-memory performance characteristics intact.

Why MapStore?

Hazelcast’s MapStore interface is the natural integration point. It’s a callback mechanism — Hazelcast calls your code whenever entries are written to or read from an IMap:

Write-behind MapStore data flow: the service calls IMap.put on the Hazelcast IMap and returns immediately with no database wait; the MapStore buffers writes and flushes them to PostgreSQL in asynchronous JDBC batches, while MapLoader reloads entries from PostgreSQL on a cache miss or cold start

We use write-behind mode. Hazelcast buffers writes and flushes them asynchronously in batches. The IMap.put() call returns immediately — the service never waits for PostgreSQL. This matters because our Jet pipeline calls put() on every event, and we can’t afford database latency in the hot path.

There’s also write-through mode (writeDelaySeconds=0), where every put() synchronously writes to the database. We don’t use it. It would negate the entire point of in-memory processing.

The MapStore also implements MapLoader, which Hazelcast calls on cache misses and cold starts. This gives us automatic rehydration: if a service restarts, the views reload from PostgreSQL without any special recovery code. No replay, no rebuild — the data is just there.

Architecture

The persistence layer splits across two modules, with provider-agnostic interfaces in framework-core and database-specific implementations in framework-postgres. The full design rationale is in ADR 012.

Provider-Agnostic Interfaces (framework-core)

Four persistence interfaces, one per map type:

public interface EventStorePersistence {
    void persist(String mapName, PersistableEvent event);
    void persistBatch(String mapName, List<PersistableEvent> events);
    Optional<PersistableEvent> loadEvent(String mapName, String mapKey);
    Iterable<String> loadAllKeys(String mapName);
    void delete(String mapName, String mapKey);
    boolean isAvailable();
}

ViewStorePersistence follows the same shape but uses upsert semantics — newer entries replace older ones for the same key. OutboxStorePersistence adds loadNonDeliveredKeys() for recovering in-flight entries on restart. DlqStorePersistence adds loadPendingKeys() for the same reason.

These interfaces know nothing about Hazelcast, GenericRecord, or Compact serialization. They operate on portable record types — PersistableEvent, PersistableView, PersistableOutboxEntry, PersistableDeadLetterEntry — simple Java records containing strings and longs. Clean enough for any JDBC-compatible database.

MapStore Adapters (framework-core)

EventStoreMapStore, ViewStoreMapStore, and OutboxMapStore implement Hazelcast’s MapStore interface and delegate to the persistence interfaces above. They handle key serialization (converting PartitionedSequenceKey<String> to a string format like seq:12345|key:cust-001), GenericRecord-to-JSON conversion via GenericRecordJsonConverter, and metadata extraction from GenericRecord fields.
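The key format is simple enough to sketch as a round trip. Only the string layout (seq:12345|key:cust-001) comes from the framework. The record below is a hypothetical stand-in for PartitionedSequenceKey<String>, and the parsing rules are my assumption.

```java
/** Hypothetical stand-in for the framework's key type; only the string format is from the post. */
record SeqKeySketch(long sequence, String partitionKey) {

    /** Renders e.g. "seq:12345|key:cust-001". */
    String toMapKey() {
        return "seq:" + sequence + "|key:" + partitionKey;
    }

    /** Parses the format produced by toMapKey(); splits on the first '|' so keys may contain '|'. */
    static SeqKeySketch fromMapKey(String mapKey) {
        int bar = mapKey.indexOf('|');
        if (!mapKey.startsWith("seq:") || bar < 0 || !mapKey.startsWith("key:", bar + 1)) {
            throw new IllegalArgumentException("Unrecognized map key: " + mapKey);
        }
        long sequence = Long.parseLong(mapKey.substring("seq:".length(), bar));
        return new SeqKeySketch(sequence, mapKey.substring(bar + 1 + "key:".length()));
    }
}
```

A string key like this keeps the database column a plain VARCHAR, and MapLoader can turn it back into the typed key on reload.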

There’s no DLQ MapStore adapter. The DLQ can’t use MapStore at all — more on why below.

PostgreSQL Implementation (framework-postgres)

PostgresEventStorePersistence uses JPA for single-record operations and JdbcTemplate.batchUpdate() for batches. Events use ON CONFLICT DO NOTHING (append-only — if it’s already there, leave it alone). Views and outbox entries use ON CONFLICT DO UPDATE (upsert — latest state wins).

Flyway manages the schema:

CREATE TABLE domain_events (
    map_name       VARCHAR(255) NOT NULL,
    map_key        VARCHAR(512) NOT NULL,
    aggregate_id   VARCHAR(255) NOT NULL,
    sequence       BIGINT NOT NULL,
    event_type     VARCHAR(255) NOT NULL,
    event_data     JSONB NOT NULL,
    timestamp_millis BIGINT NOT NULL,
    correlation_id VARCHAR(255),
    created_at     TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (map_name, map_key)
);

In-Memory Fallback (framework-core)

Each of the four persistence interfaces has a ConcurrentHashMap-backed in-memory implementation: InMemoryEventStorePersistence, InMemoryViewStorePersistence, InMemoryOutboxStorePersistence, InMemoryDlqStorePersistence. When framework.persistence.enabled=true but no PostgreSQL driver is on the classpath, the auto-configuration falls back to these. The persistence pipeline runs — good for testing the wiring — without requiring an actual database.
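The two write semantics map directly onto ConcurrentHashMap operations, which is roughly why an in-memory fallback can stay small. The sketch below is not the framework's InMemory* code. The class is hypothetical and stores JSON as plain strings rather than the Persistable* records.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of append-only vs upsert semantics on a ConcurrentHashMap; names are illustrative. */
final class InMemoryStoreSketch {
    private final Map<String, String> events = new ConcurrentHashMap<>();
    private final Map<String, String> views = new ConcurrentHashMap<>();

    /** Append-only, like ON CONFLICT DO NOTHING: an existing entry is left alone. */
    void persistEvent(String mapName, String mapKey, String json) {
        events.putIfAbsent(mapName + "/" + mapKey, json);
    }

    /** Upsert, like ON CONFLICT DO UPDATE: the latest state wins. */
    void persistView(String mapName, String mapKey, String json) {
        views.put(mapName + "/" + mapKey, json);
    }

    Optional<String> loadEvent(String mapName, String mapKey) {
        return Optional.ofNullable(events.get(mapName + "/" + mapKey));
    }

    Optional<String> loadView(String mapName, String mapKey) {
        return Optional.ofNullable(views.get(mapName + "/" + mapKey));
    }
}
```

Writing the same event key twice keeps the first payload, while writing the same view key twice keeps the second.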

How Write-Behind Works

The timeline of a single event being persisted:

Timeline of a single event: the service receives the request, creates the event, calls IMap.put and responds to the client within milliseconds; roughly five seconds later the Hazelcast write-behind timer fires and the MapStore issues a batched JDBC INSERT into PostgreSQL

The service responds in milliseconds. The database write follows about 5 seconds later (the default write-delay-seconds), batched with other events.

MapStore Behavior by Map Type

Aspect	Event Store (_ES)	View Store (_VIEW)	Outbox (framework_OUTBOX)
Write semantics	INSERT (append-only)	UPSERT (latest wins)	UPSERT (status transitions)
Coalescing	Disabled — each event is unique	Enabled — only latest state per key	Enabled — only latest status per entry
Initial load	LAZY — events loaded on demand	EAGER — all keys loaded on cold start	LAZY — non-delivered entries on demand

Coalescing is worth explaining. If a customer’s address changes three times during the five-second write-behind window, only the final state gets persisted. That’s correct for views — they represent current state, not history. Events are never coalesced because each one is a distinct historical fact. The outbox coalesces because entries transition through statuses (PENDING → CLAIMED → DELIVERED) and only the latest status matters for recovery.
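The difference between the two buffering modes shows up in what gets flushed. The sketch below is a single-threaded illustration with hypothetical names, not Hazelcast's write-behind queue: enqueue() stands in for what happens on IMap.put(), and drain() for the batch handed to the MapStore when the write-delay timer fires.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Single-threaded sketch of a write-behind buffer; not Hazelcast's implementation. */
final class WriteBehindBufferSketch {
    private final boolean coalescing;
    private final List<Map.Entry<String, String>> queue = new ArrayList<>(); // non-coalescing mode
    private final Map<String, String> latest = new LinkedHashMap<>();        // coalescing mode

    WriteBehindBufferSketch(boolean coalescing) {
        this.coalescing = coalescing;
    }

    /** Called on every put; returns without touching the database. */
    void enqueue(String key, String value) {
        if (coalescing) {
            latest.put(key, value);            // overwrites any pending write for the same key
        } else {
            queue.add(Map.entry(key, value));  // every write is kept
        }
    }

    /** Called when the write-delay timer fires; returns the batch to store and clears the buffer. */
    List<Map.Entry<String, String>> drain() {
        List<Map.Entry<String, String>> batch = new ArrayList<>();
        if (coalescing) {
            latest.forEach((k, v) -> batch.add(Map.entry(k, v)));
        } else {
            batch.addAll(queue);
        }
        latest.clear();
        queue.clear();
        return batch;
    }
}
```

Three address changes for one customer inside the window produce one row in the coalescing batch and three in the non-coalescing one, which is the distinction between a view and an event log.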

Bounded Memory with Eviction

Persistence unlocks something else: IMap eviction. Without a backing store, evicting an entry means losing it permanently. With a MapStore behind the map, evicted entries can be reloaded on demand via MapLoader.load().

This turns IMaps into bounded hot caches:

framework:
  persistence:
    enabled: true
    event-store-eviction:
      enabled: true
      max-size: 10000        # per node
      eviction-policy: LRU
    view-store-eviction:
      enabled: true
      max-size: 10000
      eviction-policy: LRU
      max-idle-seconds: 3600  # evict views idle > 1 hour

When the map reaches 10,000 entries, the least recently used ones get evicted. If a subsequent get() hits an evicted key, Hazelcast calls MapLoader.load(), reads from PostgreSQL, and puts the entry back. The service code never knows the difference — it’s the same IMap.get() call either way.

Memory stays bounded. No OOMKill after hours of continuous load. Hot data stays in-memory at sub-millisecond latency. Cold data reloads transparently.
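The evict-then-reload cycle can be sketched with a LinkedHashMap in access order. This is an illustration of the idea only, not Hazelcast's eviction code: the class is hypothetical, and the loader function stands in for MapLoader.load() reading from PostgreSQL.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of an LRU-bounded cache with load-on-miss; not Hazelcast's eviction code. */
final class BoundedCacheSketch<K, V> {
    private final Function<K, V> loader;  // stands in for MapLoader.load(key)
    private final Map<K, V> hot;
    private int loads;

    BoundedCacheSketch(int maxSize, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first in iteration order.
        this.hot = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;  // evict once the bound is exceeded
            }
        };
    }

    void put(K key, V value) {
        hot.put(key, value);
    }

    /** Same call for hot and cold keys; a miss falls through to the loader. */
    V get(K key) {
        V value = hot.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            if (value != null) hot.put(key, value);
        }
        return value;
    }

    boolean isHot(K key) { return hot.containsKey(key); }

    int loads() { return loads; }
}
```

The caller only ever sees get(). Whether the value came from memory or from the backing store shows up in latency and in the load counter, not in the API.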

The DLQ Exception: Direct Writes Instead of MapStore

The dead letter queue is the one map that can’t use MapStore. The reason traces directly back to the dual-instance Hazelcast architecture.

The event store, view store, and outbox all live on the embedded Hazelcast instance — the standalone one that runs Jet pipelines inside each service. MapStore is a server-side configuration: you attach it to a map on a Hazelcast member, Hazelcast calls your code when entries change. Works great because the embedded instance is a full member that the service controls.

The DLQ lives on the shared cluster, the external 3-node Hazelcast cluster that services connect to as clients. Services write to the DLQ via hazelcastClient. MapStore is configured on the members, and the shared cluster nodes don't have the service's persistence beans, so a service that connects as a client has no way to attach its own MapStore to that map.

So the DLQ does direct persistence writes. When HazelcastDeadLetterQueue.add(), replay(), or discard() is called, it writes to the IMap and calls DlqStorePersistence.persist() in the same method:

public void add(DeadLetterEntry entry) {
    GenericRecord record = toGenericRecord(entry);
    dlqMap.set(entry.getId(), record);
    persistIfAvailable(entry);
}

Synchronous, not write-behind. The trade-off is fine: DLQ entries are rare — they represent failures — so a database write in the hot path is negligible. If persistence itself fails, the entry is still in the IMap. The failure gets logged, but it doesn’t block the DLQ operation.
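The "persistence failure doesn't block the DLQ" behaviour comes down to a try/catch around the database call. Here is a minimal sketch of how persistIfAvailable might be shaped. It is an assumption about the structure, not the framework's code: the ConcurrentHashMap stands in for the IMap, a Consumer stands in for DlqStorePersistence, and the failure counter replaces logging.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Sketch of "write to the map, then best-effort persist"; names other than add() are hypothetical. */
final class DirectWriteDlqSketch {
    private final Map<String, String> dlqMap = new ConcurrentHashMap<>(); // stands in for the IMap
    private final Consumer<String> persistence;                           // null when persistence is off
    private int persistFailures;

    DirectWriteDlqSketch(Consumer<String> persistence) {
        this.persistence = persistence;
    }

    void add(String id, String entry) {
        dlqMap.put(id, entry);       // the in-memory write always happens first
        persistIfAvailable(entry);
    }

    private void persistIfAvailable(String entry) {
        if (persistence == null) return;  // no persistence bean: nothing to do
        try {
            persistence.accept(entry);    // synchronous database write
        } catch (RuntimeException e) {
            persistFailures++;            // a real implementation would log the failure here
        }
    }

    boolean contains(String id) { return dlqMap.containsKey(id); }

    int persistFailures() { return persistFailures; }
}
```

If the database is down, add() still succeeds and the entry stays visible in the map. The cost is that such an entry would not survive a restart until persistence recovers.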

On startup, loadFromPersistence() hydrates the IMap with PENDING entries from PostgreSQL. Terminal entries (REPLAYED, DISCARDED) aren’t recovered — they’ve already been handled.

Metrics and Observability

Every MapStore operation is instrumented with Micrometer, following the same ConcurrentHashMap-cached counter/timer pattern used by PipelineMetrics and SagaMetrics:

Metric	Type	Description
persistence.store.count	Counter	Single write operations
persistence.store.batch.count	Counter	Batch write operations
persistence.store.batch.entries	Counter	Total entries across all batches
persistence.load.count	Counter	Load operations (cache misses)
persistence.load.miss	Counter	Load misses (not in DB either)
persistence.delete.count	Counter	Delete operations
persistence.errors	Counter	Errors by operation
persistence.store.duration	Timer	Write latency (p50/p95/p99)
persistence.load.duration	Timer	Load latency (p50/p95/p99)

A pre-built Grafana dashboard (persistence-dashboard.json) auto-provisions alongside the existing ones and shows throughput by map, latency percentiles, batch sizes, and error rates.

Metrics are optional — MapStore constructors accept a nullable PersistenceMetrics parameter. No MeterRegistry in the context (unit tests, for instance), no metrics. Nothing breaks.

Zero Code Changes in Services

The design goal I cared about most: enabling persistence shouldn’t require touching business logic. The complete diff in a service’s application.yml:

# Add to any service to enable persistence
framework:
  persistence:
    enabled: true

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/ecommerce
    username: ecommerce
    password: ecommerce

And add framework-postgres as a Maven dependency. That’s it.

The auto-configuration chain handles everything else. PostgresPersistenceAutoConfiguration detects the PostgreSQL driver and creates persistence beans for all four map types. PersistenceAutoConfiguration creates MapStore adapters for event, view, and outbox maps and wires them to the persistence beans. Each service’s config class detects the adapters via @Autowired(required = false) and attaches them to the IMap configurations. DeadLetterQueueAutoConfiguration passes the optional DlqStorePersistence bean directly to HazelcastDeadLetterQueue. Hazelcast handles the write-behind scheduling, batching, and MapLoader callbacks for the MapStore-backed maps.

If framework-postgres isn’t on the classpath, the in-memory fallback kicks in. If framework.persistence.enabled is false (the default), nothing changes at all.

Custom Providers

Swapping PostgreSQL for another database means implementing four interfaces: EventStorePersistence, ViewStorePersistence, OutboxStorePersistence, and DlqStorePersistence. Create an @AutoConfiguration class with @AutoConfigureBefore(PersistenceAutoConfiguration.class), register the beans as @ConditionalOnMissingBean, and add to AutoConfiguration.imports.

The in-memory implementations are about 50 lines each — they serve as a decent reference.

What We Built

Component	Purpose
EventStorePersistence / ViewStorePersistence	Provider-agnostic interfaces (event store, views)
OutboxStorePersistence / DlqStorePersistence	Provider-agnostic interfaces (outbox, DLQ)
PersistableEvent / PersistableView	Portable records (decoupled from GenericRecord)
PersistableOutboxEntry / PersistableDeadLetterEntry	Portable records (outbox, DLQ)
EventStoreMapStore / ViewStoreMapStore / OutboxMapStore	Hazelcast MapStore adapters (write-behind)
GenericRecordJsonConverter	Compact GenericRecord to/from JSON
PostgresEventStorePersistence / PostgresViewStorePersistence	PostgreSQL implementation (events, views)
PostgresOutboxStorePersistence / PostgresDlqStorePersistence	PostgreSQL implementation (outbox, DLQ)
InMemory*Persistence (x4)	Development/test fallback for all map types
PersistenceProperties	Spring Boot configuration (write-delay, batch size, eviction)
PersistenceMetrics	Micrometer counters and timers
PersistenceAutoConfiguration	Auto-wiring with fallback chain

Configuration Reference

framework:
  persistence:
    enabled: true                  # Master switch (default: false)
    write-delay-seconds: 5         # Batch window (default: 5)
    write-batch-size: 100          # Max entries per batch (default: 100)
    write-coalescing: false        # Coalesce writes (default: false)
    initial-load-mode: LAZY        # LAZY or EAGER (default: LAZY)
    event-store-eviction:
      enabled: true                # Enable eviction (default: true)
      max-size: 10000              # Max entries per node
      max-size-policy: PER_NODE
      eviction-policy: LRU
      max-idle-seconds: 0          # 0 = no idle eviction
    view-store-eviction:
      enabled: true
      max-size: 10000
      max-size-policy: PER_NODE
      eviction-policy: LRU
      max-idle-seconds: 3600       # Evict idle views after 1 hour

The framework now has a complete data lifecycle: events created in-memory for speed, persisted to PostgreSQL for durability, evicted when memory is constrained, reloaded on demand. The in-memory event sourcing performance is unchanged — PostgreSQL is strictly write-behind, never in the hot path.
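To make the write-delay, batch-size, and coalescing settings concrete, here is a minimal plain-Java sketch of a write-behind buffer. It is not the framework's code and not Hazelcast's implementation: WriteBehindBuffer and its methods are hypothetical names, the store callback stands in for PostgreSQL, and flush() is called by hand where a real system would fire it on a timer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration; not the framework's or Hazelcast's implementation.
// Not thread-safe: a real write-behind queue needs synchronization.
final class WriteBehindBuffer {
    private final int batchSize;      // plays the role of write-batch-size
    private final boolean coalescing; // plays the role of write-coalescing
    private final Consumer<List<Map.Entry<String, String>>> store; // stands in for the database
    private final List<Map.Entry<String, String>> pending = new ArrayList<>();

    WriteBehindBuffer(int batchSize, boolean coalescing,
                      Consumer<List<Map.Entry<String, String>>> store) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.coalescing = coalescing;
        this.store = store;
    }

    // Hot path: queue the write in memory and return immediately.
    void put(String key, String value) {
        pending.add(Map.entry(key, value));
    }

    // Stand-in for the timer that fires after the write delay.
    // Returns the number of batches handed to the store.
    int flush() {
        List<Map.Entry<String, String>> toWrite = new ArrayList<>(pending);
        pending.clear();
        if (coalescing) {
            // Keep only the latest value per key within this window.
            Map<String, String> latest = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : toWrite) {
                latest.put(e.getKey(), e.getValue());
            }
            toWrite = new ArrayList<>();
            for (Map.Entry<String, String> e : latest.entrySet()) {
                toWrite.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        int batches = 0;
        for (int i = 0; i < toWrite.size(); i += batchSize) {
            int end = Math.min(i + batchSize, toWrite.size());
            store.accept(List.copyOf(toWrite.subList(i, end)));
            batches++;
        }
        return batches;
    }
}
```

With coalescing on and a batch size of 2, four puts to keys a, b, a, c collapse into three writes delivered in two batches. With coalescing off, all four writes reach the store, which is the behavior you want when every version matters.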

The Persistence Guide has the full reference including PostgreSQL setup, custom provider implementation, eviction tuning, and troubleshooting.

Previous: Saga Orchestration vs Choreography on Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 13, 2026

Claude Code Memory: Context, CLAUDE.md, and When to Clear
Part 3 of a series on what I learned shipping BaseballScorer. Part 1 was the arc; Part 2 was the release workflow and the skills that mechanize it. This one is about memory — and more broadly, about the question every Claude Code user ends up arguing about: what goes in the context window, what goes in CLAUDE.md, what goes in persistent memory, and when to clear the whole thing and start fresh.

In late June, Apple’s upload API lied to me. I ran my TestFlight release lane, the build uploaded, and then fastlane reported a failure — a 500 “internal server error” from something called ASSET_SPI. The natural move is to retry the upload. I did, through Apple’s Transporter app, and Apple rejected the retry as a duplicate: the build was already there. The 500 hadn’t been the upload failing. It was Apple’s status check failing after the upload had already succeeded. The error was, not to put too fine a point on it, a lie — and figuring that out cost me a chunk of an evening.

Here’s the part that matters for this post. Two weeks and three releases later, that lesson was still operative. Every subsequent release, the assistant flagged it unprompted: if the lane reports an ASSET_SPI 500, don’t re-upload — verify whether the build actually landed first. I never re-explained it. I never re-derived it. A war story from June was still standing guard in July, across dozens of fresh sessions, each of which started — as every Claude session starts — knowing absolutely nothing about me or my project.

That’s what a memory system buys you. But “use memory” is not actually the interesting advice, because every AI-assisted developer I talk to is wrestling with a more tangled set of questions: How often should I clear my context? Does compaction make the model dumber? Should CLAUDE.md be lean or loaded? What’s the actual difference between telling Claude something in the prompt, putting it in CLAUDE.md, and saving it to memory? People have strong opinions about all of these, usually derived from one bad experience and generalized into doctrine.

So this post is my attempt at a working mental model — the one that’s held up across three-plus months and five TestFlight-and-App-Store releases of BaseballScorer. It’s not doctrine. But it’s been load-tested.

The memory hierarchy

If you’ve been an engineer for more than about a week, you already know this shape: registers, cache, RAM, disk. Small-fast-expensive at the top, big-slow-cheap at the bottom, and the whole game is putting each piece of data in the right tier.

Working with Claude Code has exactly this structure. Four tiers:

Tier 1: The conversation context. This is working memory — everything said and done in the current session, including the contents of every file Claude has read. It’s the most powerful tier, because everything in it directly shapes what the model does next. It’s also finite, expensive, and it decays (more on compaction in a minute). Crucially, influence cuts both ways: stale or wrong material in context doesn’t sit there neutrally. It competes with the truth.

Tier 2: CLAUDE.md. Standing orders. This file is loaded at the start of every session, which makes it the most expensive durable real estate you own — every line you put here is a line Claude reads before every single task, forever. It’s also checked into the repo, which turns out to matter more than it first appears.

Tier 3: Persistent memory. The judgment journal. In my setup this is a directory of small markdown files plus an index — the index is loaded every session (like CLAUDE.md, but for accumulated lessons rather than standing orders), and the detail files are pulled in only when relevant. This is where the ASSET_SPI story lives.

Tier 4: The repo itself. Ground truth. The code, the docs/ directory, the git history, the test suite. Effectively unlimited, fully durable, shared with every collaborator — and, critically, verifiable. Claude can read it fresh anytime and trust what it finds, because it’s not a note about the code; it is the code.

And the rule that organizes everything — if you take one sentence from this post, take this one:

Push every fact down to the cheapest tier that preserves it, and treat the conversation as disposable.

If losing your context window right now would hurt, something important is living in the wrong tier. The conversation is where work happens, not where knowledge lives. The moment something in a session turns out to be durably true — a gotcha, a decision, a preference — it should flow downward: into memory, into CLAUDE.md, into a doc in the repo, wherever it belongs. What remains in the conversation should be only the work in progress.

Baseball version, since this is nominally a baseball-app series: the conversation is what’s in the scorer’s head during the play. CLAUDE.md is the ground-rules card taped inside the scorebook. Memory is the scorebook itself. The repo is the rulebook and the league’s official records. Nobody tries to keep the whole season in their head, and nobody should have to reread the rulebook to remember that the ballpark has a short porch in right.

With the model in place, let’s take the contested questions one at a time.

When should you clear the context?

Liberally, and specifically: between unrelated tasks.

The instinct to preserve a long-running conversation comes from a reasonable place — the model feels smarter mid-session, because it has all that context. And it genuinely is, while the context is relevant. The problem is what happens when you pivot. Finish a gnarly print-layout investigation, then start a networking feature in the same session, and all that layout reasoning is still sitting in working memory. It isn’t neutral filler. It’s noise with authority — hundreds of lines of intermediate hypotheses, half of which were wrong (that’s what investigation is), all still whispering to the model while it tries to think about something else.

Old context doesn’t just waste space. Wrong-but-confident material in context is precisely the raw ingredient hallucinations are made of. The model has no typographic marker distinguishing “conclusion we verified” from “hypothesis we abandoned twenty minutes ago”; both are just tokens it once said.

The discipline that makes clearing cheap is the push-down rule. When the print investigation concluded, the conclusion — “the scorecard print is height-bound, not width-bound; future size complaints should target row heights, not column widths” — went into memory. Two sentences. The eight hundred lines of measurement and dead ends that produced those two sentences got thrown away with the session, unmourned. Next time print size comes up, the two sentences come back and the dead ends don’t. That’s not lost information; that’s distilled information.

If clearing your context feels scary, that fear is diagnostic. It means knowledge is trapped in tier 1 that belongs in tier 3 or 4. Fix the filing, and the fear goes away.

Does compaction hurt accuracy?

Some. Here’s the mechanism, because knowing why tells you what to do about it.

When a session runs long, the harness compacts it: older conversation gets replaced by a summary. Summarization is lossy in a very particular way — it preserves narrative and drops precision. “We fixed the auto-advance bug and merged to main” survives compaction beautifully. The exact tag name, the specific line number, the precise flag that carried the fix — those are exactly the details a summary rounds off.

So the practical rule: after a compaction, trust the story, re-verify the specifics. If post-compaction work depends on an exact value — a version number, a build setting, a function signature — the move is to look it up fresh from the repo (tier 4), not to trust the summary’s recollection of it. Ground truth is one file-read away, and unlike the summary, it can’t have rounded anything off.

I can offer this series itself as evidence. These posts have been written across many sessions with the same assistant, through multiple compactions, spanning weeks of feature work in between. The continuity you’re reading — callbacks to Part 2’s war stories, the running motifs — survived not because the context window is heroic but because everything load-bearing lives in files: the draft posts themselves, a memory note tracking the series plan, the repo’s docs. The conversations were disposable, so losing detail from them cost nothing. The hierarchy is what makes compaction survivable, the same way it makes clearing safe. They’re the same insurance policy.

How much belongs in CLAUDE.md?

Less than you’re putting there, probably — but the reason matters more than the rule.

Every line of CLAUDE.md is read at the start of every session, before every task, for the life of the project. That’s its superpower and its cost. The budget question for any candidate line is: does this change Claude’s behavior often enough to justify being read every single time?

Things that clear that bar, from my actual file: the exact build and test commands (with the environment-variable gotcha that makes them work on my machine); the instruction to read the workflow doc before non-trivial work; the warning to never edit the Xcode project file directly while Xcode is open, because that way lies corruption; a note that a particular category of tooling is unreliable for builds so use the command line instead. Every one of those redirects behavior on a large fraction of tasks. They’ve each paid their rent many times over.

Things that don’t clear the bar: architecture narratives, feature history, aspirational coding standards nobody consults, and anything that reads like documentation. The tell is exactly that — if a section reads like documentation, it is documentation, and it belongs in docs/ with a pointer. My CLAUDE.md doesn’t contain my branching and release policy; it contains one line saying “read docs/workflow.md before starting non-trivial work.” The policy lives in tier 4, where it’s versioned, diffable, and readable by humans too. CLAUDE.md just makes sure Claude knows the pointer exists.

So in the great “edit down vs. fill up” debate: edit down, but not out of minimalist aesthetics — out of budget discipline. It’s the most expensive real estate you own. Spend it on behavior, link to everything else.

Memory vs. CLAUDE.md vs. the prompt

This one has the cleanest answer of the bunch, and it comes down to scope and authorship.

CLAUDE.md is checked into the repo. That makes it true-for-anyone: any collaborator, any future contributor, any other agent that clones the project gets the same standing orders. It describes how to work in this codebase. It’s also curated deliberately — you edit it the way you edit code, on purpose, in commits.

Memory is specific to a collaboration. Mine holds things that would be presumptuous or meaningless in a checked-in file: my preferences (I want a high bar for what earns a point release; I’m skeptical of elaborate persona prompts), corrections I’ve issued and why, the current state of in-flight work (“build 31 is on TestFlight awaiting tester assignment”), lessons that encode judgment rather than procedure. It accrues conversationally — “remember this” mid-session — rather than being edited like a source file. If CLAUDE.md is the ground-rules card, memory is the relationship.

The prompt is for this task only. Anything you find yourself typing into prompts repeatedly is a filing error — it’s a durable fact living in the most ephemeral tier, at the cost of your typing it forever. Promote it: repo-truths into CLAUDE.md, collaboration-truths into memory.

What earns a memory slot? After three months of practice, three categories carry nearly all the value:
1. Corrections, saved with the why. Not “don’t edit project.pbxproj directly” but “don’t edit it while Xcode is open, because external edits can corrupt Xcode’s in-memory state — ask me to make the change in the Xcode UI instead.” The why is what lets the lesson generalize instead of becoming a cargo-cult rule.
2. Validated approaches. When something works and we confirm it worked, that’s as valuable as a correction. The best example from this project: Siri integration silently failed with one Apple API pattern and worked with another (Part 2 readers will remember the AppEnum saga). The memory doesn’t just say which one won; it says what the failure looked like, so the next occurrence gets recognized in minutes instead of hours.
3. Project state that isn’t in the code. What shipped in which build, what’s awaiting whose decision, what the tester feedback said. Git knows what changed; it doesn’t know what we’re waiting on.
And one anti-category: never save what the repo already records. A memory that duplicates the code is a stale copy waiting to mislead. If Claude can look it up in tier 4, it should — which brings us to the sharpest knife in the drawer.

Memory is not live state

Here’s the discipline that separates a memory system that compounds from one that slowly poisons you: a memory is a point-in-time observation, not a fact about the present.

Code moves. Files get renamed, functions get refactored, flags get removed. A memory that says “the fix is the flag on line 600 of such-and-such service” was true the day it was written and gets falser every week. The rule we run: when a memory names a file, a function, a line, a setting — verify it against the current repo before acting on it. The memory’s job is to point; the repo’s job is to be true.

This is the same principle as the compaction rule, and it’s worth saying why: stale information is worse than no information, because it arrives wearing the costume of authority. A model with no memory of your build system will go read the config and get it right. A model with a confident eight-week-old memory of your build system may not think to check. The failure mode of memory isn’t forgetting — it’s remembering wrong, fluently.

Two small hygiene habits fall out of this. First, absolute dates: a memory that says “last Tuesday” is gibberish in a month, so relative time gets converted to real dates at save time. Second, aggressive pruning: when a memory turns out to be wrong or obsolete, it gets deleted, not annotated. Memory is a working set, not an archive — the archive is git.

The payoff: judgment that compounds

Part 2 argued that skills turn workflows into things that happen the same way every time. Memory does the same thing one level up: it makes judgment repeatable. Every war story costs you once and then pays dividends forever — but only if the distillation is good. Three examples from just the past two weeks of BaseballScorer work, because recency is the point:

The ASSET_SPI lie you already know. One bad evening in June; every release since has carried the antidote in its pocket.

The print investigation ended with a two-sentence memory — “height-bound, not width-bound; target row heights” — that converts every future “can the print be bigger?” request from an afternoon of measurement into a thirty-second answer.

Best of all, the batting-around bug. A live game exposed a display bug: when a team bats around, a player can reach base twice in one inning, and any code that matched events to players without also checking sequence mixed the two trips together. We fixed the two places it bit us. But the memory doesn’t record the fix — the commit records the fix. The memory records the pattern: any player-keyed scan over inning events breaks under batting-around unless it’s sequence-bounded. That’s a lesson about a whole class of latent bugs, some of which probably exist in code we haven’t stressed yet. When one surfaces next April, the diagnosis is pre-loaded.

That’s the compounding: fixes accumulate in the repo, but pattern recognition accumulates in memory. One is what happened; the other is what to watch for.

Tutorial mode, briefly

The prescriptions, in the spirit of the previous posts:
- Treat the conversation as disposable, and act accordingly. Distill conclusions downward the moment they’re conclusions. Then clear without fear, especially between unrelated tasks.
- After compaction, trust the narrative and re-verify the numbers. Exact values should come from the repo, not from a summary’s memory of them.
- Budget CLAUDE.md like the expensive real estate it is. Behavior-changing lines only; anything that reads like documentation moves to docs/ and leaves a pointer.
- Scope decides the tier. True for anyone in the repo → CLAUDE.md. True for this collaboration → memory. True for this task → the prompt. Typing it repeatedly → you’ve filed it wrong.
- Save the why with every correction, and save validations, not just failures. Both halves of the feedback signal matter.
- Verify memories before acting on them. Point-in-time observations, not live state. Stale-but-confident is the failure mode.
- Prune as aggressively as you save. Wrong memories don’t age into harmlessness; they age into ambushes.

What’s next

The final post in this series is the one I most wish had existed when I started: moving from standalone Claude Code in a terminal to the Xcode-integrated version — what’s different, what’s missing, what to do instead. If this post was about where knowledge should live, that one is about where the assistant lives, and it turns out the answer changes more than you’d expect.

That’s where we’ll leave things for today.

Part of an ongoing series at Nodes and Edges. The app is on the App Store, and the companion scoring guide lives at scoring.theyawns.com.

July 8, 2026

Saga Orchestration vs Choreography on Hazelcast

Part 10 in the “Building Event-Driven Microservices with Hazelcast” series

Back in Part 4, we built a choreographed saga for order fulfillment. Four services — Order, Inventory, Payment, and Account — coordinate through Hazelcast ITopic events. Each service reacts independently, no central coordinator. The flow is implicit, spread across three saga listeners, a compensation registry, and a timeout detector.

That works. It works well, actually, for loosely coupled flows where services don’t need to know about each other. But I kept running into the same questions: what if you need the whole saga visible in one place? What if you need per-step timeout and retry? What if the caller wants to wait for the saga to finish before responding?

So the framework now supports a second saga pattern: the orchestrated saga. This post compares both, walks through the implementation, and shows how they run side by side in the same system.

When to Use Which

Neither pattern wins across the board. It depends on what you need:

Requirement	Choreography	Orchestration
Services should be fully decoupled	Best
Need the whole flow in one readable file		Best
Caller needs a synchronous response		Best
High throughput (thousands of sagas/sec)	Best
Per-step timeout and retry		Best
Services evolve independently	Best
Complex branching or conditional logic		Best
No single point of failure	Best

Choreography is the better fit when services publish events for many consumers, not just one saga. Adding a new saga consumer doesn’t require changing any existing service — you just stand up a new listener.

Orchestration is the better fit when the saga is a well-defined workflow with a clear owner and the caller (a REST endpoint, typically) wants to return the result directly. Order fulfillment is a textbook example.

Architecture Comparison

Choreographed Flow

Choreographed saga flow: the caller receives 202 Accepted immediately from the Order Service, then OrderCreated, StockReserved, and PaymentProcessed events propagate as asynchronous Hazelcast ITopic messages through Inventory, Payment, and Account services with no central coordinator

Every arrow is an asynchronous event on the shared Hazelcast cluster. The caller gets back a 202 Accepted immediately. The saga completes whenever it completes.

Orchestrated Flow

Orchestrated saga flow: the caller invokes the Saga Orchestrator and waits for the final result while the orchestrator makes synchronous HTTP calls for CreateOrder, ReserveStock, ProcessPayment, and ConfirmOrder to the Order, Inventory, Payment, and Account services

Every arrow is a synchronous HTTP call. The orchestrator waits for each step to finish before moving to the next. The caller gets the final result — success or failure — in the response.

The visual difference tells you most of what you need to know. Choreography is a chain. Orchestration is a hub with spokes.

Implementation Comparison

Choreography: Saga Listeners

In the choreographed pattern, each service has a listener subscribed to events on the shared Hazelcast cluster:

@Component
public class InventorySagaListener {

    public InventorySagaListener(
            @Qualifier("hazelcastClient") HazelcastInstance hazelcast,
            InventoryService inventoryService,
            SagaStateStore sagaStateStore) {

        ITopic<GenericRecord> topic = hazelcast.getTopic("OrderCreated");
        topic.addMessageListener(message -> {
            GenericRecord event = message.getMessageObject();
            String sagaId = event.getString("sagaId");

            // Guard: only process OrderFulfillment sagas
            if (!"OrderFulfillment".equals(event.getString("sagaType"))) return;

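            // Perform local action. productId, quantity, and the remaining
            // arguments are elided in this excerpt.
            inventoryService.reserveStockForSaga(productId, quantity, ...);

            // Update saga state and publish next event. nextEvent (construction
            // elided in this excerpt) carries the StockReserved payload.
            sagaStateStore.updateOrAddStep(sagaId, 1, StepStatus.COMPLETED);
            hazelcast.getTopic("StockReserved").publish(nextEvent);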
        });
    }
}

Three listeners across three services, wired together only by event names. To understand the full flow, you read code in three different modules. Compensation is handled by a CompensationRegistry that maps forward events to their compensating counterparts:

registry.register("OrderCreated", "OrderCancelled", "order-service");
registry.register("StockReserved", "StockReleased", "inventory-service");
registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");

Orchestration: SagaDefinition Builder

The orchestrated version puts the entire saga in one place:

@Component
public class OrderFulfillmentSagaFactory {

    public SagaDefinition create() {
        return SagaDefinition.builder()
                .name("OrderFulfillmentOrchestrated")

                .step("CreateOrder")
                    .action(this::createOrderAction)
                    .compensation(this::createOrderCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ReserveStock")
                    .action(this::reserveStockAction)
                    .compensation(this::reserveStockCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ProcessPayment")
                    .action(this::processPaymentAction)
                    .compensation(this::processPaymentCompensation)
                    .timeout(Duration.ofSeconds(15))
                    .build()

                .step("ConfirmOrder")
                    .action(this::confirmOrderAction)
                    .noCompensation()
                    .timeout(Duration.ofSeconds(10))
                    .build()

                .sagaTimeout(Duration.ofSeconds(60))
                .build();
    }
}

Four steps, forward actions, compensations, timeouts — all readable in one file. You pay for that readability: the Order Service now has direct knowledge of the Inventory and Payment services’ HTTP endpoints.

The Orchestrator State Machine

HazelcastSagaOrchestrator is the engine that executes a SagaDefinition. The execution flow:

HazelcastSagaOrchestrator execution flow: start records the saga and schedules a 60-second timeout, then each step runs its action under a 15-second timeout, branching to SUCCESS which merges context and executes the next step, FAILURE after retries which compensates in reverse order, or TIMEOUT which compensates in reverse order

Internally, a SagaExecution instance tracks the running state: current step index, completed step names, an AtomicBoolean for compensation (preventing a race between step failure and the saga-level timeout firing at the same moment), and step start timestamps for duration metrics.
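The AtomicBoolean guard is worth seeing in isolation. The sketch below assumes nothing about SagaExecution's real fields: CompensationGuardSketch, tryStartCompensation, and startedBy are hypothetical names, and the only point is that compareAndSet lets exactly one of the two competing paths start compensation.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of a compare-and-set guard; not the framework's SagaExecution.
final class CompensationGuardSketch {
    private final AtomicBoolean compensating = new AtomicBoolean(false);
    private final AtomicReference<String> startedBy = new AtomicReference<>();

    // Called by both the step-failure path and the saga-timeout path.
    // Only the first caller flips the flag and gets true back.
    boolean tryStartCompensation(String trigger) {
        if (!compensating.compareAndSet(false, true)) {
            return false; // the other path already started compensation
        }
        startedBy.set(trigger);
        return true;
    }

    // Which path won the race, or null if compensation has not started.
    String startedBy() {
        return startedBy.get();
    }
}
```

A caller that gets false back simply returns. Without the guard, a step failure and a saga timeout landing in the same instant could each run the compensations, undoing the same work twice.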

Per-Step Retry

Each SagaStep can configure maxRetries and retryDelay. When a step fails, the orchestrator checks if retries remain. If so, it waits retryDelay milliseconds and re-executes. If not, compensation kicks in.
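As a rough sketch of that retry loop, assuming a step action can be reduced to a success/failure boolean: StepRetrySketch and executeWithRetry are hypothetical names, not the orchestrator's API, and the real engine runs steps asynchronously where this one blocks the calling thread.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of per-step retry; not the framework's orchestrator.
final class StepRetrySketch {

    // Runs the action once, then up to maxRetries more times.
    // Returns the attempt number that succeeded, or -1 once retries are
    // exhausted (the point where compensation would take over).
    static int executeWithRetry(BooleanSupplier action, int maxRetries, long retryDelayMillis) {
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            if (action.getAsBoolean()) {
                return attempt;
            }
            if (attempt <= maxRetries) {
                pause(retryDelayMillis);
            }
        }
        return -1;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

With maxRetries set to 2, a step that fails twice and then succeeds returns on the third attempt. A step that never succeeds returns -1 after three attempts in total.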

This is separate from the Resilience4j circuit breakers that the choreographed saga listeners use. Different communication styles, different retry mechanisms.

Why HTTP Instead of ITopic?

You might wonder why the orchestrator makes HTTP calls to the other services instead of publishing events on Hazelcast ITopic.

Two reasons.

First, request-response semantics. The orchestrator needs to know whether each step succeeded before proceeding to the next. ITopic is fire-and-forget — there’s no built-in way for a publisher to wait for a consumer’s response. HTTP gives you synchronous request-response for free.

Second, our dual-instance architecture. Each service runs an embedded Hazelcast instance for Jet pipelines and a client to the shared cluster for cross-service events. Jet pipeline lambdas reference service-specific classes that can’t serialize across services — that’s the whole reason for the dual-instance design (see Part 5 for where this architecture first bit us). HTTP sidesteps Hazelcast serialization entirely. Each service processes the request in its own JVM with full access to its own classes.

The SagaServiceClient wraps these calls:

public class SagaServiceClient implements SagaServiceClientOperations {

    public OrchestratedStepResponse reserveStock(
            String productId, int quantity, String orderId) {
        // POST /api/saga/inventory/reserve-stock
        return restTemplate.postForObject(
                inventoryServiceUrl + "/api/saga/inventory/reserve-stock",
                request, OrchestratedStepResponse.class);
    }

    public OrchestratedStepResponse processPayment(
            String orderId, String customerId,
            double amount, String currency, String method) {
        // POST /api/saga/payment/process
        return restTemplate.postForObject(
                paymentServiceUrl + "/api/saga/payment/process",
                request, OrchestratedStepResponse.class);
    }
}

Each remote service exposes dedicated saga endpoints (like /api/saga/inventory/reserve-stock) that return an OrchestratedStepResponse — a success/failure envelope. The SagaServiceClient implements the SagaServiceClientOperations interface, which exists so Mockito can mock it on Java 25. (Mockito’s inline mock maker can’t mock concrete classes there. We hit this in several places — extract an interface, move on.)

Compensation: Two Approaches

Choreography: Event-Based

When a step fails, the SagaCompensator looks up the CompensationRegistry and publishes compensation events via ITopic.
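A minimal sketch of that lookup-and-publish step, in plain Java. It mirrors the three-argument register calls shown earlier, but CompensationSketch is a hypothetical class rather than the framework's CompensationRegistry or SagaCompensator, the publisher callback stands in for an ITopic publish, and walking the completed events in reverse is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical illustration; not the framework's CompensationRegistry or SagaCompensator.
final class CompensationSketch {
    record Compensation(String eventName, String owningService) {}

    private final Map<String, Compensation> registry = new HashMap<>();

    // Same argument order as the register(...) calls shown earlier.
    void register(String forwardEvent, String compensatingEvent, String service) {
        registry.put(forwardEvent, new Compensation(compensatingEvent, service));
    }

    // publisher receives (topicName, sagaId) and stands in for an ITopic publish.
    // Returns the compensation events published, in publish order.
    List<String> compensate(String sagaId, List<String> completedEvents,
                            BiConsumer<String, String> publisher) {
        List<String> published = new ArrayList<>();
        // Assumption: newest completed step is compensated first.
        for (int i = completedEvents.size() - 1; i >= 0; i--) {
            Compensation c = registry.get(completedEvents.get(i));
            if (c == null) {
                continue; // nothing registered, nothing to undo
            }
            publisher.accept(c.eventName(), sagaId);
            published.add(c.eventName());
        }
        return published;
    }
}
```

For a saga that completed OrderCreated and StockReserved before payment failed, this publishes StockReleased and then OrderCancelled.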

Each service processes its own compensation event independently. The SagaTimeoutDetector can also trigger this if a saga exceeds its deadline.

Orchestration: Lambda-Based

When a step fails, the orchestrator walks completed steps in reverse and executes their compensation lambdas directly.
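The reverse walk can be sketched with a stack of completed steps. This is an illustration under simplifying assumptions, not HazelcastSagaOrchestrator itself: ReverseCompensationSketch, Step, and Outcome are hypothetical names, actions are reduced to a boolean result, and everything runs synchronously on one thread with no timeouts or retries.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical illustration; not the framework's orchestrator.
final class ReverseCompensationSketch {
    enum Outcome { COMPLETED, COMPENSATED, FAILED }

    // compensation may be null, mirroring a step declared with no compensation.
    record Step(String name, BooleanSupplier action, Runnable compensation) {}

    static Outcome run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            if (step.action().getAsBoolean()) {
                completed.push(step); // most recent step ends up on top
                continue;
            }
            // The step failed: undo completed steps, newest first.
            while (!completed.isEmpty()) {
                Step done = completed.pop();
                if (done.compensation() == null) {
                    continue;
                }
                try {
                    done.compensation().run();
                    log.add("undo " + done.name());
                } catch (RuntimeException e) {
                    return Outcome.FAILED; // a compensation failed: needs manual attention
                }
            }
            return Outcome.COMPENSATED;
        }
        return Outcome.COMPLETED;
    }
}
```

If ProcessPayment fails after CreateOrder and ReserveStock succeeded, the log reads "undo ReserveStock" and then "undo CreateOrder", and the outcome is COMPENSATED. The failed step itself is not compensated, since it never completed.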

No events, no registry. The compensation logic sits right next to the forward action in the SagaDefinition. The final step (ConfirmOrder) uses .noCompensation() — there’s nothing to undo once you’ve confirmed.

If a compensation step itself fails, the orchestrator marks the saga as FAILED rather than COMPENSATED. That means manual intervention. It’s not a situation you want, but at least you know about it immediately rather than discovering it later in an audit.

Running Both Simultaneously

Both patterns coexist in the same system. No interference.

Pattern	Saga Type	REST Endpoint
Choreography	OrderFulfillment	POST /api/orders
Orchestration	OrderFulfillmentOrchestrated	POST /api/orders/orchestrated

The key to coexistence is the sagaType field. Choreographed saga listeners filter on it — when the orchestrated flow creates an OrderCreated event, the InventorySagaListener ignores it because the type is “OrderFulfillmentOrchestrated”, not “OrderFulfillment”.

Both patterns write to the same SagaStateStore (a Hazelcast IMap), so you can query across both or filter:

# All sagas
curl http://localhost:8083/api/sagas

# Only choreographed (URL quoted so the shell does not glob the "?")
curl "http://localhost:8083/api/sagas?type=OrderFulfillment"

# Only orchestrated
curl "http://localhost:8083/api/sagas?type=OrderFulfillmentOrchestrated"

The MCP list_sagas tool supports the same filter:

list_sagas(type="OrderFulfillmentOrchestrated", status="COMPLETED")

Observability

The orchestrator records a saga.step.duration timer for every step:

private void recordStepDuration(SagaExecution exec, String stepName) {
    if (sagaMetrics != null && exec.stepStartedAt != null) {
        Duration stepDuration = Duration.between(exec.stepStartedAt, Instant.now());
        sagaMetrics.recordStepDuration(
                exec.definition.getName(), stepName, stepDuration);
    }
}

The timer is tagged with sagaType and stepName, so you can query individual steps:

# p95 duration for the ProcessPayment step
histogram_quantile(0.95,
  rate(saga_step_duration_seconds_bucket{
    sagaType="OrderFulfillmentOrchestrated",
    stepName="ProcessPayment"
  }[5m]))

Choreographed sagas track overall saga_duration_seconds but not per-step timing — the flow is distributed across services, so there’s no single place to measure each step. That’s a genuine observability trade-off between the two patterns.

The Grafana saga dashboard has a Choreography vs Orchestration row: p50/p95 duration comparison, success/failure rates per pattern, and an orchestrated step breakdown showing where time goes across CreateOrder, ReserveStock, ProcessPayment, and ConfirmOrder.

The MCP run_demo tool includes orchestrated scenarios:

Scenario	Pattern	Expected Outcome
happy_path	Choreographed	Order confirmed via events
payment_failure	Choreographed	Stock released via compensation events
orchestrated_happy_path	Orchestrated	Order confirmed via HTTP, sync response
orchestrated_payment_failure	Orchestrated	Stock released via reverse compensation, 409 response

The Summary Table

Aspect	Choreography	Orchestration
Communication	Hazelcast ITopic events	HTTP calls
Flow definition	Distributed across listeners	Centralized in SagaDefinition
Compensation	CompensationRegistry + event publishing	Reverse-order lambda execution
Timeout handling	SagaTimeoutDetector (scheduled)	Per-step + saga-level timeouts
Response model	Async (202 Accepted)	Sync (201 Created or 409 Conflict)
Retry	Resilience4j (circuit breaker + retry)	Built-in per-step retry with delay
Metrics	Saga-level duration	Saga-level + per-step duration
Saga type	OrderFulfillment	OrderFulfillmentOrchestrated
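
The "reverse-order lambda execution" entry in the Compensation row can be sketched in a few lines. This is an illustration under assumptions, not the framework's SagaDefinition: the Step record, the run method, and the step lambdas are all hypothetical, and retries and timeouts are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of reverse-order compensation; not the framework's SagaDefinition.
final class ReverseCompensationSketch {

    // One saga step: a forward action plus the lambda that undoes it.
    record Step(String name, Runnable action, Runnable compensation) {}

    // Runs steps in order. On the first failure, runs the compensations of the
    // already-completed steps in reverse order and returns false.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step); // most recent step ends up on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }

    // Demo: payment fails, so stock is released before the order is cancelled.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        List<Step> steps = List.of(
                new Step("CreateOrder",
                        () -> log.add("create"), () -> log.add("cancel-order")),
                new Step("ReserveStock",
                        () -> log.add("reserve"), () -> log.add("release-stock")),
                new Step("ProcessPayment",
                        () -> { throw new IllegalStateException("payment declined"); },
                        () -> log.add("refund")));
        run(steps);
        return log;
    }
}
```

Note that the failed step's own compensation never runs in this sketch: only steps that completed are pushed onto the stack, so a declined payment is not "refunded".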

Choreography is still the right default for most event-sourced systems — it preserves service independence and scales naturally. Orchestration earns its place when you need the flow readable in one file, synchronous responses, and fine-grained per-step control.

Both patterns share the same SagaStateStore, the same Grafana dashboards, and the same MCP tools. Pick the right one for each saga, or run both and compare.

Previous: Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

July 6, 2026

Claude Code Custom Skills & fastlane for iOS Releases
Part 2 of a 4-post series on what I learned shipping BaseballScorer. Part 1 was the arc — first commit to App Store in eighteen days. This one is the machinery underneath: the release workflow, and the handful of custom Claude Code skills I actually use.

Here’s a confession to start with, because it sets up everything else in this post: on my pre-retirement Java projects, I had eight specialized Claude agents. I had config-manager and debugging-helper and documentation-writer and framework-developer and performance-optimizer and pipeline-specialist and service-developer and test-writer. Each had its own persona prompt. Each was going to be the expert in its lane. I built a little org chart of robots and felt very clever about it.

In hindsight: overkill. Almost all of it.

On BaseballScorer I have five skills — bug-fix, release, commit, testflight-upload, and security-review — and I’d argue four of them earn their keep and one is borderline. That’s the whole roster. No personas. No “you are a senior iOS architect with twenty years of experience” preamble. The main agent is already a senior iOS architect with twenty years of experience, or near enough; telling it to pretend to be one is theater.

So if you came here for “here are the twelve agents you need to ship an app,” I’m going to disappoint you on purpose. The thesis of this post is that the highest-leverage Claude Code artifacts on a real project aren’t clever — they’re boring. They encode the multi-step, error-prone, do-it-the-same-way-every-time workflows that you’d otherwise wing each Friday and get subtly wrong. A good skill isn’t a personality. It’s a checklist with teeth.

Let me show you what I mean.

What earns a skill

Here’s the test I landed on, after the Java over-engineering taught me what not to do: a workflow earns a skill when it’s multi-step, painful to do by hand, and — this is the one people skip — dangerous to do inconsistently.

That third criterion is where the value actually lives. A one-step task doesn’t need a skill; you just ask. A multi-step task you do once a year doesn’t need a skill; you look it up. But a multi-step task where doing the steps in the wrong order, or skipping one, quietly corrupts something — that’s where you want the steps welded together so neither you nor Claude can freelance them at 11pm.

Releasing a build is the canonical example. So let’s start there.

fastlane: one place that talks to Apple, and only one

Quick detour for anyone who hasn’t met it — and if you’re new to iOS, you probably haven’t: fastlane is an open-source toolkit that automates the tedious parts of shipping an app. Building the archive, signing it, uploading to TestFlight, pushing screenshots and the App Store description, submitting for review — all the steps you’d otherwise do by hand-clicking through Xcode and the App Store Connect website. You write down what you want once, in a file called a Fastfile, as a named recipe (fastlane calls these “lanes”), and then fastlane ios beta runs the whole recipe the same way every time. Think of it as the difference between following a checklist taped to the wall and pressing a single button that does the checklist for you. Until I started this project I didn’t know it existed either; now I’d no sooner ship without it than score a game without a pencil.

With that out of the way, the single most important rule in my release process is this: exactly one thing is allowed to talk to App Store Connect, and that thing is fastlane, driven from a config file in my repo. I do not log into the App Store Connect website and edit the description. I do not tweak the “What’s New” text in the browser because it’s faster. Everything goes through docs/app-store-metadata.md → fastlane → Apple.

I learned this the way you learn most worthwhile rules — by getting burned. Early on, before fastlane owned the metadata, I added a line to my App Store description in the web UI: “no ads, no paywall.” Felt good. Forgot about it. A few weeks later a routine fastlane push regenerated the listing from a doc in my repo — a doc that didn’t have that line — and silently overwrote my edit. No warning, no diff, no “are you sure.” The web edit and the repo doc were two sources of truth, and when two sources of truth disagree, one of them loses, usually the one you forgot you had.

The fix isn’t “remember not to edit the website.” The fix is to make the repo the only source of truth and let the automation be the only writer. Now if I want to change the description, I change the markdown, and fastlane is the courier. There’s exactly one path, so there’s nothing to get out of sync with.

This is a theme, so I’ll name it now and you’ll see it three more times before we’re done: when something bites you because two things can both do the job, the fix is usually to make sure only one thing can.

The beta lane, and the lesson hiding in its control flow

The skill I lean on most is testflight-upload, which runs my fastlane beta lane. On the surface it’s mundane — it bumps the build number, archives, uploads to TestFlight, and tags the release in git. But there’s a design decision baked into the order of those steps that I want to pull out, because it’s the kind of thing that’s invisible when it works and infuriating when it’s done the other way.

My workflow doc has a rule: tag after the upload succeeds, never before. A failed upload should not burn a version tag. That’s easy to say in a doc and easy to violate in practice — you tag, then upload, then the upload dies, and now you’ve got a tag v1.4-b28 pointing at a build that never made it to Apple. Next time you’ll either reuse the tag (don’t) or skip it (now your tags lie about what shipped).

The trick is that in the beta lane, that rule isn’t a comment reminding me to be careful. It’s control flow. The archive and upload_to_testflight calls come first; the commit_version_bump, add_git_tag, and push_git_tags calls come after. If the upload throws, the lane halts — and execution never reaches the tagging code. You cannot burn a tag on a failed upload because the code that creates the tag is downstream of the code that can fail. The “be careful” rule got promoted from a human responsibility to a structural guarantee.

That’s the move I keep coming back to with skills. Anywhere you find yourself writing “remember to X,” ask whether you can instead arrange things so that not doing X is impossible. A reminder is a liability you carry forever. A structural guarantee you build once.

The lane has a couple of other guards in the same spirit. Before it does anything, it checks that you’re on main with a clean working tree (ensure_git_branch, ensure_git_status_clean) — because releasing from a feature branch with uncommitted experiments is a great way to ship something you didn’t mean to. And it auto-generates the TestFlight changelog from git commit messages since the last v* tag, excluding merge commits. That last bit is small but it means my changelog can’t drift from my actual history, because it is my actual history. One source of truth again. You’ll keep seeing it.

The locale crash, or: how an em-dash took down my release

Now for a war story, because abstract principles are easy to nod at and forget.

The first time I ran the beta lane on this machine, it crashed. Not with a useful error — with this:
```
[!] invalid byte sequence in US-ASCII (ArgumentError)
```
followed, a few lines later, by fastlane helpfully informing me that it “requires your locale to be set to UTF-8.” The proximate cause: macOS shells default to a US-ASCII locale, and fastlane’s build step parses xcodebuild’s output as it streams by. The first non-ASCII byte in that stream — and there’s always one eventually — and the parser falls over.

And here’s the part that’s almost too on the nose: the non-ASCII byte that took down my release was, as often as not, an em-dash. In my own App Store metadata. Which I write full of em-dashes, because — well, you’ve read this far, you’ve noticed. My prose style was crashing my deployment pipeline. There’s a metaphor in there about the cost of having a voice, but I’ll leave it alone.

The first fix was the obvious one: set the locale on the command line every time.
```
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 /opt/homebrew/bin/fastlane ios beta
```
That works. But look at what it is — it’s a “remember to X.” Every release, forever, I’d have to remember to prefix the command with the magic words, or watch it die on the first smart quote. That’s exactly the kind of carried liability I just spent a section telling you to eliminate.

So the real fix went into the top of the Fastfile itself:
```
ENV["LANG"] = "en_US.UTF-8" unless ENV["LANG"]&.include?("UTF-8")
ENV["LC_ALL"] = "en_US.UTF-8" unless ENV["LC_ALL"]&.include?("UTF-8")
```
Now the trap is disarmed permanently. The lane sets its own locale before it does anything else, so it doesn’t matter what shell I run it from or whether I remembered the incantation. The gotcha can’t recur because the tool defends itself. Same pattern as the tag-on-success thing: take a rule that lived in my head and move it into a place where it’s enforced by code.

If there’s one transferable habit from this whole post, it’s that one. When you hit an environment gotcha, the fix is not to remember it. The fix is to make it impossible to hit again, in the most permanent place you can put the fix.

The bug-fix skill: branch off the buggy tag

The other skill that genuinely changed how I work is bug-fix, and it’s worth explaining because it encodes a habit that I’m told is less common than I’d assumed from my pre-Claude career.

When a bug ships in, say, build v1.3-b26, the fix does not start from current main. It starts from the tag — git checkout -b bugfix/short-name v1.3-b26. You branch from the code that actually shipped the bug.

Why bother? Two reasons, both about honesty. First, the skill makes you write a failing reproducer test before the fix — a test named test_bugfix_<shortDescription> that demonstrates the bug. And a reproducer test is only trustworthy if it reproduces the bug on the code that shipped it. If you write your test against current main, where the symptom may have already shifted or been accidentally masked by other changes, you might write a test that passes for the wrong reason and convince yourself you’ve fixed something you haven’t. Branching from the tag guarantees the test fails for the real reason before it passes for the real reason.

Second, it gives you a clean merge path forward. The fix and its test travel together from the tag up to main, and the reproducer stays in the suite forever as a tripwire against regressions. I’ve got a couple of recent ones from the 1.4 cycle — an error that was getting credited to the wrong team in the box score, and a runner-advancement display that rendered a hanging “advanced to ” with no destination — and in both cases the value wasn’t just the fix. It was that the test which proves the fix is now a permanent member of a 365-test suite that runs before every release. The bug can come back, but it can’t come back quietly.

The discipline of test-first matters even when — especially when — Claude is the one writing the test. It keeps both of us honest about whether we’re fixing the actual bug or just papering over the symptom that happened to be visible. It’s very easy to make a symptom disappear. It’s harder, and more valuable, to prove you understood it.

When the automation breaks (and it will)

I want to close the practical part with the least glamorous lesson, because it’s the one nobody puts in their “ship with Claude!” thread: every piece of automation needs a documented recovery procedure, and that procedure belongs right next to the automation, written while you’re calm.

Two examples from this project, both real, both having cost me an evening.

The beta lane bumps the build number across every target — app, tests, screenshots — before it archives. If it crashes mid-lane (say, on a locale issue before I’d pinned the fix), it’s already dirtied the project file, and the next run’s clean-tree guard refuses to proceed. The first time this happened I flailed. Now there’s a known dance: revert the app target’s build number in Xcode’s UI (not by hand-editing the project file while Xcode’s open — that way lies corruption), commit the leftover diff with a “cleanup from failed run” note, and re-run. The lane re-bumps everything to the next number. Skipping a build number is fine, by the way — Apple only requires that build numbers go up, not that they’re contiguous. That fact alone would’ve saved me twenty minutes of panic if I’d known it.

The second one is sneakier and I love it as a cautionary tale. On one upload, fastlane reported a flat-out failure: an ASSET_SPI 500, “internal server error,” during the post-upload status check. So I did the natural thing and retried the upload through Transporter — which Apple promptly rejected, because the build was already there. The 500 wasn’t the upload failing. It was Apple’s status-check endpoint failing after the upload had already succeeded. The error message was, not to put too fine a point on it, a lie. The only reason I figured it out is that the duplicate-rejection error (bundle version already used) told the truth that the 500 had obscured.

The lesson there isn’t about fastlane specifically. It’s: don’t trust an error message about a remote system’s state — verify the actual state. Apple told me the upload failed. Apple was wrong. The build was sitting in App Store Connect the whole time. When a distributed system reports a failure, it’s reporting that one call failed, which is not the same as the operation having failed, and the gap between those two things is where you lose evenings if you take the error at face value.

(If that distinction sounds familiar, it’s the same reason “the network is unreliable” is the first hard lesson in distributed systems. A failed acknowledgment doesn’t tell you the work didn’t happen. It tells you that you didn’t hear that it happened. Apple’s 500 was a lost ack, nothing more.)
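
Here is the lost-ack situation as a small Java sketch (Java because that's what the rest of this blog runs on). Everything in it is made up for illustration: RemoteStore, upload, contains, and the build number are hypothetical and have nothing to do with Apple's or fastlane's real APIs. The only point is the shape of the retry decision, which checks the remote state before re-sending.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a remote store whose acknowledgment can be lost.
// None of these names correspond to a real Apple or fastlane API.
final class LostAckSketch {

    static final class RemoteStore {
        private final Set<Integer> builds = new HashSet<>();
        private boolean failNextAck;

        RemoteStore(boolean failNextAck) {
            this.failNextAck = failNextAck;
        }

        // Stores the build first, then may fail while reporting success.
        void upload(int buildNumber) {
            if (!builds.add(buildNumber)) {
                throw new IllegalStateException("bundle version already used");
            }
            if (failNextAck) {
                failNextAck = false;
                throw new RuntimeException("500 internal server error");
            }
        }

        boolean contains(int buildNumber) {
            return builds.contains(buildNumber);
        }
    }

    // Uploads once; on an error, checks the remote state before deciding to retry.
    // Returns the number of upload calls actually made.
    static int uploadVerifyingState(RemoteStore store, int buildNumber) {
        int calls = 1;
        try {
            store.upload(buildNumber);
        } catch (RuntimeException e) {
            if (!store.contains(buildNumber)) { // retry only when the work is truly missing
                calls++;
                store.upload(buildNumber);
            }
        }
        return calls;
    }
}
```

With the ack failure switched on, the sketch makes exactly one upload call and the build is present afterwards, which is the evening I would have liked to have.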

All of this — the recovery dances, the “skip a build number, it’s fine,” the “the 500 is a liar” — lives in a doc in my repo and a couple of memory notes Claude carries between sessions. Which is the natural segue to where we’re headed next.

The actual point

Strip away the war stories and here’s what the five skills and the one config file have in common: none of them make Claude smarter. The model was already plenty smart. What they do is make the process repeatable and the hard-won lessons durable. The locale fix, the tag-on-success ordering, the branch-off-the-tag habit, the single source of truth for metadata — every one of those is a place where a mistake I made once got promoted into something I can’t easily make again.

That’s the unsexy truth about being productive with an AI coding assistant on a real, shipping project. The leverage isn’t in elaborate prompts or a cast of specialized agents with backstories. It’s in noticing which boring workflows are error-prone, encoding them so they happen the same way every time, and turning each war story into a guardrail before you have to fight the same war twice. A skill is just where a hard-won lesson goes to become a habit.

Which raises an obvious question: how does any of that survive across months of work, when each Claude session starts fresh and remembers nothing? How is the lesson from June still there in September? That’s the persistent-memory system, and it’s the subject of the next post — the same idea as this one, lifted up one level, from “make this workflow repeatable” to “make this project’s accumulated judgment repeatable.” That’s where we’ll leave things for today.

Part of an ongoing series at Nodes and Edges. If you’re curious about the app itself, it’s on the App Store, and the companion scoring guide lives at scoring.theyawns.com.
July 3, 2026

Dead Letter Queue + Idempotency: Exactly-Once on Hazelcast

Part 9 in the “Building Event-Driven Microservices with Hazelcast” series

Over the past two articles, we built resilience into both sides of our saga communication. Part 7 added circuit breakers and retry to protect saga listeners against transient failures during event consumption. Part 8 added the transactional outbox to guarantee event delivery from producer to shared cluster.

Two gaps remain.

First: what happens when an event fails processing permanently? The circuit breaker exhausts retries. NonRetryableException gets thrown. The event is gone — all that survives is a log message. There’s no way to inspect what failed, understand why, or retry it later when someone fixes the underlying problem.

Second: what happens when the outbox delivers an event twice? At-least-once delivery means duplicates are possible. Without protection, the Inventory Service might reserve stock twice for the same order. The Payment Service might charge the customer twice.

This article covers two complementary patterns that close these gaps. The dead letter queue captures events that fail consumer-side processing, giving operators a way to inspect, replay, and discard them. The idempotency guard ensures each event is processed exactly once, even if delivered multiple times.

Put them together with the outbox and you get effectively-once semantics — at-least-once delivery on the producer side, exactly-once processing on the consumer side. That’s the gold standard for event-driven systems.

Part 1: Dead Letter Queue

The Problem

Consider this failure sequence in the Inventory Service’s saga listener:

Permanent failure without a dead letter queue: an OrderCreated event arrives, the InventorySagaListener picks it up and calls executeWithResilience, a non-retryable InsufficientStockException is thrown, the circuit breaker records the failure, a ResilienceException propagates, the error is logged, and the event is gone with only a log line surviving

That log message? That’s all you’ve got. In production, recovering from this means searching logs for the event ID, reconstructing the event payload from other sources, manually fixing whatever went wrong, and then figuring out how to re-trigger the saga step.

A dead letter queue captures the failed event — payload, failure reason, saga context, source service, everything — in a durable store that you can actually query and act on.

The DeadLetterEntry

Each DLQ entry preserves the full failure context:

public class DeadLetterEntry {

    private String dlqEntryId;       // UUID — unique DLQ identifier
    private String originalEventId;  // The event that failed
    private String eventType;        // "OrderCreated", "StockReserved", etc.
    private String topicName;        // The ITopic where the event was published
    private GenericRecord eventRecord; // The complete event payload for replay
    private String failureReason;    // Why processing failed
    private Instant failureTimestamp; // When the failure occurred
    private String sourceService;    // Which service failed ("inventory-service")
    private String sagaId;           // Saga context for tracing
    private String correlationId;    // Correlation context for tracing
    private int replayCount;         // How many times this entry has been replayed
    private Status status;           // PENDING, REPLAYED, or DISCARDED

    public enum Status {
        PENDING,    // Awaiting review or replay
        REPLAYED,   // Re-published to original topic
        DISCARDED   // Manually discarded by administrator
    }
}

Construction at the failure site uses a builder:

DeadLetterEntry.builder()
    .originalEventId(eventId)
    .eventType(record.getString("eventType"))
    .topicName("OrderCreated")
    .eventRecord(record)
    .failureReason(error.getMessage())
    .sourceService("inventory-service")
    .sagaId(record.getString("sagaId"))
    .correlationId(record.getString("correlationId"))
    .build();

The eventRecord field is the important one — it holds the complete GenericRecord that was published to the ITopic. When you replay the entry, this exact record gets re-published to the original topic, picking the saga back up where it left off.

The DeadLetterQueueOperations Interface

Same interface-extraction pattern we used for ResilientOperations and ServiceClientOperations (Mockito on Java 25 can’t mock concrete classes, so we keep extracting interfaces — it’s becoming a running theme):

public interface DeadLetterQueueOperations {

    void add(DeadLetterEntry entry);

    List<DeadLetterEntry> list(int limit);

    DeadLetterEntry getEntry(String dlqEntryId);

    void replay(String dlqEntryId);

    void discard(String dlqEntryId);

    long count();
}

HazelcastDeadLetterQueue: IMap-Backed Storage

The implementation stores DLQ entries as Compact-serialized GenericRecords in a Hazelcast IMap — same pattern as the HazelcastOutboxStore:

public class HazelcastDeadLetterQueue implements DeadLetterQueueOperations {

    private static final String SCHEMA_NAME = "DeadLetterEntry";
    private final HazelcastInstance hazelcast;
    private final IMap<String, GenericRecord> dlqMap;
    private final DeadLetterQueueProperties properties;
    private final MeterRegistry meterRegistry;

    public HazelcastDeadLetterQueue(HazelcastInstance hazelcast,
                                     DeadLetterQueueProperties properties,
                                     MeterRegistry meterRegistry) {
        this.hazelcast = hazelcast;
        this.dlqMap = hazelcast.getMap(properties.getMapName());
        // ...
    }
}

The DLQ map lives on the shared cluster (falling back to the embedded instance if there’s no shared cluster), so it’s accessible from any service’s admin endpoint. You don’t need to know which service failed — query the DLQ from anywhere and you’ll see everything.

POJO-to-GenericRecord Conversion

Like the outbox store, conversion happens at the boundary:

static GenericRecord toRecord(final DeadLetterEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("dlqEntryId", entry.getDlqEntryId())
            .setString("originalEventId", entry.getOriginalEventId())
            .setString("eventType", entry.getEventType())
            .setString("topicName", entry.getTopicName())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setString("failureReason", entry.getFailureReason())
            .setInt64("failureTimestamp", entry.getFailureTimestamp().toEpochMilli())
            .setString("sourceService", entry.getSourceService())
            .setString("sagaId", entry.getSagaId())
            .setString("correlationId", entry.getCorrelationId())
            .setInt32("replayCount", entry.getReplayCount())
            .setString("status", entry.getStatus().name())
            .build();
}

Note setGenericRecord(“eventRecord”, …) — Compact serialization handles nested GenericRecords natively. The full event payload comes along for the ride without any special serialization work on our part.

Replay

This is where the DLQ earns its keep. Once you’ve figured out what went wrong and fixed it — restocked inventory, restarted a flaky service, whatever — you replay the entry:

@Override
public void replay(final String dlqEntryId) {
    final GenericRecord record = dlqMap.get(dlqEntryId);
    if (record == null) {
        throw new IllegalArgumentException("DLQ entry not found: " + dlqEntryId);
    }

    final DeadLetterEntry entry = fromRecord(record);

    if (entry.getStatus() != DeadLetterEntry.Status.PENDING) {
        throw new IllegalStateException(
                "Cannot replay entry in status " + entry.getStatus());
    }
    if (entry.getReplayCount() >= properties.getMaxReplayAttempts()) {
        throw new IllegalStateException(
                "Max replay attempts (" + properties.getMaxReplayAttempts() + ") exceeded");
    }

    // Re-publish to the original topic
    final GenericRecord eventRecord = entry.getEventRecord();
    if (eventRecord != null && entry.getTopicName() != null) {
        final ITopic<GenericRecord> topic = hazelcast.getTopic(entry.getTopicName());
        topic.publish(eventRecord);
    }

    // Update entry status
    entry.setReplayCount(entry.getReplayCount() + 1);
    entry.setStatus(DeadLetterEntry.Status.REPLAYED);
    dlqMap.set(dlqEntryId, toRecord(entry));

    meterRegistry.counter("dlq.entries.replayed").increment();
}

A few safety guards here. Only PENDING entries can be replayed — you can’t accidentally replay something that was already replayed or discarded. There’s a configurable max replay count (default 3) to prevent infinite replay loops if the underlying issue isn’t actually fixed. And if the eventRecord is somehow null (shouldn’t happen, but defensive coding), the status updates without attempting a publish.
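
To see those guards interact, here is an in-memory sketch with a HashMap standing in for the IMap and a counter standing in for the topic publish. One thing in it is an assumption of mine rather than something the listing above shows: a hypothetical markFailedAgain method that puts a replayed entry back to PENDING with its replay count preserved. Without some path like that, the status check would always trip before the replay cap could.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the replay guards. A HashMap stands in for the IMap and
// publishing is reduced to a counter. All names here are illustrative.
final class ReplayGuardSketch {

    enum Status { PENDING, REPLAYED, DISCARDED }

    static final class Entry {
        Status status = Status.PENDING;
        int replayCount = 0;
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final int maxReplayAttempts;
    int published = 0;

    ReplayGuardSketch(int maxReplayAttempts) {
        this.maxReplayAttempts = maxReplayAttempts;
    }

    void add(String id) {
        entries.put(id, new Entry());
    }

    // Assumption: a replayed event that fails again returns to PENDING with its
    // count preserved. The original listing does not show this path.
    void markFailedAgain(String id) {
        entries.get(id).status = Status.PENDING;
    }

    void replay(String id) {
        Entry entry = entries.get(id);
        if (entry == null) {
            throw new IllegalArgumentException("DLQ entry not found: " + id);
        }
        if (entry.status != Status.PENDING) {
            throw new IllegalStateException("Cannot replay entry in status " + entry.status);
        }
        if (entry.replayCount >= maxReplayAttempts) {
            throw new IllegalStateException("Max replay attempts exceeded");
        }
        published++;               // stands in for topic.publish(eventRecord)
        entry.replayCount++;
        entry.status = Status.REPLAYED;
    }

    // Convenience wrapper for callers that prefer a boolean over an exception.
    boolean tryReplay(String id) {
        try {
            replay(id);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

With a cap of 2, the first replay succeeds, an immediate second attempt is rejected by the status guard, and after two fail-and-replay rounds the cap rejects the third.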

Monitoring Queue Depth

The count() method uses a Hazelcast predicate to count only PENDING entries:

@Override
public long count() {
    final Collection<GenericRecord> pending = dlqMap.values(
            Predicates.equal("status", DeadLetterEntry.Status.PENDING.name()));
    return pending.size();
}

A DLQ count above zero for more than a few minutes is a flag that something needs attention. Wire this to an alert and you’ll know about failed events before anyone files a ticket.
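
In practice that alert would live in your monitoring stack, but the rule itself is simple enough to sketch in plain Java. This is only the "above zero for longer than a threshold" logic; the class name, the five-minute threshold in the usage note, and the idea of feeding it samples by hand are all assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of "pending count above zero for longer than a threshold".
// Names are illustrative; a real deployment would express this as an alert rule.
final class DlqDepthAlertSketch {

    private final Duration threshold;
    private Instant nonZeroSince; // null while the queue is empty

    DlqDepthAlertSketch(Duration threshold) {
        this.threshold = threshold;
    }

    // Feed one sample of the pending count; returns true when an alert should fire.
    boolean observe(long pendingCount, Instant now) {
        if (pendingCount == 0) {
            nonZeroSince = null; // queue drained, reset the clock
            return false;
        }
        if (nonZeroSince == null) {
            nonZeroSince = now;  // first sample of a new non-empty stretch
        }
        return Duration.between(nonZeroSince, now).compareTo(threshold) >= 0;
    }
}
```

Constructed with Duration.ofMinutes(5), it stays quiet for brief blips, fires once the queue has been non-empty for five minutes, and resets as soon as a sample reads zero.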

Admin REST Endpoints

The DeadLetterQueueController exposes the DLQ through REST:

@RestController
@RequestMapping("/api/admin/dlq")
@Tag(name = "Dead Letter Queue")
public class DeadLetterQueueController {

    @GetMapping
    public ResponseEntity<List<DeadLetterEntry>> list(
            @RequestParam(defaultValue = "20") int limit) {
        return ResponseEntity.ok(deadLetterQueue.list(limit));
    }

    @GetMapping("/count")
    public ResponseEntity<Map<String, Long>> count() {
        return ResponseEntity.ok(Map.of("count", deadLetterQueue.count()));
    }

    @GetMapping("/{id}")
    public ResponseEntity<DeadLetterEntry> getEntry(@PathVariable String id) { ... }

    @PostMapping("/{id}/replay")
    public ResponseEntity<Map<String, String>> replay(@PathVariable String id) { ... }

    @DeleteMapping("/{id}")
    public ResponseEntity<Map<String, String>> discard(@PathVariable String id) { ... }
}

A typical investigation looks like this:

# How many pending entries?
curl http://localhost:8082/api/admin/dlq/count
# {"count": 2}

# What are they?
curl http://localhost:8082/api/admin/dlq
# [{"dlqEntryId":"abc-123", "originalEventId":"evt-456",
#   "eventType":"OrderCreated", "failureReason":"Insufficient stock for product PROD-789",
#   "sourceService":"inventory-service", "status":"PENDING", ...}]

# Get the full details on one
curl http://localhost:8082/api/admin/dlq/abc-123

# Fix the problem (restock inventory), then replay
curl -X POST http://localhost:8082/api/admin/dlq/abc-123/replay
# {"status":"replayed", "dlqEntryId":"abc-123"}

# Or discard if the saga already timed out and compensation ran
curl -X DELETE http://localhost:8082/api/admin/dlq/abc-123
# {"status":"discarded", "dlqEntryId":"abc-123"}

Integration with Saga Listeners

Each saga listener injects the DLQ as an optional dependency:

@Autowired(required = false)
public void setDeadLetterQueue(DeadLetterQueueOperations deadLetterQueue) {
    this.deadLetterQueue = deadLetterQueue;
}

Failed events get routed to the DLQ in the error handler:

private void sendToDeadLetterQueue(GenericRecord record, String topicName, Throwable error) {
    String eventId = record.getString("eventId");
    if (deadLetterQueue != null) {
        try {
            deadLetterQueue.add(DeadLetterEntry.builder()
                    .originalEventId(eventId)
                    .eventType(record.getString("eventType"))
                    .topicName(topicName)
                    .eventRecord(record)
                    .failureReason(error.getMessage())
                    .sourceService("inventory-service")
                    .sagaId(record.getString("sagaId"))
                    .correlationId(record.getString("correlationId"))
                    .build());
            logger.warn("Event {} sent to DLQ after failure: {}", eventId, error.getMessage());
        } catch (Exception dlqError) {
            logger.error("Failed to send event {} to DLQ: {}", eventId, dlqError.getMessage());
        }
    } else {
        // Fallback: existing behavior (log only)
        if (error instanceof ResilienceException) {
            logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
        } else {
            logger.error("Failed to process event: {}", eventId, error);
        }
    }
}

The try/catch around deadLetterQueue.add() is defensive. If the DLQ itself fails — shared cluster unreachable, say — we fall back to logging. The DLQ is best-effort, not a hard requirement. Losing an event and failing to capture it in the DLQ would be truly unlucky, but it shouldn’t bring the service down.

Part 2: Idempotency Guard

The Problem

The transactional outbox gives us at-least-once delivery. Combined with ITopic’s own delivery behavior (listeners that reconnect after a brief disconnection may receive messages again), the same event can arrive at a consumer more than once:

Duplicate delivery sequence: the OutboxPublisher publishes OrderCreated to the shared cluster, which forwards it to the Inventory Listener; the markDelivered call times out, so on the next poll cycle the publisher re-publishes the same event, and without protection the Inventory Listener performs a double stock reservation

Without protection, inventory gets reserved twice. The customer gets charged twice. The order gets confirmed twice. Nobody wants that.

Atomic Check-and-Claim

The fix is Hazelcast’s putIfAbsent — an atomic, cluster-wide check-and-set that ensures each event ID gets processed exactly once:

public class HazelcastIdempotencyGuard implements IdempotencyGuard {

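    // Logger used by tryProcess below (SLF4J is assumed here)
    private static final Logger logger =
            LoggerFactory.getLogger(HazelcastIdempotencyGuard.class);
    private final IMap<String, Long> processedEventsMap;
    private final long ttlMillis;
    private final MeterRegistry meterRegistry;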

    public HazelcastIdempotencyGuard(HazelcastInstance hazelcast,
                                      IdempotencyProperties properties,
                                      MeterRegistry meterRegistry) {
        this.processedEventsMap = hazelcast.getMap(properties.getMapName());
        this.ttlMillis = properties.getTtl().toMillis();
        this.meterRegistry = meterRegistry;
    }

    @Override
    public boolean tryProcess(final String eventId) {
        Long previous = processedEventsMap.putIfAbsent(
                eventId, System.currentTimeMillis(), ttlMillis, TimeUnit.MILLISECONDS);

        boolean firstTime = (previous == null);
        meterRegistry.counter("idempotency.checks",
                "result", firstTime ? "miss" : "hit").increment();

        if (!firstTime) {
            logger.debug("Duplicate event detected: eventId={}", eventId);
        }

        return firstTime;
    }
}

The interface is one method:

public interface IdempotencyGuard {
    boolean tryProcess(String eventId);
}

Returns true if this is the first time the event ID has been seen — go ahead and process it. Returns false if someone already claimed it — skip.

How putIfAbsent Works

IMap.putIfAbsent(key, value, ttl, timeUnit) is atomic. If the key doesn’t exist, it inserts the pair and returns null. If it does exist, it returns the existing value and does nothing. This atomicity holds across cluster members — two listeners on different nodes processing the same event simultaneously will never both get null. Exactly one wins, the other backs off.
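To make the "exactly one wins" claim concrete, here is a minimal single-JVM sketch. It swaps the Hazelcast IMap for a ConcurrentHashMap (which offers the same null-means-you-claimed-it contract, minus the TTL and the cluster), and the class and method names are made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in sketch: ConcurrentHashMap replaces the Hazelcast IMap so this runs
// on a single JVM. ClaimRaceDemo and countWinners are hypothetical names.
final class ClaimRaceDemo {

    /** Races `contenders` threads on one event ID and returns how many saw null. */
    static int countWinners(int contenders) {
        ConcurrentMap<String, Long> processed = new ConcurrentHashMap<>();
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger winners = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(contenders);
        try {
            for (int i = 0; i < contenders; i++) {
                pool.submit(() -> {
                    try {
                        start.await(); // line every thread up behind the same gate
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    // Same contract as in the text: null means "you claimed it"
                    Long previous = processed.putIfAbsent("evt-123", System.currentTimeMillis());
                    if (previous == null) {
                        winners.incrementAndGet();
                    }
                });
            }
            start.countDown(); // open the gate
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("contenders did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("winners = " + countWinners(8)); // prints "winners = 1"
    }
}
```

However many contenders you throw at it, only one of them sees null. Every other thread gets the winner's timestamp back and knows to skip.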

TTL: Forgetting Old Events

The putIfAbsent includes a TTL (default: 1 hour). After that, the event ID is removed from the map, and the same event could theoretically be reprocessed if it somehow arrived again.

Why an hour? It’s a memory management decision. Without a TTL, the processed events map grows forever. With a 1-hour window, we hold at most an hour’s worth of event IDs, which is bounded and predictable. Since our outbox publisher has a 1-second poll interval with 5 retries, duplicates arrive within seconds — an hour of margin is more than sufficient.
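The window behaviour is easier to see with the clock made explicit. This is a hypothetical in-memory stand-in, not the framework's code: Hazelcast evicts expired entries for you, whereas this sketch just compares timestamps on lookup so the expiry can be shown without sleeping for an hour.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-JVM stand-in for the TTL behaviour described above.
// Time is passed in explicitly so expiry can be shown without sleeping.
final class TtlGuardSketch {

    private final Map<String, Long> claimedAt = new HashMap<>();
    private final long ttlMillis;

    TtlGuardSketch(Duration ttl) {
        this.ttlMillis = ttl.toMillis();
    }

    /** Returns true if eventId is unclaimed, or its earlier claim has expired. */
    synchronized boolean tryProcess(String eventId, long nowMillis) {
        Long previous = claimedAt.get(eventId);
        if (previous != null && nowMillis - previous < ttlMillis) {
            return false; // still inside the deduplication window
        }
        claimedAt.put(eventId, nowMillis); // first claim, or re-claim after expiry
        return true;
    }

    public static void main(String[] args) {
        TtlGuardSketch guard = new TtlGuardSketch(Duration.ofHours(1));
        long t0 = 1_738_000_000_000L;
        long fiveSeconds = Duration.ofSeconds(5).toMillis();
        long sixtyOneMinutes = Duration.ofMinutes(61).toMillis();
        System.out.println(guard.tryProcess("evt-abc-123", t0));                   // true
        System.out.println(guard.tryProcess("evt-abc-123", t0 + fiveSeconds));     // false
        System.out.println(guard.tryProcess("evt-abc-123", t0 + sixtyOneMinutes)); // true
    }
}
```

The retry that lands five seconds later is rejected. The same ID turning up 61 minutes later is treated as new, which is the trade the TTL makes.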

Integration with Saga Listeners

Each saga listener checks the guard at the top of its message handler:

class OrderCreatedListener implements MessageListener<GenericRecord> {

    @Override
    public void onMessage(Message<GenericRecord> message) {
        GenericRecord record = message.getMessageObject();

        String eventId = record.getString("eventId");
        if (idempotencyGuard != null && eventId != null
                && !idempotencyGuard.tryProcess(eventId)) {
            logger.debug("Duplicate event {} already processed, skipping", eventId);
            return;
        }

        // ... proceed with normal processing
    }
}

Three conditions, two of them null checks, give graceful degradation: if idempotency isn’t configured, process everything (no deduplication). If the event doesn’t have an ID, skip the check. If tryProcess() returns false, it’s a duplicate: log it at debug level and drop it.
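If the compound condition in the listener is hard to read, it may help to see it unrolled into a standalone predicate. This is a sketch only: shouldProcess is a hypothetical helper, and the guard in main is a plain HashSet standing in for the Hazelcast-backed one.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the listener's guard clause as a standalone predicate.
// The one-method interface mirrors the article; everything else is hypothetical.
final class GuardClauseSketch {

    interface IdempotencyGuard {
        boolean tryProcess(String eventId);
    }

    /** True when the listener should go on to process the event. */
    static boolean shouldProcess(IdempotencyGuard guard, String eventId) {
        if (guard == null) {
            return true; // idempotency not configured: no deduplication
        }
        if (eventId == null) {
            return true; // nothing to deduplicate on
        }
        return guard.tryProcess(eventId); // false means duplicate
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        IdempotencyGuard guard = seen::add; // Set.add returns false for repeats
        System.out.println(shouldProcess(null, "evt-1"));  // true
        System.out.println(shouldProcess(guard, null));    // true
        System.out.println(shouldProcess(guard, "evt-1")); // true
        System.out.println(shouldProcess(guard, "evt-1")); // false
    }
}
```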

The Processed Events Map

The map lives on the shared cluster, so deduplication works across all service instances:

Key	Value	TTL
evt-abc-123	1738000000000 (timestamp)	1 hour
evt-def-456	1738000001000	1 hour
evt-ghi-789	1738000002000	1 hour

The value — a processing timestamp — is purely informational. Only the key’s presence or absence matters for deduplication. But the timestamp is handy for debugging: it tells you exactly when an event was first processed.

How the Three Patterns Work Together

The outbox, DLQ, and idempotency guard form a complete reliability pipeline:

The three patterns working together: on the producer side the EventSourcingController writes to the OUTBOX IMap, the OutboxPublisher polls and publishes to the shared ITopic, then marks the entry delivered; on the consumer side the saga listener checks the IdempotencyGuard (duplicates are skipped), runs executeWithResilience, and on failure routes the event to the framework_DLQ IMap for admin replay or discard

Let’s walk through what happens when things go wrong.

The OrderCreated event comes out of the Jet pipeline and gets written to the outbox. The OutboxPublisher picks it up, publishes to the shared cluster’s OrderCreated ITopic, and tries to mark it DELIVERED. But the markDelivered call times out. Next poll cycle, the publisher re-publishes the same event. Now it’s been delivered twice.

Over on the consumer side, the Inventory Service’s OrderCreatedListener receives both copies. The first call to idempotencyGuard.tryProcess("evt-123") returns true, so the event is processed. The second call returns false: duplicate, skip it. Only one stock reservation happens.

But that first delivery hits a problem: the product is out of stock. InsufficientStockException is non-retryable. The circuit breaker records the failure, ResilienceException propagates up to whenComplete(), and sendToDeadLetterQueue() captures everything — the full event payload, the failure reason, the saga ID, the source service. It’s all sitting in the framework_DLQ IMap, waiting.

An operator (or an LLM, as we’ll see in a moment) checks the DLQ, sees the pending entry, restocks the product, and replays the event. The OrderCreated record gets re-published to the ITopic, the saga picks up, and the order completes.

One wrinkle: the replayed event carries the same eventId as the original. If the 1-hour idempotency TTL hasn’t expired yet, the guard will block it as a duplicate. In practice this isn’t an issue — by the time you’ve investigated the failure, diagnosed the root cause, and fixed it, an hour has usually passed. It’s a deliberate trade-off: short-window deduplication versus immediate replay. We chose deduplication.

Configuration Reference

Dead Letter Queue: framework.dlq.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_DLQ	IMap name on shared cluster
max-replay-attempts	3	Maximum replays before permanent block
entry-ttl	168h	7-day retention for DLQ entries

Idempotency Guard: framework.idempotency.*

Property	Default	Description
enabled	true	Master toggle
map-name	framework_PROCESSED_EVENTS	IMap name on shared cluster
ttl	1h	How long to remember processed event IDs

Metrics

Metric	Type	Description
dlq.entries.added	Counter	Events added to the DLQ
dlq.entries.replayed	Counter	Events replayed from the DLQ
dlq.entries.discarded	Counter	Events discarded from the DLQ
idempotency.checks	Counter (tagged: result=hit|miss)	Deduplication checks

The Complete Resilience Stack

Across Parts 7, 8, and 9, we’ve built five interlocking patterns:

Pattern	Layer	Purpose	Protects Against
Circuit Breaker	Consumer	Automatic service isolation	Cascade failures
Retry + Backoff	Consumer	Transient failure recovery	Network blips, brief outages
Transactional Outbox	Producer	Guaranteed delivery	Shared cluster unavailability
Dead Letter Queue	Consumer	Failure capture and replay	Permanent processing failures
Idempotency Guard	Consumer	Exactly-once processing	Duplicate delivery

They’re all optional — enabled by default, disabled with a single property toggle. They’re all auto-configured by Spring Boot. They all expose Micrometer metrics. And when disabled, the framework falls back to its previous behavior without breaking anything.

Three articles ago, we had a fire-and-forget event pipeline where a network blip could lose an event forever. Now we have guaranteed delivery, deduplication, failure capture, and replay. Same pipeline, five patterns later.

Try It Yourself

The demo script includes a complete DLQ investigation scenario — fault injection, failure capture, investigation, and replay — in 11 guided steps:

./scripts/demo-scenarios.sh 7

That’s the curl-based version. No LLM required.

The AI-Powered Version

This is more fun. Connect the MCP server from Part 6 to your LLM client — Claude Desktop, Claude Code, ChatGPT, whatever you’ve got — and try this prompt:

“Run the DLQ investigation demo — inject a failure, place an order, and show me what’s in the dead letter queue.”

The LLM calls runDemo to set up the scenario, then listDlqEntries and inspectDlqEntry to investigate. It tells you what happened — which event failed, at which service, and why — and suggests a fix. You say “replay it.” It calls replayDlqEntry, the saga completes, and you’ve just done incident response through a conversation.

No curl commands. No JSON parsing. No copy-pasting UUIDs. The LLM handles the plumbing while you make the decisions.

If the LLM already has context from earlier in the session, a shorter version works:

“Run the dlq_investigation demo scenario and tell me what you find.”

Next up: Choreography vs Orchestration: Two Saga Patterns

Previous: Hazelcast Transactional Outbox: Guaranteed Delivery

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 29, 2026

Claude Code for iOS: Shipping a Real App in 18 Days
Part 1 of a 4-post series on what I learned shipping BaseballScorer — from first commit to a usable App Store release in under three weeks, plus everything that’s come after.

I have files on my laptop dated January 7, 2009. They’re the start of an iOS baseball scoring app, written in Objective-C, abandoned partway through the lineup management screens after several other apps beat me to the App Store. I shipped 1.0 of BaseballScorer on April 15, 2026 — about seventeen years and three months later. The gap between those dates isn’t a story about Swift vs. Objective-C. It’s a story about productivity floors.

The 2009 version stalled because building a real iOS app — even one whose design I’d been sketching since the Apple Newton — was a part-time hobbyist’s nightmare. The 2026 version shipped because Claude Code took “build the version of this app I actually want, even though the market is crowded with perfectly good alternatives” from a fantasy into a practical project. The first commit landed on March 28, 2026 — the regular season was about to start. The 1.0 release went live on the App Store eighteen days later, and 1.0 wasn’t a hollow milestone-for-the-sake-of-shipping. It was a genuinely usable scoring app — one you could take to a ballgame and actually score a game with. The two months between 1.0 and 1.3 have been a steady cadence of upgrades, increasingly guided by feedback from real users — both App Store downloaders and folks on the public TestFlight link — rather than by my own backlog. There’s still plenty more in the pipeline.

This is the first of four posts where I try to be honest about what that looked like. Not “Claude wrote my app for me” — that’s not what happened — but a frank account of what I brought, what Claude brought, what went well, what I regret, and what I’d do differently. The next three posts will go deep on (2) the release workflow and the custom Claude Code skills I actually use, (3) the persistent memory system that lets a single Claude conversation feel coherent across months, and (4) the specific differences between running standalone Claude Code in a terminal and the Xcode-integrated version — which is the post I most wish I’d had when I started. This one is the arc.

Why ship into a crowded market?

If you search “baseball scoring app” in the App Store right now you’ll find plenty of decent options. I know, because I checked, repeatedly, every time I asked myself whether this was a sensible use of my time. The honest answer is: no, not by any normal definition of “sensible.” I’m not going to dethrone anyone. Most baseball-scoring app users are loyal to whatever they learned first, and they should be — the existing options work fine.

The reason I built it anyway is the same reason you might build your own task tracker even though Todoist exists. I had a specific mental model of how scoring a baseball game on an iPad should feel, and none of the existing apps matched it. Some were too “this is a database, please fill it in.” Others tried to be too clever about inferring plays and left me fighting them when I wanted to record something unusual. My design philosophy (which I’ll come back to in a minute) is “the app trusts you.” Everything is optional. Nothing blocks you from moving forward. You can be sloppy and still end up with a usable scorecard, because in the bleachers, sometimes you have to be sloppy.

The other “why now” factor: I’d recently transitioned mostly into retirement, but I was a computer nerd before anyone paid me to be one, so “stop doing tech because no one’s paying me” was never going to be the deal. A project I genuinely wanted to use was the right shape for that phase of life. Side projects without bosses tend to either die fast or finish well, and this one was going to do one of the two.

Here’s the part that’s relevant to the Claude Code angle: that specificity is exactly the kind of thing that used to make “build it yourself” infeasible. Not because the design was hard — most of the design was twenty-plus years old, sitting in my head since the Newton days. It was infeasible because the cost of translating a clear design into working SwiftUI + SwiftData code, with reasonable test coverage and a clean release process, exceeded what I could spend on a side project. Claude Code dropped that cost enough that “build my own version of an app that already exists” went from “fun fantasy” to “actually happening on weekends.”

If you have a personal-version-of-an-existing-app project that you’ve been sitting on, this is the part of the post where I tell you to just start it. You don’t need a market opportunity. You need a productivity floor low enough that doing it for yourself is a reasonable trade for your time.

What came from where

Almost every Claude Code post I’ve read leaves the credit question vague. Mine won’t. Here’s the honest division on BaseballScorer:

From me:

The first two bullets below trace back to a Newton-era bitmap mockup I made decades ago. The rest emerged during this project, mostly from the iPad form factor making certain choices obvious.
- The basic layout — a line score across the top, and then the rest of the screen is the main scoring area with tap targets for the fielding sequence, inning summary down the left, previous at-bats across the top
- The idea of tapping bases to drive baserunner actions
- The “the app trusts you” philosophy — every field optional, an incomplete at-bat never blocks progress, casual scoring is the default. Getting distracted or interrupted and missing a play shouldn’t make it impossible to continue.
- The decision to make portrait orientation the scoring view and landscape the scorecard grid (an iPad-driven call — the Newton mockup had no equivalent)
- The K vs. Kc distinction (swinging strikeout vs. called/looking)
- iPad-primary with iPhone as an adaptive secondary

From Claude, almost entirely:
- The color system for ball / strike / foul / hit-by-pitch. I’d envisioned the buttons monochrome with inapplicable ones dimmed. Claude proposed a color encoding and I liked it immediately. It’s now one of my favorite things about the app.
- Flipping the button set between “pitch results” and “in-play outcomes” depending on the moment in the at-bat. My original design had every button visible all the time with the inapplicable ones grayed out. The flip is better. I didn’t see it.
- Most of the SwiftUI idiom. My only prior iOS App Store release was a collectible-card-game companion app, written in Objective-C years ago (nothing to do with baseball, nothing to do with Swift). BaseballScorer is my first Swift project and my first SwiftUI project. Claude carried me through the language and framework. I had strong opinions about what the UI should do. Claude knew how to make SwiftUI actually do it.

Heavily collaborative:
- The data model. I had an event-sourcing mental model from a separate project, and Claude knew SwiftData’s quirks. We arrived at the current Game → Inning → AtBat → PlayEvent structure together. (We also made an architectural decision there I now regret — more on that below.)
- The release workflow and the custom skills. I brought the discipline; Claude wrote most of the actual fastlane glue and the skill definitions.
- The test discipline. 365 unit tests, zero failing, as of v1.3-b26. I insisted on the failing-reproducer-test-first habit for bug fixes; Claude wrote most of the tests.

This is, I think, the actually-honest shape of a productive human/AI collaboration on a real codebase. It’s not “Claude built it.” It’s not “I built it with Claude as a fancy autocomplete.” It’s a real division of labor where one side brings vision and judgment and the other side brings language fluency and willingness to write the boring parts, and they meet in the middle on the interesting parts.

The structural mistake (and the screenshot that proved it was real)

In early April 2026, between TestFlight builds 6 and 7, a tester I’d never met sent me a screenshot via the public TestFlight link. He was trying to catch up to a live NYY-at-TB game using my MLB-feed catch-up path. The screenshot showed four distinct symptoms in one frame:
1. Three outs filled in on the indicator, but the half-inning hadn’t ended and the active-batter card was still up
2. The active batter card showed Goldschmidt (a Yankees player) while TB was supposed to be batting
3. The runner-action prompt offered “Stay on 3rd” — but the diamond showed no runner on 3B
4. The at-bat history rendered out of chronological order (1st → 5th → 3rd instead of 1st → 3rd → 5th)

Each of those symptoms looked like a different bug. They were not. They were four faces of the same structural problem.

A few months earlier I had written, mostly for my own future reference, a document called docs/architecture-retrospective.md — the kind of “what would I do differently” file you write after a long debugging session, more for catharsis than for action. It listed five “structural pain points” — places where the data model wasn’t wrong exactly, but was generating recurring bug classes rather than one-off bugs. The five pain points it called out:
1. AtBat is doing too much (it’s a historical record and a container for events and a lookup point for rendering)
2. Player identity in events is fragile (SwiftData persistent identifiers can be temporary until the next save — found this out the hard way)
3. State has two implementations (live view-model state vs. reconstructed-from-history state) that drift
4. Catch-up from MLB feed and manual scoring are parallel implementations that diverge subtly
5. Substitution semantics (pinch hitters, pinch runners, defensive substitutions) are tangled across three different storage locations

The retrospective predicted that these pain points would generate exactly the bug classes that the tester’s screenshot demonstrated. Reading the report, I could point at each symptom and say which structural pain it came from. That’s a useful diagnostic moment and a horrible feeling at the same time. The doc had explicitly listed “the same bug class keeps recurring” as a triggering criterion for pulling refactor work forward. The screenshot tripped it.

I pulled three refactors that were scheduled for 1.1 and 1.2 into the 1.0 release, shipped them across builds 10–12, and the entire class of “catch-up shows impossible state” bugs disappeared as a side effect of the refactors rather than as a targeted patch.

The lesson — and this is one of the few times I’m going to be tutorial-mode prescriptive in this post — is write the retrospective doc before you need it. Not as planning. Not as a refactor commitment. As a catch-basin for “this keeps biting me” intuitions, with explicit triggering criteria for when intuition becomes action. Mine sits in the repo at docs/architecture-retrospective.md. When the trigger fires, you don’t have to re-derive the analysis under pressure. You just open the doc and execute the plan you wrote when your head was clear.

I would not have written that doc without Claude. Not because it required AI to write — it didn’t — but because the conversational format of working with Claude generates these documents as a natural side effect of bug-fix sessions. “Tell me what we’re actually fighting here” turns into a doc that I can keep, not a Slack thread that scrolls into oblivion.

The honest regret: two paths that should have been one

Here’s the architectural decision I’d take back if I could.

BaseballScorer can score a game two ways. You can score it by hand, pitch by pitch — the original use case, the one I designed for. Or you can let the app pull from the MLB Stats API and “catch up” to a live game, populating the scorecard from the feed so you can join in mid-game without having to manually backfill the first three innings.

These two paths share almost no code. Manual scoring goes through ScoringViewModel.recordResult / recordPitch / placeRunner and friends. Catch-up goes through MLBAutoFillService.populateFromFeed, which directly mutates the SwiftData models. By the time I noticed this was a problem, both paths had grown enough complexity that unifying them wasn’t a quick refactor.

The cost shows up most clearly in runner advancement. On the manual path, the user has full control — they can move every runner exactly where they need to be. On the catch-up path, if the MLB feed doesn’t surface a runner movement (or we miss one during ingestion), it’s just gone, with no equivalent corrective UI. Two paths, two test surfaces, two places to fix every bug, and a class of “catch-up does X but manual does Y” inconsistencies that I’ve patched at least a dozen times.

If I were starting over, I’d build a typed event log first, and force both paths to produce events that feed a single applier. Both the manual UI and the feed parser would emit the same runnerMovement events; one code path would consume them. The retrospective doc lays this out as a future refactor — possibly worth doing if 1.4’s “Live Game Assistance” theme makes the divergence painful enough — but it would have been trivial to design in on day one and is genuinely hard to refactor in now.

The general lesson, if you want one: when you have two code paths that produce “the same kind of state” through different mechanisms, ask very hard whether they can share a layer. The answer is almost always yes, and almost always you’ll only see how to do it once you’ve already built both.

A few things I would tell you to do

Tutorial mode, briefly, because abstract advice gets nodded at and forgotten:
- Keep custom skills minimal. On my prior Java projects I had eight specialized agents — config-manager, debugging-helper, documentation-writer, framework-developer, performance-optimizer, pipeline-specialist, service-developer, test-writer — each with its own persona prompt. In hindsight: overkill. On BaseballScorer I have five skills (bug-fix, release, commit, testflight-upload, security-review), each tied to a specific recurring multi-step workflow that’s actually painful to do by hand. That’s the right number. If you find yourself writing a skill for “the documentation persona,” that’s a sign your main agent is fine and you’re inventing problems.
- Write the failing test first for bug fixes. Even when Claude is going to write the test for you. The discipline keeps you honest about whether you’re actually fixing the bug or just papering over a symptom. My bug-fix skill enforces this by convention — it won’t write a fix until there’s a test file with test_bugfix_<shortDescription> in it.
- Branch off the buggy build’s tag, not main. When a bug ships in v1.2-b23, the fix branch starts from that tag, not from current main. This guarantees the reproducer test actually reproduces the bug in question, and gives you a clean cherry-pick path back to main once the fix is verified. I thought this was standard practice from my pre-Claude career; I’m told it’s less common than I assumed.
- Make App Store metadata source-of-truth in your repo, not in App Store Connect. I learned this one the hard way. I added some marketing copy (“no ads, no paywall”) directly in App Store Connect and forgot about it. A subsequent fastlane push regenerated the metadata from a doc in my repo and overwrote my edits with no warning. Now docs/app-store-metadata.md is the only thing I touch, and fastlane is the only thing that talks to App Store Connect.
- Write the retrospective doc before you need it. I already preached this one above. I’ll say it again because it’s the highest-ROI habit I’ve adopted on this project.

What’s next

If you’re a baseball scorer — or curious enough about scoring to want to learn — the app is on the App Store, and the companion scoring guide lives at scoring.theyawns.com. The guide is about 20,000 words of “here’s how baseball scoring actually works,” from “what is a 6-4-3?” to the Manager Challenge notation we added in 1.3. If you’re wondering whether to bother learning to score: the app makes it about as low-stakes as it can be, and the guide tries to do the same.

The next post in this series gets into the release workflow — the actual fastlane glue, the custom skills, the gotchas I hit, the time fastlane silently crashed on a non-ASCII byte and I had to learn more about shell locales than I wanted to. The post after that is on the persistent memory system that lets Claude keep coherent context across months of work without me re-explaining the codebase every session. And the final post is the one that’s most specifically for iOS developers: a side-by-side guide to moving from standalone (terminal) Claude Code to the Xcode-integrated version, including the commands and modes that aren’t there and what to do instead. All three will be more concrete and more tutorial-shaped than this one.

That’s where we’ll leave things for today.

Part of an ongoing series at Nodes and Edges. Earlier post in a related vein: Baseball Invented Event Sourcing 150 Years Ago.

June 24, 2026

Hazelcast Transactional Outbox: Guaranteed Delivery

Part 8 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

In Part 7, we added circuit breakers and retry to protect saga listeners from transient failures on the consumer side. That covers what happens when a service receives an event and can’t process it. But we haven’t talked about what happens when the event never leaves the building.

Quick refresher on our dual-instance architecture: each service runs an embedded Hazelcast instance for local Jet pipeline processing and a client connected to the shared cluster for cross-service ITopic communication. After the pipeline processes an event, the EventSourcingController republishes it to the shared cluster so saga listeners in other services can react.

That republish step? It was a fire-and-forget call:

// The old approach — fragile
try {
    ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
    topic.publish(pending.eventRecord);
} catch (Exception e) {
    logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
    // Event is permanently lost!
}

If the shared cluster is unreachable — network partition, cluster restart, someone tripping over the power cable — the event vanishes. The saga never progresses. Eventually the saga timeout detector marks it as failed, but by then the original event data is gone and there’s nothing to retry.

The Transactional Outbox Pattern fixes this. Instead of publishing directly to the shared cluster, the controller writes the event to a local outbox — an IMap on the embedded Hazelcast instance — and a separate publisher component picks it up and delivers it. If delivery fails, the entry stays in the outbox and gets retried.

Why Direct Publishing Fails

The problem is fundamental. Publishing to an external system (the shared cluster) and completing a local operation (the Jet pipeline) are two separate operations that can’t be made atomic.

Failure timeline for direct publishing — the Jet pipeline updates the local event store and materialized view, but the publish to the shared cluster ITopic fails on a network partition and the event is lost with nothing left to retry

The event is safely stored in the local event store and materialized view, but the cross-service notification is lost. You could retry in place, but that blocks the Jet pipeline for all events. You could schedule an async retry, but if the process restarts, that retry state is gone too.

The outbox pattern trades immediate delivery for guaranteed delivery. Write to a durable local store, deliver asynchronously, retry until it works. It’s the standard solution in event-driven architectures for good reason.

Architecture

The outbox IMap lives on the embedded Hazelcast instance — the same instance that hosts the event store and materialized views. Writing to it is a local operation. If the embedded instance is up (and it must be, since the pipeline just ran), the outbox write succeeds.

The OutboxEntry

Each outbox entry captures everything needed to deliver the event later:

public class OutboxEntry {

    private String eventId;          // Matches the domain event's eventId
    private String eventType;        // ITopic name (e.g., "OrderCreated")
    private GenericRecord eventRecord; // The serialized event to publish
    private int retryCount;          // Delivery attempts so far
    private Status status;           // PENDING, DELIVERED, or FAILED
    private Instant createdAt;       // When the entry was created
    private Instant lastAttemptAt;   // When the last delivery attempt occurred
    private String failureReason;    // Most recent failure message

    public enum Status {
        PENDING,    // Awaiting delivery
        DELIVERED,  // Successfully published to shared cluster
        FAILED      // Permanently failed after max retries
    }
}

The eventRecord field is the full GenericRecord that needs to go to the shared cluster’s ITopic — same record the Jet pipeline produces, complete with saga metadata like sagaId and correlationId.

OutboxStore: The Interface

Six methods covering the full lifecycle:

public interface OutboxStore {

    void write(OutboxEntry entry);

    List<OutboxEntry> pollPending(int maxBatchSize);

    void markDelivered(String eventId);

    void markFailed(String eventId, String reason);

    void incrementRetryCount(String eventId, String failureReason);

    long pendingCount();
}

Provider-agnostic. The Hazelcast implementation uses an IMap, but the interface could just as easily sit in front of a database table.
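To show how little the interface assumes about storage, here is a rough in-memory sketch of the same six-method lifecycle. It is not part of the framework: Entry is a trimmed-down, hypothetical stand-in for OutboxEntry (no payload, epoch millis instead of Instant), and the class does not implement the real OutboxStore so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the six-method lifecycle. Entry is a trimmed-down,
// hypothetical stand-in for OutboxEntry (the payload is omitted).
final class InMemoryOutboxSketch {

    enum Status { PENDING, DELIVERED, FAILED }

    static final class Entry {
        final String eventId;
        final long createdAtMillis;
        Status status = Status.PENDING;
        int retryCount;
        String failureReason;

        Entry(String eventId, long createdAtMillis) {
            this.eventId = eventId;
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    synchronized void write(Entry entry) {
        entries.put(entry.eventId, entry);
    }

    /** Oldest PENDING entries first, capped at maxBatchSize. */
    synchronized List<Entry> pollPending(int maxBatchSize) {
        List<Entry> pending = new ArrayList<>();
        for (Entry e : entries.values()) {
            if (e.status == Status.PENDING) {
                pending.add(e);
            }
        }
        pending.sort(Comparator.comparingLong((Entry e) -> e.createdAtMillis));
        return new ArrayList<>(pending.subList(0, Math.min(maxBatchSize, pending.size())));
    }

    synchronized void markDelivered(String eventId) {
        Entry e = entries.get(eventId);
        if (e != null) {
            e.status = Status.DELIVERED;
        }
    }

    synchronized void markFailed(String eventId, String reason) {
        Entry e = entries.get(eventId);
        if (e != null) {
            e.status = Status.FAILED;
            e.failureReason = reason;
        }
    }

    /** Records a failed attempt; the entry stays PENDING so the next poll retries it. */
    synchronized void incrementRetryCount(String eventId, String failureReason) {
        Entry e = entries.get(eventId);
        if (e != null) {
            e.retryCount++;
            e.failureReason = failureReason;
        }
    }

    synchronized long pendingCount() {
        long count = 0;
        for (Entry e : entries.values()) {
            if (e.status == Status.PENDING) {
                count++;
            }
        }
        return count;
    }
}
```

The detail worth noticing is incrementRetryCount: a failed delivery bumps the counter but leaves the status alone, so the entry shows up again on the next pollPending. Only markDelivered or markFailed takes it out of circulation.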

HazelcastOutboxStore

The Hazelcast implementation stores entries as Compact-serialized GenericRecord values in an IMap:

public class HazelcastOutboxStore implements OutboxStore {

    private static final String SCHEMA_NAME = "OutboxEntry";
    private final IMap<String, GenericRecord> outboxMap;

    public HazelcastOutboxStore(HazelcastInstance hazelcast, MeterRegistry meterRegistry) {
        this.outboxMap = hazelcast.getMap(DEFAULT_MAP_NAME);
    }
}

You might wonder why we’re using GenericRecord instead of storing OutboxEntry Java objects directly. The problem is that OutboxEntry has an Instant field and a nested GenericRecord — neither of which Hazelcast’s zero-config Compact serialization can handle. We’d need a custom CompactSerializer registered on every Hazelcast instance configuration. Instead, we convert at the boundary:

static GenericRecord toRecord(final OutboxEntry entry) {
    return GenericRecordBuilder.compact(SCHEMA_NAME)
            .setString("eventId", entry.getEventId())
            .setString("eventType", entry.getEventType())
            .setGenericRecord("eventRecord", entry.getEventRecord())
            .setInt32("retryCount", entry.getRetryCount())
            .setString("status", entry.getStatus().name())
            .setInt64("createdAt", entry.getCreatedAt().toEpochMilli())
            .setNullableInt64("lastAttemptAt",
                    entry.getLastAttemptAt() != null
                            ? entry.getLastAttemptAt().toEpochMilli() : null)
            .setString("failureReason", entry.getFailureReason())
            .build();
}

A few things going on here. Instant becomes int64 epoch millis — compact, sortable, unambiguous. lastAttemptAt uses setNullableInt64 because it’s null until the first delivery attempt. The nested eventRecord uses setGenericRecord, which Compact handles natively. And status is stored as the enum name string, which makes it readable in Management Center and queryable with Predicates.equal().
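The Instant handling above can be isolated into a tiny helper. This is a minimal sketch, not framework code: InstantFields and its two methods are hypothetical names used only for illustration. It shows the null-preserving conversion in both directions, which is the shape a fromRecord counterpart would need for lastAttemptAt.

```java
import java.time.Instant;

// Hypothetical helper, not part of the framework: converts between Instant and
// the nullable epoch-millisecond representation stored in the record.
final class InstantFields {

    private InstantFields() {
    }

    // Null stays null so "never attempted" survives the round trip.
    static Long toEpochMillis(Instant value) {
        return value == null ? null : value.toEpochMilli();
    }

    // Inverse conversion, again passing null through untouched.
    static Instant fromEpochMillis(Long millis) {
        return millis == null ? null : Instant.ofEpochMilli(millis);
    }
}
```

One consequence worth knowing: toEpochMilli() drops anything finer than a millisecond, so two entries created within the same millisecond compare as equal when sorted by createdAt.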

Polling uses a Hazelcast predicate to filter by status, sorted by creation time so the oldest entries are delivered first:

@Override
public List<OutboxEntry> pollPending(final int maxBatchSize) {
    final Collection<GenericRecord> pending = outboxMap.values(
            Predicates.equal("status", OutboxEntry.Status.PENDING.name()));

    return pending.stream()
            .map(HazelcastOutboxStore::fromRecord)
            .sorted(Comparator.comparing(OutboxEntry::getCreatedAt))
            .limit(maxBatchSize)
            .collect(Collectors.toList());
}

The OutboxPublisher

The publisher bridges the outbox and the shared cluster. The obvious approach is to poll on a fixed interval — once per second, say — but that adds latency we don’t need. We know exactly when a new entry arrives.

Event-Driven Wake-Up

The publisher uses a Semaphore to sleep until someone signals it:

public class OutboxPublisher {

    private final Semaphore wakeUp = new Semaphore(0);

    public void notifyNewEntry() {
        // Release at most 1 permit — avoids unbounded accumulation
        if (wakeUp.availablePermits() == 0) {
            wakeUp.release();
        }
    }

    public boolean waitForWork() {
        try {
            return wakeUp.tryAcquire(
                    properties.getPollInterval().toMillis(),
                    TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

When the EventSourcingController writes an outbox entry, it calls notifyNewEntry() right after. The publisher wakes up, claims the pending entries (up to max-batch-size per cycle), and delivers them. Under normal conditions, the time from event creation to shared-cluster delivery is sub-millisecond.

The poll interval (default 1 second) is the safety net. If a signal gets missed — maybe the publisher was busy with a previous batch — the timeout ensures nothing sits around for too long.

This is a JVM-local semaphore, not a distributed one. That’s fine. When the service scales to multiple replicas with per-service clustering (ADR 013), each replica has its own publisher. The semaphore wakes the local publisher instantly for locally-written events. Events written by other replicas get picked up within the poll interval. The actual coordination — preventing two replicas from delivering the same event — happens in claimPending() via an atomic ClaimEntryProcessor on the IMap.
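Here is one possible variant of the wake-up signal, written as a standalone sketch. WakeUpSignal is a hypothetical class, not the framework's OutboxPublisher, and the poll interval is passed in directly instead of coming from properties. Rather than checking availablePermits() before releasing (a check-then-act that two concurrent writers can both pass), it lets writers release freely and has the consumer drain whatever accumulated.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the publisher's wake-up logic; not the framework class.
final class WakeUpSignal {

    private final Semaphore permits = new Semaphore(0);
    private final long pollIntervalMillis;

    WakeUpSignal(long pollIntervalMillis) {
        this.pollIntervalMillis = pollIntervalMillis;
    }

    // Called by writers. Any number of signals may arrive between two polls.
    void notifyNewEntry() {
        permits.release();
    }

    // Blocks until signalled or until the poll interval elapses.
    // Returns true when woken by a signal, false on timeout or interrupt.
    boolean waitForWork() {
        try {
            boolean signalled = permits.tryAcquire(pollIntervalMillis, TimeUnit.MILLISECONDS);
            // Collapse every signal that piled up into this single wake-up.
            permits.drainPermits();
            return signalled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Draining is safe as long as the writer stores the entry before signalling and the publisher reads the map after waking: any entry whose signal was drained is already visible to the next claim.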

The Publish Loop

public void publishPendingEntries() {
    if (sharedHazelcast == null) {
        if (!noSharedClusterWarningLogged) {
            logger.warn("No shared Hazelcast instance — outbox delivery skipped");
            noSharedClusterWarningLogged = true;
        }
        return;
    }

    List<OutboxEntry> claimed = outboxStore.claimPending(
            properties.getMaxBatchSize(), memberUuid);

    if (claimed.isEmpty()) {
        return;
    }

    for (OutboxEntry entry : claimed) {
        try {
            ITopic<GenericRecord> topic = sharedHazelcast.getTopic(entry.getEventType());
            topic.publish(entry.getEventRecord());
            outboxStore.markDelivered(entry.getEventId());
        } catch (Exception e) {
            if (entry.getRetryCount() + 1 >= properties.getMaxRetries()) {
                outboxStore.markFailed(entry.getEventId(),
                        "Max retries exceeded: " + e.getMessage());
            } else {
                outboxStore.incrementRetryCount(entry.getEventId(), e.getMessage());
            }
        }
    }
}

Note claimPending rather than pollPending. The claiming mechanism uses an EntryProcessor to atomically transition entries from PENDING to CLAIMED, tagging them with the claiming member’s UUID. This prevents two publisher instances from delivering the same event — important once you’re running multiple replicas.
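To make the claim step concrete, here is a rough in-memory model of it. Everything in it is an assumption for illustration: ClaimSketch, its Entry record, and the use of ConcurrentHashMap.computeIfPresent as a stand-in for running an EntryProcessor on the IMap. It skips the createdAt ordering and only shows why an atomic per-key transition keeps two publishers from claiming the same entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the outbox map; the names here are illustrative only.
final class ClaimSketch {

    enum Status { PENDING, CLAIMED }

    record Entry(String eventId, Status status, String claimedBy) { }

    private final Map<String, Entry> outbox = new ConcurrentHashMap<>();

    void write(String eventId) {
        outbox.put(eventId, new Entry(eventId, Status.PENDING, null));
    }

    // Tries to claim PENDING entries for the given member, up to maxBatchSize.
    List<String> claimPending(int maxBatchSize, String memberId) {
        List<String> claimed = new ArrayList<>();
        for (String eventId : outbox.keySet()) {
            if (claimed.size() >= maxBatchSize) {
                break;
            }
            boolean[] won = new boolean[1];
            // computeIfPresent runs atomically per key, so the status check and the
            // transition cannot interleave with another claimer on the same entry.
            outbox.computeIfPresent(eventId, (id, current) -> {
                if (current.status() != Status.PENDING) {
                    return current; // someone else got there first
                }
                won[0] = true;
                return new Entry(id, Status.CLAIMED, memberId);
            });
            if (won[0]) {
                claimed.add(eventId);
            }
        }
        return claimed;
    }
}
```

A second caller, whether another replica or the same member on its next cycle, sees CLAIMED and comes away empty-handed.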

When no shared cluster is configured (single-node dev mode), the publisher logs one warning and stops trying. Events pile up as PENDING in the outbox. They’ll drain as soon as a shared cluster appears.

Retry escalation is per-entry:

Attempt 1: fails → incrementRetryCount (retryCount=1)
Attempt 2: fails → incrementRetryCount (retryCount=2)
...
Attempt 5: fails → markFailed (retryCount=5 >= maxRetries=5)

Once marked FAILED, the entry stops showing up in claim results. The failure reason is preserved for debugging.
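The escalation rule is small enough to model on its own. In this sketch RetryEscalation and Outcome are hypothetical names; only the comparison mirrors the publish loop's retryCount + 1 >= maxRetries check.

```java
// Illustrative model of the escalation rule; the real logic lives in the publish loop.
final class RetryEscalation {

    enum Outcome { RETRY_LATER, FAILED }

    private RetryEscalation() {
    }

    // retryCount is the number of failed attempts recorded before this one.
    static Outcome onDeliveryFailure(int retryCount, int maxRetries) {
        return retryCount + 1 >= maxRetries ? Outcome.FAILED : Outcome.RETRY_LATER;
    }
}
```

With maxRetries=5, failures at retryCount 0 through 3 come back as RETRY_LATER, and the failure at retryCount 4 (the fifth attempt) comes back as FAILED.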

Scheduling

OutboxAutoConfiguration hooks the publisher into Spring’s task scheduler:

@EnableScheduling
public class OutboxAutoConfiguration implements SchedulingConfigurer {

    @Override
    public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
        taskRegistrar.addFixedDelayTask(() -> {
            outboxPublisher.waitForWork();       // blocks until signaled or timeout
            outboxPublisher.publishPendingEntries();
        }, 1);  // 1ms loop delay — actual timing controlled by semaphore
    }
}

The 1ms fixed delay means the loop restarts almost immediately after each cycle, but waitForWork() controls the actual pacing. The thread blocks on the semaphore until either a permit is released or the poll interval elapses. Near-instant delivery under normal load, guaranteed pickup if a signal is missed.

Integration with EventSourcingController

The controller’s republishToSharedCluster now checks for an outbox store first:

private void republishToSharedCluster(PendingCompletion<K> pending) {
    if (sharedHazelcast == null || pending.eventRecord == null || pending.eventType == null) {
        return;
    }
    if (outboxStore != null) {
        OutboxEntry entry = new OutboxEntry(
                pending.completionInfo.getEventId(),
                pending.eventType,
                pending.eventRecord
        );
        outboxStore.write(entry);
        if (outboxPublisher != null) {
            outboxPublisher.notifyNewEntry();
        }
    } else {
        // Legacy direct publish (when outbox is disabled)
        try {
            ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
            topic.publish(pending.eventRecord);
        } catch (Exception e) {
            logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
        }
    }
}

Fully backward compatible. When outboxStore is injected, events go through the durable path. When it’s null, you get the old fire-and-forget behavior. The OutboxStore is wired through each service’s config as an optional dependency:

@Bean
public EventSourcingController<Order, String, DomainEvent<Order, String>> orderController(
        HazelcastInstance hazelcastInstance,
        @Qualifier("hazelcastClient") HazelcastInstance hazelcastClient,
        @Autowired(required = false) OutboxStore outboxStore,
        ...) {
    return EventSourcingController.builder()
            .hazelcast(hazelcastInstance)
            .sharedHazelcast(hazelcastClient)
            .outboxStore(outboxStore)
            .build();
}

Delivery Guarantees

The outbox provides at-least-once delivery. If the publisher crashes after publishing to the ITopic but before calling markDelivered(), the next cycle picks up the same entry and delivers it again. Events are never lost as long as the embedded Hazelcast instance’s IMap data is intact.

At-least-once means consumers may see duplicates. That’s where the Idempotency Guard from Part 9 comes in — it deduplicates on the consumer side, complementing the outbox’s guaranteed delivery.

As for ordering: events for the same aggregate are written to the outbox in sequence order (the Jet pipeline processes them sequentially), and claimPending sorts by createdAt. But if two events are pending simultaneously and the first one fails while the second succeeds, they’ll arrive out of order. For our saga use case that’s acceptable — each step is identified by sagaId and eventType, and the saga state machine handles duplicates and out-of-order delivery.

Configuration

framework.outbox.*

Property	Default	Description
enabled	true	Master toggle for the outbox pattern
poll-interval	1000 (ms)	Fallback interval if signal is missed
max-batch-size	50	Maximum entries per poll cycle
max-retries	5	Delivery attempts before permanent failure
entry-ttl	24h	How long DELIVERED entries survive in the map

Metrics

Metric	Type	Description
outbox.entries.written	Counter	Events written to the outbox
outbox.entries.delivered	Counter	Events delivered to shared cluster
outbox.entries.failed	Counter	Events permanently failed
outbox.publish.duration	Timer	Time per publish cycle

To disable the outbox and use direct publishing:

framework:
  outbox:
    enabled: false

What’s Next

The outbox guarantees events reach the shared cluster. But what happens when they get there and the consumer can’t process them? The consumer might crash, the business logic might throw, the circuit breaker might be open.

In Part 9, we add two patterns that work together: a Dead Letter Queue that captures events that fail consumer-side processing, and an Idempotency Guard that prevents duplicate processing — the natural flip side of at-least-once delivery.

Next up: Dead Letter Queues and Idempotency

Previous: Circuit Breakers and Retry: Resilient Hazelcast Sagas

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 22, 2026

Circuit Breakers and Retry: Resilient Hazelcast Sagas

Part 7 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

A commercial airliner doesn’t fall out of the sky when an engine fails. It keeps flying. The remaining engine provides enough thrust to reach the nearest airport, the crew follows a well-rehearsed procedure, and the passengers (ideally) never know how close things got. Aviation engineers figured this out decades ago: you can’t prevent every failure, so you build the system to keep working when parts of it stop. (There’s even a great acronym for it: ETOPS, which officially stands for Extended-range Twin-engine Operations Performance Standards, but which pilots will tell you really means “Engines Turn Or Passengers Swim.”)

Microservices need the same philosophy. Not because individual services fail as dramatically as a jet engine, but because they fail far more often. A garbage collection pause. A network blip. A downstream provider having a bad day. A deployment rolling through the cluster at 2 AM. In a monolith, these are minor hiccups — the kind of thing you might not even notice in the logs. In a distributed system where five services coordinate through asynchronous events, a hiccup in one service can propagate to all five in the time it takes to brew a cup of coffee.

And the ways things go wrong are… creative. The catalog of distributed system failure modes is large enough to fill a textbook. Several textbooks, actually — and people have. Too many for a single pattern or a single blog post.

So we’re spending the next three posts on resilience. This one covers circuit breakers and retry — protecting saga listeners when downstream services misbehave. Part 8 tackles the transactional outbox pattern, which guarantees events aren’t lost between producer and consumer. And Part 9 adds dead letter queues and idempotency guards — the safety nets for events that fail permanently or arrive more than once. Three different failure modes, three different mechanisms.

Back in Part 4, we built a choreographed saga for order fulfillment. Three services — Inventory, Payment, and Order — coordinate through Hazelcast ITopic events published on a shared cluster. The happy path works beautifully. Without resilience patterns, though, a single struggling service can drag the whole saga down with it. A slow Payment Service fills up the Inventory Service’s thread pool with blocked calls. A transient network error permanently loses an event. A burst of failures overwhelms everything simultaneously.

That’s what we’re fixing.

The Problem: Cascading Failures

Here’s the order fulfillment saga on a good day:

Order fulfillment saga happy path — Inventory, Payment, and Order services exchanging OrderCreated, StockReserved, PaymentProcessed, and OrderConfirmed events over Hazelcast ITopic

Each step is an ITopic message on the shared Hazelcast cluster. Each listener calls a local service method — IMap operations, Jet pipeline processing, further ITopic publishing. Events flow, state updates, everyone’s happy.

Now imagine the Payment Service is having a rough morning. Some downstream payment provider is dragging, and every StockReserved event that arrives takes 30 seconds to process instead of the normal 50 milliseconds. Without any resilience mechanism, here’s what unfolds:

Inventory keeps publishing StockReserved events at the normal rate
Payment’s listener thread pool fills up with slow calls
New events queue behind the blocked threads
ITopic backpressure eventually slows the shared cluster itself
Other listeners on the same cluster — including Inventory and Order — start seeing delays
The entire saga grinds to a halt

One service had a problem. Now every service has a problem. This is a cascading failure, and it’s the defining hazard of distributed architectures. The shared communication fabric that makes coordination possible is the same fabric that propagates failure.

Enter Resilience4j

The patterns we need — circuit breakers, retry with backoff, bulkheads, rate limiters — have been well understood for years. Netflix popularized them in the Java world with Hystrix, which became the standard library for microservice resilience through most of the 2010s. But Netflix put Hystrix into maintenance mode in 2018 and eventually stopped development entirely.

The successor that emerged is Resilience4j. It’s a lightweight fault tolerance library for Java 8+ built around functional composition — you wrap a Supplier or Runnable with decorators, and the decorators handle the resilience logic. It’s not just a circuit breaker library, though that’s what most people know it for. It actually provides six core modules: circuit breaker, retry, bulkhead (resource isolation), rate limiter, time limiter, and cache. Each is standalone. You pick what you need and leave the rest on the shelf.

There are other options — Failsafe is a solid zero-dependency alternative, and Alibaba’s Sentinel targets high-traffic rate limiting scenarios. But Resilience4j has become the de facto choice for Spring Boot microservices. The Spring integration is mature, Micrometer metrics work out of the box, and @ConfigurationProperties binding means your resilience settings live in the same YAML as everything else. For our framework, we’re using two of the six modules: CircuitBreaker and Retry.

Circuit Breakers: Automatic Service Isolation

A circuit breaker does what it sounds like. It monitors the failure rate of an operation and automatically stops calling it when failures exceed a threshold — the same idea as the breaker panel in your house. Too much current flows through the circuit, the breaker trips, the wiring doesn’t catch fire. In our case, “too much current” means too many failed calls, and “the wiring” is every other service sharing that communication path.

Three States

Circuit breaker state machine — CLOSED trips to OPEN when the failure rate crosses the threshold, OPEN moves to HALF-OPEN after the wait duration, and HALF-OPEN returns to CLOSED on success or back to OPEN on failure

CLOSED is normal operation. All calls pass through, and the circuit breaker quietly records outcomes in a sliding window. OPEN means the breaker has tripped — all calls are immediately rejected with a CallNotPermittedException, and no load reaches the downstream service at all. HALF-OPEN is the recovery probe: a limited number of test calls pass through. If they succeed, the breaker returns to CLOSED. If they fail, back to OPEN. Rinse and repeat until the downstream service gets its act together.
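The three states are easier to reason about with a toy implementation in hand. What follows is a simplified teaching model, not Resilience4j's implementation: TinyBreaker and all of its members are hypothetical, it is not thread-safe, it re-opens on the first failed probe, and it does not cap how many probes run in HALF_OPEN. The clock is injected so the transitions can be exercised without sleeping.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Simplified, single-threaded model of the CLOSED / OPEN / HALF_OPEN cycle.
final class TinyBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50.0
    private final long waitInOpenMillis;
    private final int probesToClose;
    private final LongSupplier clock;

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;
    private int probeSuccesses;

    TinyBreaker(int windowSize, double failureRateThreshold, long waitInOpenMillis,
                int probesToClose, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
        this.probesToClose = probesToClose;
        this.clock = clock;
    }

    State state() {
        return state;
    }

    // Returns false when the call must be rejected without reaching the downstream service.
    boolean tryAcquirePermission() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitInOpenMillis) {
            state = State.HALF_OPEN;
            probeSuccesses = 0;
        }
        return state != State.OPEN;
    }

    void recordSuccess() {
        if (state == State.HALF_OPEN) {
            if (++probeSuccesses >= probesToClose) {
                window.clear();
                state = State.CLOSED;
            }
        } else if (state == State.CLOSED) {
            record(false);
        }
    }

    void recordFailure() {
        if (state == State.HALF_OPEN) {
            trip(); // a failed probe sends the breaker straight back to OPEN
        } else if (state == State.CLOSED) {
            record(true);
        }
    }

    private void record(boolean failure) {
        window.addLast(failure);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() == windowSize) {
            long failures = window.stream().filter(Boolean::booleanValue).count();
            if (failures * 100.0 / windowSize >= failureRateThreshold) {
                trip();
            }
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
    }
}
```

The failure rate is only evaluated once the window is full, which plays roughly the role that minimum-number-of-calls plays in the real configuration.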

The Framework’s ResilientServiceInvoker

Rather than sprinkling Resilience4j decorators at every call site, we centralized everything into ResilientServiceInvoker:

public class ResilientServiceInvoker implements ResilientOperations {

    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RetryRegistry retryRegistry;
    private final ResilienceProperties properties;

    public <T> T execute(final String name, final Supplier<T> operation) {
        if (!properties.isEnabled()) {
            return operation.get();
        }

        final CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(name);
        final Retry retry = retryRegistry.retry(name);

        final Supplier<T> decoratedSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
                Retry.decorateSupplier(retry, operation));

        try {
            return decoratedSupplier.get();
        } catch (CallNotPermittedException e) {
            logger.warn("Circuit breaker '{}' is OPEN — rejecting call", name);
            throw new ResilienceException(
                    "Circuit breaker '" + name + "' is open, call rejected", name, e);
        } catch (Exception e) {
            logger.error("Operation '{}' failed after retries: {}", name, e.getMessage());
            throw new ResilienceException(
                    "Operation '" + name + "' failed after retries", name, e);
        }
    }
}

A few things to notice here. Each call to execute("inventory-stock-reservation", ...) creates or retrieves a circuit breaker and retry instance with that name. This means each saga step gets its own independent circuit breaker, so a payment failure won’t trip the inventory breaker.

The decoration order matters: retry wraps the operation first, then the circuit breaker wraps the retry. So the circuit breaker sees the final outcome after all retries are exhausted. A transient failure that succeeds on the second attempt counts as a success for the circuit breaker. If you stacked them the other way around, every individual failed attempt would register as a circuit breaker failure, and you’d trip the breaker much faster than you intended.
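The effect of the ordering can be demonstrated without Resilience4j at all. In this sketch, Recorder plays the circuit breaker's role by counting the outcomes it observes, and withRetry is a bare-bones retry with no backoff. Both are hypothetical plain-Java stand-ins, chosen so the example runs with nothing but the JDK.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Plain-Java stand-ins for the two decorators, to show what the outer layer observes.
final class DecorationOrder {

    private DecorationOrder() {
    }

    // Counts the outcomes seen by whatever plays the circuit breaker's role.
    static final class Recorder {
        final AtomicInteger successes = new AtomicInteger();
        final AtomicInteger failures = new AtomicInteger();

        <T> Supplier<T> decorate(Supplier<T> inner) {
            return () -> {
                try {
                    T result = inner.get();
                    successes.incrementAndGet();
                    return result;
                } catch (RuntimeException e) {
                    failures.incrementAndGet();
                    throw e;
                }
            };
        }
    }

    // Retries immediately, up to maxAttempts calls in total (no backoff in this sketch).
    static <T> Supplier<T> withRetry(int maxAttempts, Supplier<T> inner) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    // Fails on the first call and succeeds afterwards, like a transient glitch.
    static Supplier<String> failsOnce() {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("transient");
            }
            return "ok";
        };
    }
}
```

Wrapping as recorder.decorate(withRetry(3, failsOnce())) leaves the recorder with one success and no failures. Flipping it to withRetry(3, recorder.decorate(failsOnce())) leaves it with one failure and one success, which is the faster-tripping behavior the paragraph above warns about.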

And there’s a kill switch. When framework.resilience.enabled=false, the execute method just calls the operation directly. Zero overhead. This matters for testing and for environments where resilience is handled at a different layer — a service mesh, maybe, or a cloud provider’s load balancer.

The ResilientOperations Interface

We extract an interface from the concrete class:

public interface ResilientOperations {
    <T> T execute(String name, Supplier<T> operation);
    void executeRunnable(String name, Runnable operation);
    <T> CompletableFuture<T> executeAsync(String name, Supplier<CompletableFuture<T>> operation);
}

This is the same workaround we used for ServiceClientOperations in Part 6. On Java 25, Mockito’s inline mock maker can’t mock concrete classes in certain JVM configurations, so you extract an interface and mock that instead. Not the most glamorous reason to create an abstraction, but it works.

Three Flavors

The invoker supports three calling patterns:

// Synchronous — returns a value
String result = invoker.execute("orderSaga", () -> processEvent(event));

// Fire-and-forget — void operation
invoker.executeRunnable("paymentListener", () -> publishToTopic(event));

// Async — returns CompletableFuture
CompletableFuture<Product> future = invoker.executeAsync("inventory-stock-reservation",
        () -> inventoryService.reserveStockForSaga(productId, quantity, ...));

The async variant is the one our saga listeners actually use — inventory, payment, and order service calls all return CompletableFuture.

Wiring into the Saga Listeners

The saga listeners from Part 4 now inject ResilientOperations as an optional dependency:

@Component
public class InventorySagaListener {

    private final ProductService inventoryService;
    private final HazelcastInstance hazelcast;
    private ResilientOperations resilientServiceInvoker;

    @Autowired(required = false)
    public void setResilientOperations(ResilientOperations resilientServiceInvoker) {
        this.resilientServiceInvoker = resilientServiceInvoker;
    }

That @Autowired(required = false) is doing important work. If resilience is disabled — or if the Resilience4j dependency isn’t even on the classpath — the listener still functions. It just calls the service directly, no wrapping. The saga worked before we added resilience; it should keep working without it.

Each listener has a helper that handles the null check:

private <T> CompletableFuture<T> executeWithResilience(
        final String name, final Supplier<CompletableFuture<T>> operation) {
    if (resilientServiceInvoker != null) {
        return resilientServiceInvoker.executeAsync(name, operation);
    }
    return operation.get();
}

And the actual saga step looks like this:

executeWithResilience("inventory-stock-reservation",
        () -> inventoryService.reserveStockForSaga(
                productId, quantity, orderId, sagaId, correlationId,
                customerId, total, currency, "CREDIT_CARD"
        )
).whenComplete((product, error) -> {
    if (error != null) {
        sendToDeadLetterQueue(record, "OrderCreated", error);
    } else {
        logger.info("Stock reserved for saga: productId={}, quantity={}, orderId={}, sagaId={}",
                productId, quantity, orderId, sagaId);
    }
});

The circuit breaker name inventory-stock-reservation is specific to this saga step. Each step across the three services gets its own name and its own circuit breaker:

Circuit Breaker Name	Saga Step	Service
inventory-stock-reservation	Reserve stock on OrderCreated	Inventory
inventory-stock-release	Release stock on compensation	Inventory
payment-processing	Process payment on StockReserved	Payment
payment-refund	Refund payment on compensation	Payment
order-confirmation	Confirm order on PaymentProcessed	Order
order-cancellation	Cancel order on compensation	Order

Six independent circuit breakers. If payment processing is struggling, the inventory breakers stay closed and keep doing their job.

Retry with Exponential Backoff

Transient failures — network blips, temporary overload, brief GC pauses — are the most common failure mode in distributed systems. Most of them resolve on their own within seconds. Retry is the first line of defense.

The Thundering Herd

But naive retry — retry immediately, same interval, keep hammering — can make things actively worse. Picture this: a service buckles under load, and 100 clients all get errors simultaneously. They all retry at 500ms. The service sees a spike of 100 simultaneous requests. It fails again. They all retry at 1000ms. Another spike. Same result.

This is the thundering herd problem. Everyone backs off at the same fixed interval, and everyone comes stampeding back at the same moment. The retry mechanism that was supposed to help is the thing keeping the service down.

Exponential backoff breaks the herd apart:

Attempt 1: immediate
Attempt 2: wait 500ms
Attempt 3: wait 1000ms  (500ms × 2.0)
Attempt 4: wait 2000ms  (1000ms × 2.0)

The growing intervals give the struggling service breathing room. And because different callers started their retry sequences at slightly different moments, the backoff naturally staggers the waves. Each one arrives smaller and more spread out than the last. The herd thins itself out.
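The schedule above is just repeated multiplication, and it can be computed directly. BackoffSchedule below is a hypothetical helper for illustration, not something the framework or Resilience4j exposes. It assumes, as in the list above, that the first attempt is immediate and that max-attempts counts that first attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the waits an exponential backoff would produce.
final class BackoffSchedule {

    private BackoffSchedule() {
    }

    // Waits, in milliseconds, before attempts 2..maxAttempts (attempt 1 is immediate).
    static List<Long> waitsBeforeRetries(long initialWaitMillis, double multiplier, int maxAttempts) {
        List<Long> waits = new ArrayList<>();
        double wait = initialWaitMillis;
        for (int attempt = 2; attempt <= maxAttempts; attempt++) {
            waits.add(Math.round(wait));
            wait *= multiplier;
        }
        return waits;
    }
}
```

Calling waitsBeforeRetries(500, 2.0, 4) gives 500, 1000, and 2000 milliseconds, matching the list above. Callers that fail at exactly the same instant would still follow identical schedules; breaking those apart takes some added randomness (jitter), which this sketch leaves out.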

Configuration

The framework exposes all of this through ResilienceProperties:

framework:
  resilience:
    enabled: true
    retry:
      max-attempts: 3
      wait-duration: 500ms
      enable-exponential-backoff: true
      exponential-backoff-multiplier: 2.0

The auto-configuration translates these into a Resilience4j RetryConfig:

@Bean
@ConditionalOnMissingBean
public RetryRegistry retryRegistry(final ResilienceProperties properties) {
    final ResilienceProperties.RetryProperties retryProps = properties.getRetry();

    final RetryConfig.Builder<?> builder = RetryConfig.custom()
            .maxAttempts(retryProps.getMaxAttempts())
            .retryOnException(e -> !(e instanceof NonRetryableException));

    if (retryProps.isEnableExponentialBackoff()) {
        builder.intervalFunction(IntervalFunction
                .ofExponentialBackoff(
                        retryProps.getWaitDuration(),
                        retryProps.getExponentialBackoffMultiplier()));
    } else {
        builder.waitDuration(retryProps.getWaitDuration());
    }

    return RetryRegistry.of(builder.build());
}

Two things to note. The retryOnException predicate excludes NonRetryableException — we’ll get to that in a moment. And when enable-exponential-backoff is false, it falls back to a fixed interval between attempts.

NonRetryableException: When to Stop Trying

Not every failure is transient. “Payment declined” will never succeed on retry — the credit card is invalid. “Insufficient stock” is deterministic — the warehouse genuinely doesn’t have the product. Retrying these wastes time, wastes resources, and — if the circuit breaker is counting — burns through your failure budget for no reason.

The framework defines a marker interface:

public interface NonRetryableException {
    // Marker interface — business exceptions implement this to skip retry
}

Service exceptions opt in:

public class InsufficientStockException extends RuntimeException
        implements NonRetryableException {
    public InsufficientStockException(String message) {
        super(message);
    }
}

public class PaymentDeclinedException extends RuntimeException
        implements NonRetryableException {
    public PaymentDeclinedException(String message) {
        super(message);
    }
}

Why a marker interface instead of a base class? Because these exceptions already extend RuntimeException. A Java class can extend only one superclass, but it can implement any number of interfaces. The marker lets any exception opt out of retry without changing its class hierarchy.

The retry configuration’s predicate is one line:

.retryOnException(e -> !(e instanceof NonRetryableException))

When retry encounters one of these, it fails immediately. No backoff, no additional attempts. But the circuit breaker still records it as a failure — it still counts toward the failure rate threshold. This is the right behavior. If a service is returning “payment declined” for every single request, something is systematically wrong, and the circuit breaker should trip.
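To see the predicate at work without pulling in Resilience4j, here is a stripped-down retry loop. The marker interface and exception mirror the ones above, redeclared as nested types so the example is self-contained. The loop itself (RetryPredicateSketch.callWithRetry) is a simplified stand-in with no backoff, not the library's Retry.

```java
import java.util.function.Supplier;

// Self-contained model of the retry predicate; the loop is a simplified stand-in.
final class RetryPredicateSketch {

    interface NonRetryableException {
        // Marker only
    }

    static final class PaymentDeclinedException extends RuntimeException
            implements NonRetryableException {
        PaymentDeclinedException(String message) {
            super(message);
        }
    }

    private int attemptsMade;

    // Exposed only so the example's behavior can be checked.
    int attemptsMade() {
        return attemptsMade;
    }

    <T> T callWithRetry(int maxAttempts, Supplier<T> operation) {
        attemptsMade = 0;
        while (true) {
            attemptsMade++;
            try {
                return operation.get();
            } catch (RuntimeException e) {
                // Same test as the retryOnException predicate in the configuration.
                boolean retryable = !(e instanceof NonRetryableException);
                if (!retryable || attemptsMade >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

With maxAttempts set to 3, an operation that throws PaymentDeclinedException is called once, while one that throws an ordinary IllegalStateException is called three times before the exception escapes.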

Retry Observability

Resilience4j publishes events for every retry attempt, and the framework hooks into them for structured logging and a custom metric:

public class RetryEventListener {

    public RetryEventListener(final RetryRegistry retryRegistry,
                              final MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        retryRegistry.getAllRetries().forEach(this::registerListeners);
        retryRegistry.getEventPublisher().onEntryAdded(
                event -> registerListeners(event.getAddedEntry()));
    }

    private void registerListeners(final Retry retry) {
        final var eventPublisher = retry.getEventPublisher();
        eventPublisher.onRetry(this::onRetry);
        eventPublisher.onSuccess(this::onSuccess);
        eventPublisher.onError(this::onError);
        eventPublisher.onIgnoredError(this::onIgnoredError);
    }
}

Four event types give you the full picture:

Event	Log Level	What happened
onRetry	WARN	An attempt failed, trying again
onSuccess	INFO	Eventually succeeded
onError	ERROR	All retries exhausted
onIgnoredError	INFO	Non-retryable, skipped retry

That last one — onIgnoredError — needed a custom Micrometer counter because Resilience4j’s built-in TaggedRetryMetrics doesn’t track ignored errors:

private void onIgnoredError(final RetryOnIgnoredErrorEvent event) {
    logger.info("Non-retryable exception for '{}', skipping retry: {}",
            event.getName(), event.getLastThrowable().getMessage());

    Counter.builder("framework.resilience.retry.ignored")
            .description("Count of non-retryable exceptions that skipped retry")
            .tag("name", event.getName())
            .register(meterRegistry)
            .increment();
}

In practice, the logs tell you a clear story. A transient failure that recovers:

WARN  RetryEventListener - Retry attempt #1 for 'payment-processing': Connection refused
WARN  RetryEventListener - Retry attempt #2 for 'payment-processing': Connection refused
INFO  RetryEventListener - 'payment-processing' succeeded after 2 attempt(s)

A business exception that gets kicked straight to the dead letter queue:

INFO  RetryEventListener - Non-retryable exception for 'payment-processing',
      skipping retry: Insufficient funds for amount 15000.00

The ResilienceException Wrapper

When an operation exhausts all retries or gets rejected by an open circuit breaker, the framework wraps the failure in a ResilienceException:

public class ResilienceException extends RuntimeException {

    private final String operationName;

    public ResilienceException(String message, String operationName, Throwable cause) {
        super(message, cause);
        this.operationName = operationName;
    }
}

The operationName field tells downstream handlers which named operation failed, and with it which circuit breaker and retry instance were involved. The dead letter queue integration (Part 9) uses this to classify failures:

if (error instanceof ResilienceException) {
    logger.warn("Circuit breaker open, saga step deferred: eventId={}", eventId);
} else {
    logger.error("Failed to process event: {}", eventId, error);
}

Auto-Configuration

The whole resilience stack is wired through a single auto-configuration class:

@Configuration
@ConditionalOnClass(CircuitBreakerRegistry.class)
@ConditionalOnProperty(name = "framework.resilience.enabled", matchIfMissing = true)
@EnableConfigurationProperties(ResilienceProperties.class)
public class ResilienceAutoConfiguration {

    @Bean @ConditionalOnMissingBean
    public CircuitBreakerRegistry circuitBreakerRegistry(ResilienceProperties properties) { ... }

    @Bean @ConditionalOnMissingBean
    public RetryRegistry retryRegistry(ResilienceProperties properties) { ... }

    @Bean @ConditionalOnMissingBean
    public ResilientServiceInvoker resilientServiceInvoker(...) { ... }

    @Bean @ConditionalOnMissingBean(TaggedCircuitBreakerMetrics.class)
    public TaggedCircuitBreakerMetrics taggedCircuitBreakerMetrics(...) { ... }

    @Bean @ConditionalOnMissingBean(TaggedRetryMetrics.class)
    public TaggedRetryMetrics taggedRetryMetrics(...) { ... }

    @Bean @ConditionalOnMissingBean
    public RetryEventListener retryEventListener(...) { ... }
}

Three conditionals control activation. @ConditionalOnClass(CircuitBreakerRegistry.class) means the whole thing only activates when Resilience4j is on the classpath — services that don’t include the dependency don’t get any resilience beans. @ConditionalOnProperty(…, matchIfMissing = true) means it’s enabled by default; set framework.resilience.enabled=false to turn it off. And every individual bean is @ConditionalOnMissingBean, so the application can override any piece by defining its own bean.

Six beans total:

CircuitBreakerRegistry — circuit breaker instances, configured from properties
RetryRegistry — retry instances with optional exponential backoff
ResilientServiceInvoker — the decorator that wraps operations
TaggedCircuitBreakerMetrics — binds circuit breaker metrics to Micrometer
TaggedRetryMetrics — binds retry metrics to Micrometer
RetryEventListener — structured logging and the custom ignored-error counter

Per-Instance Tuning

Different saga steps have different tolerance for failure. Stock reservation should be fast and reliable — if it’s failing, something is seriously wrong, and we want the circuit to trip quickly. Payment processing, on the other hand… payment providers are notoriously flaky. You’d rather tolerate a higher failure rate and give the provider more time to sort itself out before you start rejecting everything.

The framework supports per-instance overrides in each service’s application.yml:

framework:
  resilience:
    enabled: true
    circuit-breaker:
      failure-rate-threshold: 50
      wait-duration-in-open-state: 10s
      sliding-window-size: 10
      minimum-number-of-calls: 5
      permitted-number-of-calls-in-half-open-state: 3
    retry:
      max-attempts: 3
      wait-duration: 500ms
      enable-exponential-backoff: true
      exponential-backoff-multiplier: 2.0
    instances:
      inventory-stock-reservation:
        circuit-breaker:
          failure-rate-threshold: 40
          wait-duration-in-open-state: 5s
        retry:
          max-attempts: 2
      payment-processing:
        circuit-breaker:
          failure-rate-threshold: 60
          wait-duration-in-open-state: 15s
        retry:
          max-attempts: 5
          wait-duration: 1s

The instances map lets any named circuit breaker override the defaults:

public CircuitBreakerProperties getCircuitBreakerForInstance(final String name) {
    final InstanceProperties instance = instances.get(name);
    if (instance != null && instance.getCircuitBreaker() != null) {
        return instance.getCircuitBreaker();
    }
    return circuitBreaker; // Fall back to defaults
}

So in this configuration, inventory-stock-reservation trips at a 40% failure rate with a 5-second open state and only 2 attempts (one retry): stock checks are idempotent and fast, so there is no point dragging things out. payment-processing tolerates a 60% failure rate with a 15-second open state and 5 attempts, with waits starting at 1 second. Because max-attempts includes the initial call, exponential backoff (assuming the default 2.0 multiplier applies) gives waits of 1, 2, 4, and 8 seconds, roughly 15 seconds of backoff in total before the final attempt. Payment providers get the patience they’ve trained us to give them.
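The backoff arithmetic is easy to get wrong by one step, so here is a small sketch that lists the waits between attempts. It is not framework code: the BackoffSchedule class and its waits method are hypothetical names, and it assumes that max-attempts counts the initial call, as the configuration reference states.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the framework: lists the waits between attempts.
final class BackoffSchedule {

    static List<Duration> waits(int maxAttempts, Duration base, double multiplier) {
        List<Duration> waits = new ArrayList<>();
        double millis = base.toMillis();
        // Assumption: maxAttempts includes the initial call, so there are maxAttempts - 1 waits.
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(Duration.ofMillis(Math.round(millis)));
            millis *= multiplier; // each wait grows by the backoff multiplier
        }
        return waits;
    }

    public static void main(String[] args) {
        // payment-processing: max-attempts 5, wait-duration 1s, multiplier 2.0
        System.out.println(waits(5, Duration.ofSeconds(1), 2.0));
    }
}
```

With the payment-processing settings this yields four waits of 1, 2, 4, and 8 seconds. If max-attempts were instead read as five retries after the initial call, a fifth wait of 16 seconds would follow.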

Metrics and Monitoring

The auto-configuration binds circuit breaker and retry metrics to Micrometer, which exports to Prometheus for Grafana dashboards:

Circuit Breaker Metrics

Metric	Type	Description
resilience4j_circuitbreaker_state	Gauge	One series per state tag (closed, open, half_open, ...); 1 for the breaker’s current state, 0 otherwise
resilience4j_circuitbreaker_calls_total	Counter	Total calls by outcome (successful, failed, not_permitted)
resilience4j_circuitbreaker_failure_rate	Gauge	Current failure rate percentage
resilience4j_circuitbreaker_buffered_calls	Gauge	Calls in sliding window

Retry Metrics

Metric	Type	Description
resilience4j_retry_calls_total	Counter	Total calls by outcome (successful_without_retry, successful_with_retry, failed_with_retry, failed_without_retry)
framework.resilience.retry.ignored	Counter	Non-retryable exceptions (tagged by name)

These feed into Grafana panels for saga health — circuit breaker state timeline showing when breakers trip and recover, retry rate over time where a spike tells you something transient is happening, failure rate broken out by saga step so you can see which one is misbehaving, and the non-retryable exception count that separates business logic failures from infrastructure problems.

Configuration Reference

framework.resilience.*

Property	Default	Description
enabled	true	Master toggle for all resilience features
circuit-breaker.failure-rate-threshold	50	Failure rate (%) to trip the breaker
circuit-breaker.wait-duration-in-open-state	10s	How long to stay open before testing
circuit-breaker.sliding-window-size	10	Number of calls in the measurement window
circuit-breaker.sliding-window-type	COUNT_BASED	COUNT_BASED or TIME_BASED
circuit-breaker.minimum-number-of-calls	5	Minimum calls before evaluating failure rate
circuit-breaker.permitted-number-of-calls-in-half-open-state	3	Test calls in half-open state
retry.max-attempts	3	Maximum retry attempts (including initial)
retry.wait-duration	500ms	Base wait between retries
retry.enable-exponential-backoff	true	Use exponential backoff
retry.exponential-backoff-multiplier	2.0	Backoff multiplier
instances.<name>.circuit-breaker.*	(defaults)	Per-instance circuit breaker overrides
instances.<name>.retry.*	(defaults)	Per-instance retry overrides
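To make the circuit-breaker properties above concrete, here is a deliberately simplified count-based window. It is an illustration of how sliding-window-size, minimum-number-of-calls, and failure-rate-threshold interact, not Resilience4j's actual implementation; the class and method names are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: a simplified count-based window, not Resilience4j's implementation.
final class CountBasedWindowSketch {
    private final int windowSize;       // sliding-window-size
    private final int minimumCalls;     // minimum-number-of-calls
    private final double thresholdPct;  // failure-rate-threshold
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures;

    CountBasedWindowSketch(int windowSize, int minimumCalls, double thresholdPct) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.thresholdPct = thresholdPct;
    }

    void record(boolean failed) {
        // Once the window is full, evict the oldest outcome before adding the new one.
        if (outcomes.size() == windowSize && outcomes.removeFirst()) {
            failures--;
        }
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
    }

    double failureRatePct() {
        return outcomes.isEmpty() ? 0.0 : 100.0 * failures / outcomes.size();
    }

    // The breaker should open only after enough calls have been seen
    // and the failure rate has reached the threshold.
    boolean shouldOpen() {
        return outcomes.size() >= minimumCalls && failureRatePct() >= thresholdPct;
    }
}
```

With the defaults (window 10, minimum 5, threshold 50), two failures in a row do nothing because fewer than five calls have been recorded. Three failures out of six calls reach 50% and would trip the breaker.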

What’s Next

Circuit breakers and retry handle one category of failure: transient problems during event consumption. The saga listener tries, the call fails, the retry policy kicks in, the circuit breaker keeps the damage from spreading. That covers the consumer side.

But what about the producer side? When EventSourcingController needs to republish an event to the shared cluster and the cluster is temporarily unreachable, the event just… vanishes. No retry. No circuit breaker. Gone.

That’s a different failure mode, and it needs a different mechanism. In Part 8, we add the transactional outbox pattern — a durable buffer between event production and cross-cluster delivery that guarantees no events are lost, even when the shared cluster is down. Then Part 9 closes the loop with dead letter queues and idempotency guards for events that exhaust all retries or arrive more than once.

Next up: The Transactional Outbox Pattern with Hazelcast

Previous: MCP Server for Microservices: AI-Powered Debugging

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 15, 2026

MCP Server for Microservices: AI-Powered Debugging

Part 6 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

Over the first five articles, we built an event sourcing framework, a Jet pipeline, materialized views, a choreographed saga pattern, and vector similarity search. That’s a lot of infrastructure. It also means that investigating a problem — say, a failed saga — involves chaining together five or six curl commands across four different services, reading JSON output with your eyes, extracting IDs by hand, and constructing the next request.

Which is fine. It’s what we’ve always done. But there’s a better option now.

The Model Context Protocol (MCP) is an open standard that lets AI assistants — Claude, ChatGPT, Copilot, whoever — call tools exposed by external servers. Instead of the assistant guessing at curl commands or asking you to copy-paste output, it directly queries your materialized views, submits events, inspects saga state, and runs demo scenarios.

In this article, we build an MCP server that bridges AI assistants to our eCommerce microservices. And yes, there is something a little meta about using Claude to build a framework and then building a bridge so Claude can operate the framework. We’re going with it.

Why Give an AI Access to Your Microservices?

Consider a typical debugging session. A saga has failed, and you want to know why:

# Step 1: Find failed sagas
curl "http://localhost:8083/api/sagas?status=FAILED"   # quoted so the shell doesn't glob the '?'

# Step 2: Copy a saga ID from the JSON output
curl http://localhost:8083/api/sagas/saga-a7f3e2

# Step 3: Check the order that triggered it
curl http://localhost:8083/api/orders/ord-12345

# Step 4: Check the event history
curl http://localhost:8083/api/orders/ord-12345/events

# Step 5: Check if stock was released as part of compensation
curl http://localhost:8082/api/products/prod-67890

Five commands. Each one requires reading JSON output, finding the right ID, and constructing the next request. You’re doing the orchestration in your head, and — let’s be honest — that’s exactly the kind of tedious mechanical chaining that humans are bad at and computers are good at.

With MCP, the same investigation is a single sentence:

“Why did the most recent saga fail?”

The AI calls list_sagas(status="FAILED"), then inspect_saga(sagaId="saga-a7f3e2"), then get_event_history(aggregateId="ord-12345", aggregateType="Order"), interprets all the responses, and gives you a summary:

“Saga saga-a7f3e2 failed at the payment step. Order ORD-12345 had a total of $15,000 which exceeded the $10,000 payment limit. Compensation ran successfully — stock for product PROD-67890 was released.”

Three tool calls, zero curl commands, and a root-cause analysis. From one question.

What Is MCP?

MCP (Model Context Protocol) is an open specification by Anthropic that defines a standard interface between AI assistants and external tools. Think of it as a contract:

MCP protocol sequence: the AI assistant sends tools/list and tools/call to the MCP server, which returns tool definitions and JSON results over JSON-RPC

The protocol uses JSON-RPC 2.0 over one of two transports:

Transport	How It Works	Best For
stdio	AI assistant launches the server as a subprocess; communicates via stdin/stdout	Local development with Claude Code or Claude Desktop
SSE (HTTP)	Server runs as a web service; AI connects over HTTP with Server-Sent Events	Docker, remote deployment, multi-user

The AI assistant doesn’t need to know anything about Hazelcast, Jet pipelines, or event sourcing. It sees ten tools with descriptions and parameters. The MCP server handles the translation between “query the customer view” and “GET http://account-service:8081/api/customers.”
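For a feel of what actually crosses the wire, here is a sketch of the JSON-RPC 2.0 message a client sends for tools/call. The envelope fields (jsonrpc, id, method, params.name, params.arguments) follow the MCP specification; the ToolsCallExample class is purely illustrative, and in practice the MCP client library builds this message for you.

```java
// Illustrative only: builds the kind of JSON-RPC 2.0 message an MCP client sends for tools/call.
final class ToolsCallExample {

    // argumentsJson is expected to be an already-serialized JSON object.
    static String toolsCall(int id, String toolName, String argumentsJson) {
        return """
                {"jsonrpc":"2.0","id":%d,"method":"tools/call","params":{"name":"%s","arguments":%s}}"""
                .formatted(id, toolName, argumentsJson);
    }

    public static void main(String[] args) {
        // Ask the query_view tool for a single customer
        System.out.println(toolsCall(1, "query_view",
                "{\"viewName\":\"customer\",\"key\":\"cust-123\"}"));
    }
}
```

The server answers with a JSON-RPC response carrying the same id, which is how the client matches results to requests.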

Designing Tools Around Event Sourcing

The hardest part of building an MCP server isn’t the protocol — it’s deciding what tools to expose. Too many and the AI gets confused about which one to use. Too few and it can’t do useful work. We went back and forth on this and started with seven, organized around the three concerns of an event-sourced system. Three more got added later for dead letter queue recovery, which we’ll get to in a moment.

Queries (Read Current State)

Tool	What It Does
query_view	Read materialized views — current state of customers, products, orders, payments
get_event_history	Read the event log — how an entity reached its current state

These map to the read side of CQRS. Views give you the “what,” event history gives you the “why.”

Commands (Produce New Events)

Tool	What It Does
submit_event	Create customers, products, orders; cancel orders; reserve stock; process payments; refund payments
run_demo	Execute multi-step scenarios (happy path, payment failure, saga timeout, sample data)

Each command produces domain events that flow through the Jet pipeline. run_demo chains multiple commands together to set up investigation targets — a failed payment saga, a timeout scenario, a happy path to compare against.

Observability (Inspect the System)

Tool	What It Does
inspect_saga	View a saga’s status, steps completed, timing, and failure reason
list_sagas	Browse sagas filtered by status
get_metrics	Aggregated system metrics — saga counts, event throughput, active gauges

Dead Letter Queue (Investigate and Replay Failures)

Tool	What It Does
list_dlq_entries	List failed events that landed in the dead letter queue, with a pending-count summary for quick triage
inspect_dlq_entry	View a single DLQ entry: event data, failure reason, saga context, replay count
replay_dlq_entry	Republish a DLQ entry’s event for reprocessing — after the cause is fixed

We hadn’t built the DLQ machinery yet when the MCP server first shipped, so these three were added later. The investigation workflow — list, inspect, then decide to replay or not — turned out to map cleanly onto how a human operator works through a queue of failed events. Asking the AI to walk that with you, one entry at a time, is dramatically less tedious than the curl version.

Ten tools, four categories, no overlap. The AI handles any reasonable question about the system, and tool selection stays reliable — you’d never call get_metrics when you meant query_view, or list_dlq_entries when you meant list_sagas. The shape of the tool decides which question it answers.

Architecture: A Pure REST Proxy

The MCP server sits between the AI assistant and the microservices:

MCP server architecture: an AI assistant connects via the MCP protocol to a Spring Boot MCP server on port 8085, which proxies REST calls to the Account, Inventory, Order, and Payment services

We made a deliberate choice here: the MCP server has no Hazelcast dependency. It doesn’t join any cluster, doesn’t read IMaps, doesn’t run Jet jobs. It’s a thin REST proxy that translates MCP tool calls into HTTP requests against the existing service APIs.

Why go to the trouble of keeping them separate? Because coupling the MCP server to Hazelcast would mean classpath conflicts with the services, a dependency on the data layer that makes testing painful, and another component that needs Hazelcast configuration. As a pure proxy, the server needs maybe 128-256 MB of heap, has no classpath conflicts, and you can test every tool by mocking REST responses without running a single service.

Implementation

The ServiceClient

All HTTP communication goes through one class:

@Component
public class ServiceClient implements ServiceClientOperations {

    private final McpServerProperties properties;
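    private final RestClient restClient;

    // Constructor injection and the remaining operations (listEntities,
    // createEntity, performAction) are omitted here for brevity.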

    public Map<String, Object> getEntity(String viewName, String id) {
        String url = resolveUrl(viewName) + "/" + id;
        String json = restClient.get().uri(url).retrieve().body(String.class);
        return parseMap(json);
    }

    String resolveUrl(String viewName) {
        return switch (viewName.toLowerCase()) {
            case "customer" -> properties.getAccountUrl() + "/api/customers";
            case "product"  -> properties.getInventoryUrl() + "/api/products";
            case "order"    -> properties.getOrderUrl() + "/api/orders";
            case "payment"  -> properties.getPaymentUrl() + "/api/payments";
            default -> throw new IllegalArgumentException("Unknown view: " + viewName);
        };
    }
}

That resolveUrl switch is the only place that knows which service owns which view. Every tool delegates to ServiceClient rather than making HTTP calls directly.

The ServiceClientOperations interface exists because Mockito’s inline mock maker on Java 25 cannot mock concrete classes. We hit this wall across the framework — the solution every time was to extract an interface so tests can mock it. It’s a slightly annoying pattern, but it works.

A Tool Implementation

Each tool is a Spring @Service with a @Tool-annotated method. Here’s QueryViewTool:

@Service
public class QueryViewTool {

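    private final ServiceClientOperations serviceClient;

    // Constructor injection keeps the tool easy to unit-test with a mocked client
    public QueryViewTool(ServiceClientOperations serviceClient) {
        this.serviceClient = serviceClient;
    }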

    @Tool(description = "Query a materialized view. "
            + "Available views: customer, product, order, payment. "
            + "Provide a key to get a specific entity, or omit to list entities.")
    public String queryView(
            @ToolParam(description = "View to query: customer, product, order, or payment")
            String viewName,
            @ToolParam(description = "Optional: specific entity ID", required = false)
            String key,
            @ToolParam(description = "Max results when listing (default: 10)", required = false)
            Integer limit) {

        if (key != null && !key.isBlank()) {
            return toJson(serviceClient.getEntity(viewName, key));
        } else {
            int effectiveLimit = (limit != null && limit > 0) ? limit : 10;
            List<Map<String, Object>> results = serviceClient.listEntities(viewName, effectiveLimit);
            return toJson(Map.of(
                    "view", viewName,
                    "count", results.size(),
                    "entities", results
            ));
        }
    }
}

That @Tool description is doing real work. The AI reads it to decide which tool to call and what parameters to provide. If you’re vague — “query data” instead of “Query a materialized view. Available views: customer, product, order, payment” — the AI picks the wrong tool or provides wrong parameters. We learned this the hard way. Be specific. Name the available views. Explain what happens with versus without a key.

The optional parameters with defaults matter too. When the AI omits key, the tool lists entities. When it omits limit, you get 10. This lets a single tool handle “show me all customers” and “look up customer cust-123” without the AI needing to figure out everything every time.

Tool Registration

All ten tools get registered in one place:

@Configuration
public class McpToolConfig {

    @Bean
    public ToolCallbackProvider mcpTools(QueryViewTool queryView,
                                         SubmitEventTool submitEvent,
                                         GetEventHistoryTool getEventHistory,
                                         InspectSagaTool inspectSaga,
                                         ListSagasTool listSagas,
                                         GetMetricsTool getMetrics,
                                         RunDemoTool runDemo,
                                         ListDlqEntriesTool listDlqEntries,
                                         InspectDlqEntryTool inspectDlqEntry,
                                         ReplayDlqEntryTool replayDlqEntry) {
        return MethodToolCallbackProvider.builder()
                .toolObjects(queryView, submitEvent, getEventHistory,
                        inspectSaga, listSagas, getMetrics, runDemo,
                        listDlqEntries, inspectDlqEntry, replayDlqEntry)
                .build();
    }
}

Spring AI’s MethodToolCallbackProvider scans each object for @Tool methods and registers them with the MCP server. When the AI calls tools/list, it gets back all ten tool definitions with their descriptions and parameter schemas.

The Event Dispatch Pattern

SubmitEventTool deserves a closer look because it maps a single tool to seven different service endpoints:

Map<String, Object> dispatch(String eventType, Map<String, Object> payload) {
    return switch (eventType) {
        case "CreateCustomer"  -> serviceClient.createEntity("customer", payload);
        case "CreateProduct"   -> serviceClient.createEntity("product", payload);
        case "CreateOrder"     -> serviceClient.createEntity("order", payload);
        case "CancelOrder"     -> {
            String orderId = requireField(payload, "orderId");
            yield serviceClient.performAction("order", orderId, "cancel", payload, true);
        }
        case "ReserveStock"    -> {
            String productId = requireField(payload, "productId");
            yield serviceClient.performAction("product", productId, "stock/reserve", payload, false);
        }
        case "ProcessPayment"  -> serviceClient.createEntity("payment", payload);
        case "RefundPayment"   -> {
            String paymentId = requireField(payload, "paymentId");
            yield serviceClient.performAction("payment", paymentId, "refund", payload, false);
        }
        default -> throw new IllegalArgumentException("Unknown event type: " + eventType);
    };
}

The alternative would be seven separate tools: create_customer, create_product, and so on. We went with a single submit_event tool with an eventType discriminator because it mirrors the event sourcing model (the system is event-driven, the tool should feel event-driven), it keeps the total tool count at ten instead of sixteen, and the AI handles the dispatch naturally. When you say “create a customer named Alice,” it maps that to eventType="CreateCustomer" without difficulty.
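The dispatch code leans on a requireField helper that isn't shown above. Here is a minimal sketch of what such a helper might look like. This is an assumption about its behavior, not the framework's code: the PayloadFields class name and the error message are invented for illustration.

```java
import java.util.Map;

// Hypothetical sketch: the article does not show requireField, so this is one plausible shape.
final class PayloadFields {

    // Returns the field as a String, or fails fast with a message the AI can act on.
    static String requireField(Map<String, Object> payload, String field) {
        Object value = payload.get(field);
        if (value == null || value.toString().isBlank()) {
            throw new IllegalArgumentException("Missing required field: " + field);
        }
        return value.toString();
    }
}
```

Failing with a message that names the missing field matters more than usual here, because the caller is an AI that reads the error and can retry with a corrected payload.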

The Demo Tool

RunDemoTool is the most complex tool because each scenario chains multiple service calls:

private Map<String, Object> runHappyPath() {
    // Step 1: Create customer
    Map<String, Object> customer = serviceClient.createEntity("customer", Map.of(
            "name", "Demo Customer",
            "email", "demo-" + shortId() + "@example.com",
            "address", "123 Demo Street"
    ));

    // Step 2: Create product
    Map<String, Object> product = serviceClient.createEntity("product", Map.of(
            "sku", "DEMO-" + shortId(),
            "name", "Demo Widget",
            "price", "29.99",
            "quantityOnHand", 100
    ));

    // Step 3: Create order (uses IDs from previous steps)
    String customerId = extractId(customer, "customerId");
    String productId = extractId(product, "productId");
    Map<String, Object> order = serviceClient.createEntity("order", Map.of(
            "customerId", customerId,
            "customerName", "Demo Customer",
            "lineItems", List.of(Map.of(
                    "productId", productId,
                    "productName", "Demo Widget",
                    "quantity", 2,
                    "unitPrice", 29.99
            ))
    ));

    return Map.of("scenario", "happy_path", "steps", List.of(...));
}

Each scenario uses shortId() — a UUID fragment — so you can run the same scenario multiple times without naming collisions. The payment_failure scenario creates a $16,500 order that exceeds the $10,000 payment limit, triggering saga compensation. The saga_timeout scenario creates an order with minimal stock, designed to hit the deadline. These are pre-built investigation targets — the AI equivalent of a test fixture.
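The article only describes shortId() as a UUID fragment, so the exact length is a guess. A sketch under that assumption (the ShortIds class name and the 8-character length are both hypothetical):

```java
import java.util.UUID;

// Hypothetical sketch of shortId(): the first 8 hex characters of a random UUID.
final class ShortIds {

    static String shortId() {
        // A UUID string starts with 8 hex digits before its first dash
        return UUID.randomUUID().toString().substring(0, 8);
    }

    public static void main(String[] args) {
        System.out.println("DEMO-" + shortId());
    }
}
```

Eight hex characters give about four billion possible values, which is plenty to keep repeated demo runs from colliding on SKU or email.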

Stdio vs. SSE: Two Transport Modes

Default: stdio (Local Development)

# application.properties
spring.main.web-application-type=none
spring.ai.mcp.server.name=ecommerce-mcp-server

The AI assistant launches the server as a subprocess and communicates via stdin/stdout using JSON-RPC:

stdio transport: Claude Code spawns the MCP server as a java -jar subprocess and communicates over stdin and stdout using JSON-RPC 2.0

No network port needed. This is the default for local development with Claude Code or Claude Desktop.

Docker: SSE/HTTP (Networked Deployment)

# application-docker.properties
spring.main.web-application-type=servlet
spring.ai.mcp.server.stdio=false
server.port=8085

In Docker, the MCP server runs as a web service with Server-Sent Events on port 8085:

mcp-server:
  build: ../mcp-server
  ports:
    - "8085:8085"
  environment:
    - SPRING_PROFILES_ACTIVE=docker
    - MCP_SERVICES_ACCOUNT_URL=http://account-service:8081
    - MCP_SERVICES_INVENTORY_URL=http://inventory-service:8082
    - MCP_SERVICES_ORDER_URL=http://order-service:8083
    - MCP_SERVICES_PAYMENT_URL=http://payment-service:8084

The profile switch is the only difference between the two modes. Same tool code, same behavior.

Testing

Each tool has unit tests that mock ServiceClientOperations:

@ExtendWith(MockitoExtension.class)
class QueryViewToolTest {

    @Mock
    private ServiceClientOperations serviceClient;

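    private QueryViewTool queryViewTool;

    // Jackson mapper used to parse the tool's JSON output in assertions
    private final ObjectMapper objectMapper = new ObjectMapper();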

    @BeforeEach
    void setUp() {
        queryViewTool = new QueryViewTool(serviceClient);
    }

    @Test
    void shouldQueryByKey() throws JsonProcessingException {
        when(serviceClient.getEntity("customer", "c1"))
                .thenReturn(Map.of("customerId", "c1", "name", "Alice"));

        String result = queryViewTool.queryView("customer", "c1", null);

        verify(serviceClient).getEntity("customer", "c1");
        Map<String, Object> parsed = objectMapper.readValue(result, new TypeReference<>() {});
        assertNotNull(parsed.get("customerId"));
    }
}

Eleven test classes cover all ten tools plus the ServiceClient. Add another six for the security layer (more on that below) and one integration suite, and the mcp-server module sits at 143 tests total.

Integration tests use Spring’s ApplicationContextRunner to verify bean wiring without starting the MCP stdio transport (which would block in a test environment):

@DisplayName("MCP Tool Integration")
class McpToolIntegrationTest {

    private final ApplicationContextRunner contextRunner = new ApplicationContextRunner()
            .withConfiguration(AutoConfigurations.of(McpToolConfig.class))
            .withUserConfiguration(TestServiceClientConfig.class)
            .withBean(McpServerProperties.class);

    @Test
    void shouldCreateAllToolBeans() {
        contextRunner.run(context -> {
            assertThat(context).hasSingleBean(QueryViewTool.class);
            assertThat(context).hasSingleBean(SubmitEventTool.class);
            // ... all 10 tools
        });
    }

    @Test
    void shouldRegisterToolCallbackProvider() {
        contextRunner.run(context -> {
            ToolCallbackProvider provider = context.getBean(ToolCallbackProvider.class);
            assertThat(provider.getToolCallbacks()).hasSize(10);
        });
    }
}

Configuration

Beyond the optional security settings covered later, the MCP server has exactly four configuration properties, one base URL per service:

mcp.services.account-url=http://localhost:8081
mcp.services.inventory-url=http://localhost:8082
mcp.services.order-url=http://localhost:8083
mcp.services.payment-url=http://localhost:8084

In Docker, these are overridden by environment variables pointing to container hostnames. That’s it. No Hazelcast configuration, no cluster membership, no pipeline setup.

Setting Up the AI Assistant

Claude Code

Add to ~/.claude/claude_code_config.json:

{
  "mcpServers": {
    "hazelcast-ecommerce": {
      "command": "java",
      "args": ["-jar", "mcp-server/target/mcp-server-1.0.0-SNAPSHOT.jar"]
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "hazelcast-ecommerce": {
      "command": "java",
      "args": ["-jar", "/absolute/path/to/mcp-server-1.0.0-SNAPSHOT.jar"]
    }
  }
}

Docker (SSE)

Point the AI assistant to the SSE endpoint: http://localhost:8085/mcp

In all three cases, the AI discovers the ten tools via tools/list and starts using them immediately.

A Real Conversation

Here’s what a debugging session looks like with MCP connected:

You: “Something’s wrong with recent orders. Can you check?”

Assistant: calls list_sagas(status="FAILED")

calls get_metrics()

“I found 2 failed sagas out of 142 total. Your overall success rate is 98.6%. Let me investigate the failures.”

calls inspect_saga(sagaId="saga-fail-1")

calls inspect_saga(sagaId="saga-fail-2")

“Both failures are payment rejections:

saga-fail-1 (Order ORD-088): Total $15,000 exceeded the $10,000 payment limit. Stock was released via compensation.
saga-fail-2 (Order ORD-091): Total $16,500, same payment limit issue. Stock also released.

These aren’t bugs — the payment service is correctly rejecting orders above the limit. You might want to either raise the limit or add client-side validation to prevent orders over $10,000 from being submitted.”

Four tool calls, zero curl commands, and a root-cause analysis with a recommendation. From one question. I’ll be honest: the first time I watched the AI chain together the right sequence of calls and arrive at a correct diagnosis, it felt a little eerie. Like watching someone drive your car better than you do.

Authentication and Tool Authorization

The first version of this server had no authentication, which is fine for local development and obviously not fine for anything else. So we’ve added API key authentication and role-based tool access — disabled by default to preserve backward compatibility, and enabled with a single property when you need it.

mcp:
  security:
    enabled: true
    api-keys:
      viewer-key-12345: VIEWER
      operator-key-67890: OPERATOR
      admin-key-99999: ADMIN

In HTTP/SSE mode the key arrives in the X-API-Key request header. In stdio mode it’s read from the MCP_API_KEY environment variable. Either way, the server resolves the key to a role, and a ToolAuthorizer checks whether the role is permitted to invoke the tool the AI just asked for.

Three roles are defined:

VIEWER — Read-only. Can call query_view, get_event_history, inspect_saga, list_sagas, get_metrics, list_dlq_entries, and inspect_dlq_entry. Cannot modify state.
OPERATOR — Read plus write. Adds submit_event, run_demo, and replay_dlq_entry.
ADMIN — Same as OPERATOR today, reserved for future admin-only tools.

run_demo is a good example of why the role split matters — it’s the kind of tool you absolutely do not want firing in production, and the default VIEWER key keeps that off the table. The viewer can do everything an SRE wants to do during an incident — query, inspect, look at metrics — but it can’t accidentally place an order.
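The role check itself can be tiny. Here is a sketch of how a key-to-role lookup and a per-tool permission check might fit together. The tool names and roles come from the lists above, but the ToolAuthorizerSketch class and its isAllowed method are assumptions, not the framework's actual ToolAuthorizer.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of role-based tool authorization; not the framework's ToolAuthorizer.
final class ToolAuthorizerSketch {

    enum Role { VIEWER, OPERATOR, ADMIN }

    private static final Set<String> READ_TOOLS = Set.of(
            "query_view", "get_event_history", "inspect_saga", "list_sagas",
            "get_metrics", "list_dlq_entries", "inspect_dlq_entry");

    private static final Set<String> WRITE_TOOLS = Set.of(
            "submit_event", "run_demo", "replay_dlq_entry");

    private final Map<String, Role> apiKeys;

    ToolAuthorizerSketch(Map<String, Role> apiKeys) {
        this.apiKeys = Map.copyOf(apiKeys);
    }

    boolean isAllowed(String apiKey, String toolName) {
        // Unknown or missing keys resolve to no role and are always rejected.
        Role role = (apiKey == null) ? null : apiKeys.get(apiKey);
        if (role == null) {
            return false;
        }
        if (READ_TOOLS.contains(toolName)) {
            return true; // every role can read
        }
        // OPERATOR and ADMIN may also call the state-changing tools.
        return role != Role.VIEWER && WRITE_TOOLS.contains(toolName);
    }
}
```

Denying by default is the design choice worth copying: a tool that appears in neither set is rejected for every role, so a newly added tool stays locked until someone decides who may call it.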

One layer is still missing: the MCP server authenticates its callers, but it doesn’t forward caller identity to the downstream microservices. For a real production deployment you’d want both. We’ll come back to that.

Where This Goes Next

A few directions we haven’t explored yet.

MCP supports streaming responses, which we’d want for large result sets — listing thousands of events as a single JSON blob isn’t great. MCP also has resources, read-only data endpoints that the AI can reference as context without explicitly calling a tool. The materialized views are a natural fit for that.

OAuth forwarding is the gap mentioned above — the MCP server’s caller identity needs to propagate down to the backend services if we want end-to-end auth in production. The plumbing exists in Spring Security; we just haven’t wired it up.

And with the MCP server as a foundation, you could build specialized AI agents — an operations agent that monitors sagas and flags anomalies, a demo agent that walks users through the system, a testing agent that creates targeted test data and verifies compensation paths. We haven’t built any of these yet, but the tool layer is there.

The MCP server adds a natural-language interface to everything we’ve built so far. Ten tools, a thin REST proxy, two transport modes, role-based authorization, 143 tests. It doesn’t add new capabilities to the data layer — it makes the existing capabilities accessible through conversation. And that turns out to matter more than it sounds like it should. The investigation that took five curl commands now takes one sentence. The demo that required a script and documentation now requires “show me the happy path.” The system that was only inspectable by people who knew the API endpoints is now inspectable by anyone who can ask a question.

That’s where we’ll leave things for today.

Next up: Circuit Breakers and Retry for Saga Resilience

Previous: Vector Similarity Search with Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 8, 2026
