Tag: Distributed Transactions

  • Hazelcast Transactional Outbox: Guaranteed Delivery

    Part 8 in the “Building Event-Driven Microservices with Hazelcast” series


    Introduction

    In Part 7, we added circuit breakers and retry to protect saga listeners from transient failures on the consumer side. That covers what happens when a service receives an event and can’t process it. But we haven’t talked about what happens when the event never leaves the building.

    Quick refresher on our dual-instance architecture: each service runs an embedded Hazelcast instance for local Jet pipeline processing and a client connected to the shared cluster for cross-service ITopic communication. After the pipeline processes an event, the EventSourcingController republishes it to the shared cluster so saga listeners in other services can react.

    That republish step? It was a fire-and-forget call:

    // The old approach — fragile
    try {
        ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
        topic.publish(pending.eventRecord);
    } catch (Exception e) {
        logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
        // Event is permanently lost!
    }
    

    If the shared cluster is unreachable — network partition, cluster restart, someone tripping over the power cable — the event vanishes. The saga never progresses. Eventually the saga timeout detector marks it as failed, but by then the original event data is gone and there’s nothing to retry.

    The Transactional Outbox Pattern fixes this. Instead of publishing directly to the shared cluster, the controller writes the event to a local outbox — an IMap on the embedded Hazelcast instance — and a separate publisher component picks it up and delivers it. If delivery fails, the entry stays in the outbox and gets retried.


    Why Direct Publishing Fails

    The problem is fundamental. Publishing to an external system (the shared cluster) and completing a local operation (the Jet pipeline) are two separate operations that can’t be made atomic.

    Failure timeline for direct publishing — the Jet pipeline updates the local event store and materialized view, but the publish to the shared cluster ITopic fails on a network partition and the event is lost with nothing left to retry

    The event is safely stored in the local event store and materialized view, but the cross-service notification is lost. You could retry in place, but that blocks the Jet pipeline for all events. You could schedule an async retry, but if the process restarts, that retry state is gone too.

    The outbox pattern trades immediate delivery for guaranteed delivery. Write to a durable local store, deliver asynchronously, retry until it works. It’s the standard solution in event-driven architectures for good reason.


    Architecture

    Transactional outbox architecture — the EventSourcingController writes each event to a durable local OUTBOX IMap and signals the OutboxPublisher via a semaphore; the publisher claims pending entries and delivers them to the shared cluster ITopic, retrying on failure

    The outbox IMap lives on the embedded Hazelcast instance — the same instance that hosts the event store and materialized views. Writing to it is a local operation. If the embedded instance is up (and it must be, since the pipeline just ran), the outbox write succeeds.
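
    The write-locally, deliver-later flow can be sketched with plain Java collections. This is a minimal illustration, not the framework's code: InMemoryOutbox and Transport are hypothetical names, a LinkedHashMap stands in for the local IMap, and delivered entries are simply removed rather than marked DELIVERED.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the shared cluster: publish may throw.
interface Transport {
    void publish(String eventType, String payload) throws Exception;
}

// Hypothetical in-memory outbox; the map plays the role of the local IMap.
class InMemoryOutbox {
    // eventId -> {eventType, payload}, kept in insertion order (oldest first)
    private final Map<String, String[]> pending = new LinkedHashMap<>();

    // Step 1: the controller only writes locally; no network call happens here.
    synchronized void write(String eventId, String eventType, String payload) {
        pending.put(eventId, new String[] {eventType, payload});
    }

    // Step 2: the publisher tries every pending entry and removes it only on success.
    synchronized int deliverPending(Transport transport) {
        int delivered = 0;
        List<String> ids = new ArrayList<>(pending.keySet());
        for (String id : ids) {
            String[] entry = pending.get(id);
            try {
                transport.publish(entry[0], entry[1]);
                pending.remove(id);
                delivered++;
            } catch (Exception e) {
                // Delivery failed: the entry stays in the outbox for the next cycle.
            }
        }
        return delivered;
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

    A failed cycle leaves the entry in place, and a later cycle against a healthy transport drains it. That property is what the direct publish call lacks.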


    The OutboxEntry

    Each outbox entry captures everything needed to deliver the event later:

    public class OutboxEntry {
    
        private String eventId;          // Matches the domain event's eventId
        private String eventType;        // ITopic name (e.g., "OrderCreated")
        private GenericRecord eventRecord; // The serialized event to publish
        private int retryCount;          // Delivery attempts so far
        private Status status;           // PENDING, DELIVERED, or FAILED
        private Instant createdAt;       // When the entry was created
        private Instant lastAttemptAt;   // When the last delivery attempt occurred
        private String failureReason;    // Most recent failure message
    
        public enum Status {
            PENDING,    // Awaiting delivery
            DELIVERED,  // Successfully published to shared cluster
            FAILED      // Permanently failed after max retries
        }
    }
    

    The eventRecord field is the full GenericRecord that needs to go to the shared cluster’s ITopic — same record the Jet pipeline produces, complete with saga metadata like sagaId and correlationId.


    OutboxStore: The Interface

    Six methods covering the full lifecycle:

    public interface OutboxStore {
    
        void write(OutboxEntry entry);
    
        List<OutboxEntry> pollPending(int maxBatchSize);
    
        void markDelivered(String eventId);
    
        void markFailed(String eventId, String reason);
    
        void incrementRetryCount(String eventId, String failureReason);
    
        long pendingCount();
    }
    

    Provider-agnostic. The Hazelcast implementation uses an IMap, but the interface could just as easily sit in front of a database table.


    HazelcastOutboxStore

    The Hazelcast implementation stores entries as Compact-serialized GenericRecord values in an IMap:

    public class HazelcastOutboxStore implements OutboxStore {
    
        private static final String SCHEMA_NAME = "OutboxEntry";
        private final IMap<String, GenericRecord> outboxMap;
    
        public HazelcastOutboxStore(HazelcastInstance hazelcast, MeterRegistry meterRegistry) {
            this.outboxMap = hazelcast.getMap(DEFAULT_MAP_NAME);
        }
    }
    

    You might wonder why we’re using GenericRecord instead of storing OutboxEntry Java objects directly. The problem is that OutboxEntry has two Instant fields and a nested GenericRecord, neither of which Hazelcast’s zero-config Compact serialization can handle. We’d need a custom CompactSerializer registered on every Hazelcast instance configuration. Instead, we convert at the boundary:

    static GenericRecord toRecord(final OutboxEntry entry) {
        return GenericRecordBuilder.compact(SCHEMA_NAME)
                .setString("eventId", entry.getEventId())
                .setString("eventType", entry.getEventType())
                .setGenericRecord("eventRecord", entry.getEventRecord())
                .setInt32("retryCount", entry.getRetryCount())
                .setString("status", entry.getStatus().name())
                .setInt64("createdAt", entry.getCreatedAt().toEpochMilli())
                .setNullableInt64("lastAttemptAt",
                        entry.getLastAttemptAt() != null
                                ? entry.getLastAttemptAt().toEpochMilli() : null)
                .setString("failureReason", entry.getFailureReason())
                .build();
    }
    

    A few things going on here. Instant becomes int64 epoch millis — compact, sortable, unambiguous. lastAttemptAt uses setNullableInt64 because it’s null until the first delivery attempt. The nested eventRecord uses setGenericRecord, which Compact handles natively. And status is stored as the enum name string, which makes it readable in Management Center and queryable with Predicates.equal().
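
    The Instant conversion has one side effect worth seeing in isolation: epoch millis drop anything below millisecond precision. A small sketch using only java.time (InstantCodec is a hypothetical helper name, not a class from the framework):

```java
import java.time.Instant;

// Hypothetical helpers mirroring the Instant <-> int64 conversion described above.
final class InstantCodec {
    private InstantCodec() {}

    // Nullable Instant -> nullable epoch millis (null until the first delivery attempt).
    static Long toMillis(Instant instant) {
        return instant == null ? null : instant.toEpochMilli();
    }

    // Nullable epoch millis -> nullable Instant.
    static Instant fromMillis(Long millis) {
        return millis == null ? null : Instant.ofEpochMilli(millis);
    }
}
```

    A round trip returns the original instant truncated to milliseconds, so two entries created within the same millisecond compare as equal on createdAt.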

    Polling uses a Hazelcast predicate to filter by status, sorted by creation time so the oldest entries are delivered first:

    @Override
    public List<OutboxEntry> pollPending(final int maxBatchSize) {
        final Collection<GenericRecord> pending = outboxMap.values(
                Predicates.equal("status", OutboxEntry.Status.PENDING.name()));
    
        return pending.stream()
                .map(HazelcastOutboxStore::fromRecord)
                .sorted(Comparator.comparing(OutboxEntry::getCreatedAt))
                .limit(maxBatchSize)
                .collect(Collectors.toList());
    }
    

    The OutboxPublisher

    The publisher bridges the outbox and the shared cluster. The obvious approach is to poll on a fixed interval — once per second, say — but that adds latency we don’t need. We know exactly when a new entry arrives.

    Event-Driven Wake-Up

    The publisher uses a Semaphore to sleep until someone signals it:

    public class OutboxPublisher {
    
        private final Semaphore wakeUp = new Semaphore(0);
    
        public void notifyNewEntry() {
            // Release at most 1 permit — avoids unbounded accumulation
            if (wakeUp.availablePermits() == 0) {
                wakeUp.release();
            }
        }
    
        public boolean waitForWork() {
            try {
                return wakeUp.tryAcquire(
                        properties.getPollInterval().toMillis(),
                        TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
    

    When the EventSourcingController writes an outbox entry, it calls notifyNewEntry() right after. The publisher wakes up, claims all pending entries, delivers them. Under normal conditions, the time from event creation to shared-cluster delivery is sub-millisecond.

    The poll interval (default 1 second) is the safety net. If a signal gets missed — maybe the publisher was busy with a previous batch — the timeout ensures nothing sits around for too long.

    This is a JVM-local semaphore, not a distributed one. That’s fine. When the service scales to multiple replicas with per-service clustering (ADR 013), each replica has its own publisher. The semaphore wakes the local publisher instantly for locally-written events. Events written by other replicas get picked up within the poll interval. The actual coordination — preventing two replicas from delivering the same event — happens in claimPending() via an atomic ClaimEntryProcessor on the IMap.
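
    One detail in notifyNewEntry(): the availablePermits() check followed by release() is a check-then-act, so two threads signaling at the same moment can both release a permit. That is harmless, but if you want a burst of signals to cause exactly one wake-up, the consumer can drain leftovers after it acquires. A sketch of that variant (WakeUpSignal is a hypothetical name, and the timeout is passed in directly rather than read from properties):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical variant of the wake-up signal that collapses extra permits on the consumer side.
class WakeUpSignal {
    private final Semaphore wakeUp = new Semaphore(0);

    // Same guard as the original; concurrent callers may still add a few extra permits.
    void notifyNewEntry() {
        if (wakeUp.availablePermits() == 0) {
            wakeUp.release();
        }
    }

    // Returns true if signaled, false on timeout or interrupt.
    boolean waitForWork(long timeoutMillis) {
        try {
            boolean signaled = wakeUp.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
            // Discard any permits that accumulated, so one batch run covers the whole burst.
            wakeUp.drainPermits();
            return signaled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

    Draining is safe here because the publish loop claims every pending entry on each wake-up, not one entry per permit.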

    The Publish Loop

    public void publishPendingEntries() {
        if (sharedHazelcast == null) {
            if (!noSharedClusterWarningLogged) {
                logger.warn("No shared Hazelcast instance — outbox delivery skipped");
                noSharedClusterWarningLogged = true;
            }
            return;
        }
    
        List<OutboxEntry> claimed = outboxStore.claimPending(
                properties.getMaxBatchSize(), memberUuid);
    
        if (claimed.isEmpty()) {
            return;
        }
    
        for (OutboxEntry entry : claimed) {
            try {
                ITopic<GenericRecord> topic = sharedHazelcast.getTopic(entry.getEventType());
                topic.publish(entry.getEventRecord());
                outboxStore.markDelivered(entry.getEventId());
            } catch (Exception e) {
                if (entry.getRetryCount() + 1 >= properties.getMaxRetries()) {
                    outboxStore.markFailed(entry.getEventId(),
                            "Max retries exceeded: " + e.getMessage());
                } else {
                    outboxStore.incrementRetryCount(entry.getEventId(), e.getMessage());
                }
            }
        }
    }
    

    Note claimPending rather than pollPending. (Neither claimPending nor a CLAIMED status appears in the OutboxStore and OutboxEntry listings earlier in this article.) The claiming mechanism uses an EntryProcessor to atomically transition entries from PENDING to CLAIMED, tagging them with the claiming member’s UUID. This prevents two publisher instances from delivering the same event, which matters once you’re running multiple replicas.
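
    The claim step boils down to an atomic compare-and-transition per key. The sketch below models it in memory, with ConcurrentHashMap.computeIfPresent standing in for the EntryProcessor (both apply the update atomically for a given key). ClaimTable and its methods are hypothetical names, not the framework's ClaimEntryProcessor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the claim step.
class ClaimTable {
    enum Status { PENDING, CLAIMED }

    static final class Row {
        final Status status;
        final String claimedBy; // null while PENDING

        Row(Status status, String claimedBy) {
            this.status = status;
            this.claimedBy = claimedBy;
        }
    }

    private final Map<String, Row> rows = new ConcurrentHashMap<>();

    void addPending(String eventId) {
        rows.put(eventId, new Row(Status.PENDING, null));
    }

    // Returns true only for the single caller that moved the row from PENDING to CLAIMED.
    boolean tryClaim(String eventId, String memberUuid) {
        boolean[] won = {false};
        rows.computeIfPresent(eventId, (id, row) -> {
            if (row.status == Status.PENDING) {
                won[0] = true;
                return new Row(Status.CLAIMED, memberUuid);
            }
            return row; // already claimed: leave untouched
        });
        return won[0];
    }

    String claimedBy(String eventId) {
        Row row = rows.get(eventId);
        return row == null ? null : row.claimedBy;
    }
}
```

    Whichever publisher gets there first wins the entry; every later attempt sees CLAIMED and skips it.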

    When no shared cluster is configured (single-node dev mode), the publisher logs one warning and stops trying. Events pile up as PENDING in the outbox. They’ll drain as soon as a shared cluster appears.

    Retry escalation is per-entry:

    Attempt 1: fails → incrementRetryCount (retryCount=1)
    Attempt 2: fails → incrementRetryCount (retryCount=2)
    ...
    Attempt 5: fails → markFailed (retryCount + 1 = 5 >= maxRetries = 5)
    

    Once marked FAILED, the entry stops showing up in claim results. The failure reason is preserved for debugging.
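
    The escalation rule is small enough to isolate as a pure function, which makes the off-by-one easy to check. A sketch (RetryPolicy and Outcome are hypothetical names; the condition is the one from the publish loop):

```java
// Hypothetical pure function capturing the escalation rule used in the publish loop.
final class RetryPolicy {
    enum Outcome { RETRY, FAILED }

    private RetryPolicy() {}

    // retryCount is the number of failed attempts recorded before the current one.
    static Outcome afterFailure(int retryCount, int maxRetries) {
        return (retryCount + 1 >= maxRetries) ? Outcome.FAILED : Outcome.RETRY;
    }
}
```

    With maxRetries set to 5, failures at retryCount 0 through 3 schedule another attempt, and the failure at retryCount 4 (the fifth attempt) marks the entry FAILED.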

    Scheduling

    OutboxAutoConfiguration hooks the publisher into Spring’s task scheduler:

    @EnableScheduling
    public class OutboxAutoConfiguration implements SchedulingConfigurer {
    
        @Override
        public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
            taskRegistrar.addFixedDelayTask(() -> {
                outboxPublisher.waitForWork();       // blocks until signaled or timeout
                outboxPublisher.publishPendingEntries();
            }, 1);  // 1ms loop delay — actual timing controlled by semaphore
        }
    }
    

    The 1ms fixed delay means the loop restarts almost immediately after each cycle, but waitForWork() controls the actual pacing. The thread blocks on the semaphore until either a permit is released or the poll interval elapses. Near-instant delivery under normal load, guaranteed pickup if a signal is missed.


    Integration with EventSourcingController

    The controller’s republishToSharedCluster now checks for an outbox store first:

    private void republishToSharedCluster(PendingCompletion<K> pending) {
        if (sharedHazelcast == null || pending.eventRecord == null || pending.eventType == null) {
            return;
        }
        if (outboxStore != null) {
            OutboxEntry entry = new OutboxEntry(
                    pending.completionInfo.getEventId(),
                    pending.eventType,
                    pending.eventRecord
            );
            outboxStore.write(entry);
            if (outboxPublisher != null) {
                outboxPublisher.notifyNewEntry();
            }
        } else {
            // Legacy direct publish (when outbox is disabled)
            try {
                ITopic<GenericRecord> topic = sharedHazelcast.getTopic(pending.eventType);
                topic.publish(pending.eventRecord);
            } catch (Exception e) {
                logger.warn("Failed to republish event {}: {}", pending.eventType, e.getMessage());
            }
        }
    }
    

    Fully backward compatible. When outboxStore is injected, events go through the durable path. When it’s null, you get the old fire-and-forget behavior. The OutboxStore is wired through each service’s config as an optional dependency:

    @Bean
    public EventSourcingController<Order, String, DomainEvent<Order, String>> orderController(
            HazelcastInstance hazelcastInstance,
            @Qualifier("hazelcastClient") HazelcastInstance hazelcastClient,
            @Autowired(required = false) OutboxStore outboxStore,
            ...) {
        return EventSourcingController.builder()
                .hazelcast(hazelcastInstance)
                .sharedHazelcast(hazelcastClient)
                .outboxStore(outboxStore)
                .build();
    }
    

    Delivery Guarantees

    The outbox provides at-least-once delivery. If the publisher crashes after publishing to the ITopic but before calling markDelivered(), the next cycle picks up the same entry and delivers it again. Events are never lost as long as the embedded Hazelcast instance’s IMap data is intact.

    At-least-once means consumers may see duplicates. That’s where the Idempotency Guard from Part 9 comes in — it deduplicates on the consumer side, complementing the outbox’s guaranteed delivery.
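
    To make the duplicate problem concrete, here is the simplest possible consumer-side dedupe: remember every eventId already handled and skip repeats. This is only an illustration, not the Part 9 Idempotency Guard. DedupingConsumer is a hypothetical name, the seen-set grows without bound, and it records the id before processing, so a failure inside the delegate would need separate handling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical wrapper that drops events whose eventId has already been seen.
class DedupingConsumer {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Consumer<String> delegate;

    DedupingConsumer(Consumer<String> delegate) {
        this.delegate = delegate;
    }

    // Returns true if the event was processed, false if it was a duplicate.
    boolean onEvent(String eventId) {
        if (!seen.add(eventId)) {
            return false; // add() returns false when the id was already present
        }
        delegate.accept(eventId);
        return true;
    }
}
```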

    As for ordering: events for the same aggregate are written to the outbox in sequence order (the Jet pipeline processes them sequentially), and claimPending sorts by createdAt. But if two events are pending simultaneously and the first one fails while the second succeeds, they’ll arrive out of order. For our saga use case that’s acceptable — each step is identified by sagaId and eventType, and the saga state machine handles duplicates and out-of-order delivery.


    Configuration

    framework.outbox.*

    Property        Default     Description
    enabled         true        Master toggle for the outbox pattern
    poll-interval   1000 (ms)   Fallback interval if signal is missed
    max-batch-size  50          Maximum entries per poll cycle
    max-retries     5           Delivery attempts before permanent failure
    entry-ttl       24h         How long DELIVERED entries survive in the map
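
    For reference, the defaults above could be held in a plain settings object like the one below. This is a sketch only: OutboxSettings is a hypothetical name, and the framework's actual properties class (the one behind getPollInterval(), getMaxBatchSize(), and getMaxRetries() in the publisher) may be shaped differently.

```java
import java.time.Duration;

// Hypothetical holder for the framework.outbox.* settings, using the documented defaults.
class OutboxSettings {
    private boolean enabled = true;                           // enabled
    private Duration pollInterval = Duration.ofMillis(1000);  // poll-interval
    private int maxBatchSize = 50;                            // max-batch-size
    private int maxRetries = 5;                               // max-retries
    private Duration entryTtl = Duration.ofHours(24);         // entry-ttl

    boolean isEnabled() { return enabled; }
    Duration getPollInterval() { return pollInterval; }
    int getMaxBatchSize() { return maxBatchSize; }
    int getMaxRetries() { return maxRetries; }
    Duration getEntryTtl() { return entryTtl; }

    void setEnabled(boolean enabled) { this.enabled = enabled; }
    void setPollInterval(Duration pollInterval) { this.pollInterval = pollInterval; }
    void setMaxBatchSize(int maxBatchSize) { this.maxBatchSize = maxBatchSize; }
    void setMaxRetries(int maxRetries) { this.maxRetries = maxRetries; }
    void setEntryTtl(Duration entryTtl) { this.entryTtl = entryTtl; }
}
```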

    Metrics

    Metric                    Type     Description
    outbox.entries.written    Counter  Events written to the outbox
    outbox.entries.delivered  Counter  Events delivered to shared cluster
    outbox.entries.failed     Counter  Events permanently failed
    outbox.publish.duration   Timer    Time per publish cycle

    To disable the outbox and use direct publishing:

    framework:
      outbox:
        enabled: false
    

    What’s Next

    The outbox guarantees events reach the shared cluster. But what happens when they get there and the consumer can’t process them? The consumer might crash, the business logic might throw, the circuit breaker might be open.

    In Part 9, we add two patterns that work together: a Dead Letter Queue that captures events that fail consumer-side processing, and an Idempotency Guard that prevents duplicate processing — the natural flip side of at-least-once delivery.


    Next up: Dead Letter Queues and Idempotency

    Previous: Circuit Breakers and Retry: Resilient Hazelcast Sagas

    Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.
  • Saga Pattern: Distributed Transactions Without 2PC

    Part 4 in the “Building Event-Driven Microservices with Hazelcast” series


    Introduction

    In the first three articles, we built an event sourcing framework with a Jet pipeline and materialized views. Each service is self-contained — it owns its events, its views, and its data. And one of the things that makes event-driven architecture powerful is that event publishing is fire-and-forget. The publisher doesn’t know or care who’s receiving the messages, or whether anyone acted on them properly.

    But someone cares. Probably your boss.

    That decoupling is a feature, not a bug — it’s exactly what gives you the independence to evolve services separately. If you wanted to add a loyalty program based on orders placed, you wouldn’t touch a single line of existing code. You’d build the new functionality and subscribe it to OrderCreated events. Done. No coordination, no release trains, no “can you add a field to your response for us.”

    But that same independence becomes a problem when a business operation requires coordination. Placing an order in our eCommerce system takes four steps across three services:

    1. Order Service creates the order
    2. Inventory Service reserves stock
    3. Payment Service charges the customer
    4. Order Service confirms the order

    If payment fails after stock is reserved, we need to release the stock. If the inventory service is down, we need to cancel the order. The fire-and-forget publisher doesn’t know any of this happened — and nobody else is keeping track unless we build something that does.

    Now — a real eCommerce system wouldn’t cancel an order just because the inventory service hiccupped. You’d create a backorder, or queue the reservation for retry, or do whatever it takes to keep the customer’s money. Nobody in the business is going to say “yeah, let’s give up on that sale because a container restarted.” But we’re building a framework to demonstrate microservice patterns, not to compete with Shopify. The cancel-and-compensate flow shows sagas at their most illustrative: forward steps, failure detection, compensation, and state tracking across services. That’s the machinery we want to examine. So we went with the version that best shows how the pattern works, not the version that best sells laptops.

    This is the distributed transaction problem, and sagas are the solution.


    Why Sagas? Because the Alternative Is Worse.

    We didn’t adopt event sourcing because it was fashionable. We adopted it because the alternative — mutable state scattered across services with no audit trail — was worse. Sagas are the same kind of choice.

    When a business operation touches one database, you wrap it in a transaction. ACID gives you atomicity: either all the changes commit or none of them do. Every developer learns this in their first database course, and it works.

    But our order fulfillment touches three services, each with its own data. A single database transaction can’t span them. So you reach for two-phase commit (2PC), the textbook answer for distributed transactions. And that’s where the trouble starts.

    2PC works by appointing a coordinator that asks each participant “can you commit?” and then, once everyone says yes, tells them all to go ahead. The problem is what happens between those two phases. Every participant is holding locks — on inventory rows, on payment records, on order state — while waiting for the coordinator’s decision. In a monolith, that wait is microseconds. Across a network, it’s milliseconds at best, and if the coordinator is slow or the network partitions, those locks can be held for seconds. Or longer.

    Think about what that means for throughput. While one order is sitting in that limbo state between “prepare” and “commit,” no other transaction can touch those same inventory rows. At 50 orders per second, you’ve got 50 sets of distributed locks competing for the same resources, each one waiting on network round-trips to a coordinator that is itself a single point of failure. If the coordinator crashes mid-protocol, those locks stay held until someone intervenes manually. The participants are stuck — they’ve promised to commit but haven’t been told to, and they can’t unilaterally release the locks without risking inconsistency.

    It’s not that 2PC is theoretically wrong. It works fine in environments where all participants are on the same local network, latency is sub-millisecond, and the coordinator never fails. That environment is called “a single database server.” Once you’ve distributed beyond that, you’re fighting physics.

    Sagas take a fundamentally different approach. Instead of one big atomic transaction with distributed locks, you execute a sequence of local transactions — each one fast, each one touching only one service’s data, each one committing immediately. If step 3 fails, you don’t roll back steps 1 and 2 by releasing locks. You compensate — you issue new transactions that undo the business effect of the earlier steps. Reserve stock, then charge payment, then… payment fails? Issue a stock release. The saga doesn’t pretend the work never happened. It acknowledges what happened and fixes it forward.

    The trade-off is that you give up the clean atomicity of “all or nothing” and accept eventual consistency — a window of time where stock is reserved but payment hasn’t been attempted yet, or payment has been charged but the order isn’t confirmed. In our system, that window is typically under a second. For the user, the order is “processing.” For the system, the saga is in flight. And if something fails, compensation events fire and the system converges to a consistent state — just not instantaneously.
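
    The core mechanic (run local steps in order, and on failure undo the completed ones in reverse) fits in a few lines. The sketch below runs everything in one process purely to show the control flow. SagaSketch and Step are hypothetical names, and in the real system each step is a separate service reacting to events.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical single-process model of forward steps plus compensation.
class SagaSketch {
    // One local transaction plus the action that undoes its business effect.
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    // Runs steps in order. On failure, compensates completed steps in reverse order.
    static boolean run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                log.add(step.name);
                completed.push(step); // most recent step ends up on top
            } catch (RuntimeException e) {
                log.add(step.name + " failed");
                while (!completed.isEmpty()) {
                    Step done = completed.pop();
                    done.compensation.run();
                    log.add("compensate " + done.name);
                }
                return false;
            }
        }
        return true;
    }
}
```

    Note that the compensations are new actions appended to the log. Nothing is erased, which is the difference from a database rollback.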


    Choreography and Orchestration

    There are two ways to coordinate a saga. They represent genuinely different architectural philosophies, and neither is universally right.

    Choreography: Services React to Events

    In a choreographed saga, there’s no coordinator. Each service listens for events that concern it and reacts:

    Order Service publishes OrderCreated -->
      Inventory Service hears it, reserves stock, publishes StockReserved -->
        Payment Service hears it, charges customer, publishes PaymentProcessed -->
          Order Service hears it, confirms order
    

    The flow emerges from the interactions between services. No single component knows the full sequence. Each service knows only “when I see event X, I do Y and publish Z.”
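
    Here is that idea as a toy, with an in-process event bus standing in for ITopic. EventBus and ChoreographyDemo are hypothetical names, and each lambda plays the part of one service's listener. No listener references another; the four-event sequence only appears when they run together.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-process event bus standing in for ITopic publish/subscribe.
class EventBus {
    private final Map<String, List<Consumer<String>>> listeners = new HashMap<>();
    final List<String> published = new ArrayList<>(); // event types in publish order

    void subscribe(String eventType, Consumer<String> listener) {
        listeners.computeIfAbsent(eventType, k -> new ArrayList<>()).add(listener);
    }

    // Records the event, then hands the sagaId to every listener for that type.
    void publish(String eventType, String sagaId) {
        published.add(eventType);
        List<Consumer<String>> targets =
                listeners.getOrDefault(eventType, Collections.emptyList());
        for (Consumer<String> listener : targets) {
            listener.accept(sagaId);
        }
    }
}

class ChoreographyDemo {
    // Each "service" knows only: when I see X, I do my work and publish Z.
    static EventBus wire() {
        EventBus bus = new EventBus();
        bus.subscribe("OrderCreated", id -> bus.publish("StockReserved", id));        // inventory
        bus.subscribe("StockReserved", id -> bus.publish("PaymentProcessed", id));    // payment
        bus.subscribe("PaymentProcessed", id -> bus.publish("OrderConfirmed", id));   // order
        return bus;
    }
}
```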

    Orchestration: A Central Controller Directs the Flow

    In an orchestrated saga, one component — the orchestrator — knows the whole sequence and directs each step:

    Orchestrator --> "Order Service, create order"
    Orchestrator --> "Inventory Service, reserve stock"
    Orchestrator --> "Payment Service, charge customer"
    Orchestrator --> "Order Service, confirm order"
    

    The orchestrator is a state machine. It tracks where the saga is, what comes next, and what to undo if something fails.

    Choosing Between Them

    Choreography is the simpler choice when your saga is a linear chain — A triggers B triggers C triggers D — and the services are already publishing events for other reasons. If your system is event-driven (ours is), choreographed sagas are almost free. The services are already emitting events. The saga is just another consumer. No new infrastructure, no single point of failure, and each service stays fully independent.

    But choreography gets painful as the flow gets complex. Say step 2 needs to branch — charge a credit card or apply store credit, depending on the payment method. Now you’re encoding conditional logic across multiple listeners that can’t see each other. If someone asks “what does this saga actually do, end to end?” the answer is “go read three listener classes in three different services and piece it together.” For a four-step linear chain, fine. For a ten-step flow with branches and conditional skips, that’s a mess.

    That’s where orchestration earns its keep. The entire saga — forward steps, compensation steps, timeouts, retry logic — lives in a single definition file. You read it top to bottom. You can add per-step timeouts, per-step retries. The caller can wait for the result synchronously. The cost is coupling: the orchestrator needs to know about every service’s endpoint, and it becomes a component you have to keep running.

    A rough rule of thumb:

    Start with choreography when the saga is a linear chain with fewer than about five steps, the services already publish events, and you don’t need a synchronous response. Move to orchestration when you need branching or conditional logic, per-step timeout and retry control, a synchronous success/failure response for the caller, or when the saga has gotten complex enough that nobody can trace it across listeners without a whiteboard.

    We started with choreography for order fulfillment because it’s a clean four-step chain and our entire architecture is already built around events. Later, when we needed synchronous responses for certain API paths and wanted per-step timeout control, we added orchestration as a second option. Both patterns coexist in the framework, running against the same saga state store. We’ll cover the orchestration implementation and how we run both side by side in a later article.


    The Order Fulfillment Saga

    Here’s the full flow for the choreographed version:

    Order Fulfillment Saga happy path — four services emitting OrderCreated, StockReserved, PaymentProcessed, OrderConfirmed in sequence, with saga state transitioning from STARTED through IN_PROGRESS to COMPLETED

    Event Context Propagation

    Every saga event carries metadata that links the chain together:

    // These fields are present on every saga event
    String sagaId;         // Links all events in one saga instance
    String correlationId;  // Links to the original API request
    String sagaType;       // "OrderFulfillment"
    int stepNumber;        // Which step (0, 1, 2, 3)
    boolean isCompensating; // True for compensation events
    

    The sagaId is generated when the order is created and propagated through every subsequent event. That’s how the Inventory Service knows which saga an OrderCreated event belongs to, and how the Payment Service gets the payment details (amount, currency, method) from the StockReserved event’s context fields.
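
    Propagation amounts to copying the identity fields forward and changing only the step-specific ones. A sketch of that as an immutable value (SagaContext and its methods are hypothetical, and generating the sagaId from a random UUID is an assumption, since the article doesn't say how the id is produced):

```java
import java.util.UUID;

// Hypothetical immutable holder for the saga metadata fields listed above.
final class SagaContext {
    final String sagaId;
    final String correlationId;
    final String sagaType;
    final int stepNumber;
    final boolean isCompensating;

    private SagaContext(String sagaId, String correlationId, String sagaType,
                        int stepNumber, boolean isCompensating) {
        this.sagaId = sagaId;
        this.correlationId = correlationId;
        this.sagaType = sagaType;
        this.stepNumber = stepNumber;
        this.isCompensating = isCompensating;
    }

    // Called once, when the first event of the saga is created (UUID is an assumption).
    static SagaContext start(String sagaType, String correlationId) {
        return new SagaContext(UUID.randomUUID().toString(), correlationId, sagaType, 0, false);
    }

    // Same saga identity, next step number.
    SagaContext nextStep() {
        return new SagaContext(sagaId, correlationId, sagaType, stepNumber + 1, isCompensating);
    }

    // Same saga identity, flagged as a compensation event for the given step.
    SagaContext compensationFor(int step) {
        return new SagaContext(sagaId, correlationId, sagaType, step, true);
    }
}
```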


    Implementing Saga Listeners

    Each service has a saga listener that subscribes to relevant events via Hazelcast ITopic. Here’s the Payment Service’s:

    @Component
    public class PaymentSagaListener {
    
        public PaymentSagaListener(
                @Qualifier("hazelcastClient") HazelcastInstance hazelcast,
                PaymentService paymentService,
                SagaStateStore sagaStateStore) {
    
            // Listen for StockReserved --> process payment
            ITopic<GenericRecord> stockReservedTopic = hazelcast.getTopic("StockReserved");
            stockReservedTopic.addMessageListener(message -> {
                GenericRecord event = message.getMessageObject();
                String sagaId = event.getString("sagaId");
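                String orderId = event.getString("orderId");
                // Field name assumed: the call below passes customerId, so it has to be read from the event
                String customerId = event.getString("customerId");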
                String amount = event.getString("paymentAmount");
                String currency = event.getString("paymentCurrency");
                String method = event.getString("paymentMethod");
    
                paymentService.processPaymentForOrder(
                    orderId, customerId, amount, currency, method,
                    sagaId, event.getString("correlationId")
                );
            });
    
            // Listen for PaymentRefundRequested --> process refund
            ITopic<GenericRecord> refundTopic = hazelcast.getTopic("PaymentRefundRequested");
            refundTopic.addMessageListener(message -> {
                GenericRecord event = message.getMessageObject();
                String paymentId = event.getString("paymentId");
                String sagaId = event.getString("sagaId");
    
                paymentService.refundPaymentForSaga(
                    paymentId, "Saga compensation", sagaId,
                    event.getString("correlationId")
                );
            });
        }
    }
    

    The listener uses @Qualifier("hazelcastClient"), so it connects to the shared cluster, not the embedded instance. That’s the dual-instance architecture from Part 2. Each listener is a plain @Component that Spring creates on startup; the ITopic subscription stays active for the life of the service. And the listener itself doesn’t contain business logic: it unpacks the event, delegates to the service layer, and gets out of the way.


    Compensation: Undoing Work

    When a step fails, we need to undo the work completed by previous steps. This is compensation — often described as the saga equivalent of a rollback, though it isn’t really a rollback at all. More on that in a minute.

    Compensation Flow: Payment Fails

    Order Fulfillment Saga compensation flow — payment fails at step 3, triggering reverse-order compensation events PaymentRefundRequested, StockReleased, and OrderCancelled, ending in the COMPENSATED state

    Compensation Flow: Stock Unavailable

    StockReservationFailed event
             |
             '---> Order Service
                   '-- Cancels order (status: CANCELLED)
                   '-- No stock release needed (nothing was reserved)
    
             Saga finalized as COMPENSATED
    

    The Compensation Registry

    A CompensationRegistry maps each forward event to its compensating event and responsible service:

    @Configuration
    public class ECommerceCompensationConfig {
    
        @Bean
        public CompensationRegistrar ecommerceCompensations(CompensationRegistry registry) {
            // Step 0: OrderCreated --> compensate with OrderCancelled
            registry.register("OrderCreated", "OrderCancelled", "order-service");
    
            // Step 1: StockReserved --> compensate with StockReleased
            registry.register("StockReserved", "StockReleased", "inventory-service");
    
            // Step 2: PaymentProcessed --> compensate with PaymentRefunded
            registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");
    
            // Step 3: OrderConfirmed --> no compensation (terminal success)
            return () -> {};
        }
    }
    

    This is the one place that documents the entire saga structure. Even though the execution is distributed across listeners, the registry makes the step-to-compensation mapping readable in a single file. That’s something people miss about choreography — it doesn’t mean the structure is undocumented, it means the structure is declared separately from the execution.
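
    To make the mapping concrete, here is a minimal sketch of what such a registry could look like, including the reverse-order walk used during compensation. The class name CompensationRegistrySketch, the nested Compensation type, and the planCompensation method are hypothetical and chosen for illustration; the framework’s actual CompensationRegistry may be shaped differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified registry; not the framework's actual class.
final class CompensationRegistrySketch {

    // Pairs a compensating event with the service that owns it.
    static final class Compensation {
        final String compensatingEvent;
        final String service;

        Compensation(String compensatingEvent, String service) {
            this.compensatingEvent = compensatingEvent;
            this.service = service;
        }
    }

    private final Map<String, Compensation> byForwardEvent = new HashMap<>();

    void register(String forwardEvent, String compensatingEvent, String service) {
        byForwardEvent.put(forwardEvent, new Compensation(compensatingEvent, service));
    }

    Optional<Compensation> lookup(String forwardEvent) {
        return Optional.ofNullable(byForwardEvent.get(forwardEvent));
    }

    // Walks the completed forward events newest-first and collects the
    // compensating event for each one that has a registration.
    List<String> planCompensation(List<String> completedForwardEvents) {
        List<String> plan = new ArrayList<>();
        for (int i = completedForwardEvents.size() - 1; i >= 0; i--) {
            lookup(completedForwardEvents.get(i))
                .ifPresent(c -> plan.add(c.compensatingEvent));
        }
        return plan;
    }
}
```

    With OrderCreated and StockReserved completed, the plan comes back as StockReleased followed by OrderCancelled. Events with no registration, such as OrderConfirmed, are simply skipped.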


    Saga State Tracking

    The SagaState class is an immutable state machine that tracks each saga instance:

    public class SagaState implements Serializable {
    
        private final String sagaId;
        private final String sagaType;         // "OrderFulfillment"
        private final String correlationId;
        private final SagaStatus status;       // STARTED --> IN_PROGRESS --> COMPLETED
        private final List<SagaStepRecord> steps;
        private final Instant startedAt;
        private final Instant deadline;        // Absolute timeout
    
        // Constructor and accessors omitted for brevity
    }
    

    Status transitions:

    STARTED --> IN_PROGRESS --> COMPLETED     (happy path)
    STARTED --> IN_PROGRESS --> COMPENSATING --> COMPENSATED  (failure + recovery)
    STARTED --> IN_PROGRESS --> TIMED_OUT --> COMPENSATING --> COMPENSATED  (timeout)
    STARTED --> IN_PROGRESS --> FAILED        (unrecoverable)
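
    One way to enforce these paths in code is a guard on the status enum. The sketch below assumes the four paths listed are exhaustive; the enum name SagaStatusSketch and the canMoveTo method are hypothetical and are not taken from the framework.

```java
// Hypothetical status enum with a transition guard derived from the
// four paths listed in the article.
enum SagaStatusSketch {
    STARTED, IN_PROGRESS, COMPLETED, COMPENSATING, COMPENSATED, TIMED_OUT, FAILED;

    // Returns true only if moving from this status to 'next' follows one of
    // the documented paths.
    boolean canMoveTo(SagaStatusSketch next) {
        switch (this) {
            case STARTED:
                return next == IN_PROGRESS;
            case IN_PROGRESS:
                return next == COMPLETED || next == COMPENSATING
                    || next == TIMED_OUT || next == FAILED;
            case TIMED_OUT:
                return next == COMPENSATING;
            case COMPENSATING:
                return next == COMPENSATED;
            default:
                // COMPLETED, COMPENSATED, and FAILED are treated as terminal.
                return false;
        }
    }
}
```

    A store could call canMoveTo before writing a new status and reject updates that would, for example, move a COMPLETED saga into COMPENSATING.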
    

    The state lives in a HazelcastSagaStateStore backed by an IMap on the shared cluster. Every service can read and update saga state because they all connect to the same shared cluster via the client instance.

    Each step gets recorded:

    public class SagaStepRecord implements Serializable {
    
        private final int stepNumber;
        private final String eventType;
        private final StepStatus status;    // PENDING, COMPLETED, FAILED, COMPENSATED
        private final Instant completedAt;
        private final String failureReason;
    }
    

    Timeout Handling

    Sagas can get stuck. A service might be down, a message might be lost, or a listener might throw an unhandled exception. Without timeout detection, a stuck saga hangs forever — stock reserved but never charged, or charged but never confirmed.

    The SagaTimeoutDetector is a scheduled service that runs in each service instance:

    @Component
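    public class SagaTimeoutDetector {
    
        // Injected collaborators (sagaStateStore, compensator, applicationEventPublisher,
        // sagaMetrics) and settings (maxBatchSize, autoCompensate) omitted for brevity
    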
        @Scheduled(fixedDelayString = "${saga.timeout.check-interval:5000}")
        public void detectTimeouts() {
            List<SagaState> timedOut = sagaStateStore.findTimedOutSagas(
                maxBatchSize
            );
    
            for (SagaState saga : timedOut) {
                sagaStateStore.markTimedOut(saga.getSagaId());
    
                if (autoCompensate) {
                    compensator.compensate(saga);
                }
    
                applicationEventPublisher.publishEvent(
                    new SagaTimedOutEvent(saga)
                );
    
                sagaMetrics.recordSagaTimedOut(saga.getSagaType());
            }
        }
    }
    

    With the default check interval, the detector runs again five seconds after the previous run finishes and looks for sagas that have blown past their deadline. When it finds one, it marks it timed out, triggers compensation for whatever steps already completed (if auto-compensate is enabled), publishes a Spring event for logging and alerting, and records the metric.
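
    The selection step behind findTimedOutSagas can be sketched in plain Java. This version filters an in-memory list rather than querying an IMap, and the Saga stand-in type, the string statuses, and the rule that only STARTED and IN_PROGRESS sagas are eligible are all assumptions made for illustration.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the timeout query.
final class TimeoutQuerySketch {

    // Minimal stand-in for SagaState with only the fields the query needs.
    static final class Saga {
        final String sagaId;
        final String status;
        final Instant deadline;

        Saga(String sagaId, String status, Instant deadline) {
            this.sagaId = sagaId;
            this.status = status;
            this.deadline = deadline;
        }
    }

    // Returns up to maxBatchSize sagas that are still active and whose
    // deadline is strictly before 'now'.
    static List<Saga> findTimedOut(List<Saga> all, Instant now, int maxBatchSize) {
        return all.stream()
            .filter(s -> s.status.equals("STARTED") || s.status.equals("IN_PROGRESS"))
            .filter(s -> s.deadline.isBefore(now))
            .limit(maxBatchSize)
            .collect(Collectors.toList());
    }
}
```

    Passing the clock in as a parameter keeps the check deterministic in tests. The batch limit means a large backlog of stuck sagas is worked off over several runs rather than in one long pass.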

    Timeout behavior is configurable per service:

    saga:
      timeout:
        enabled: true
        check-interval: 5000          # Check every 5 seconds
        default-deadline: 30000       # 30-second default timeout
        auto-compensate: true         # Auto-trigger compensation
        max-batch-size: 100           # Process up to 100 timeouts per check
        saga-types:
          OrderFulfillment: 60000     # 60 seconds for order fulfillment
    

    The Order Fulfillment saga gets 60 seconds — longer than the 30-second default — because it spans four services and includes payment processing, which can be slow. Choosing the right timeout value is harder than it sounds, and we have quite a bit more to say about that in the next article.
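
    The per-type override boils down to a lookup with a fallback. The following sketch shows how a saga’s absolute deadline might be derived from that configuration; the class name DeadlineResolverSketch and its method are hypothetical, and the framework may bind and apply these properties differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical helper that turns the timeout config into an absolute deadline.
final class DeadlineResolverSketch {

    private final Duration defaultDeadline;
    private final Map<String, Duration> perType;

    DeadlineResolverSketch(Duration defaultDeadline, Map<String, Duration> perType) {
        this.defaultDeadline = defaultDeadline;
        this.perType = perType;
    }

    // Uses the saga-type override when one is configured, else the default.
    Instant deadlineFor(String sagaType, Instant startedAt) {
        return startedAt.plus(perType.getOrDefault(sagaType, defaultDeadline));
    }
}
```

    With a 30-second default and a 60-second entry for OrderFulfillment, an order saga started at 12:00:00 would get a deadline of 12:01:00, while any saga type without an entry would get 12:00:30.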


    Monitoring Sagas

    The SagaMetrics class exposes counters and timers to Prometheus:

    Metric                              What It Tells You
    saga.started                        How many sagas are being created
    saga.completed                      How many succeed end-to-end
    saga.compensated                    How many required rollback
    saga.timedout                       How many exceeded their deadline
    saga.duration (p95)                 How long sagas are taking
    saga.compensation.duration (p95)    How long compensation takes

    The pre-provisioned Saga Dashboard in Grafana visualizes active saga counts, throughput rates by saga type, duration percentiles, success rates, timeout detection rates, and compensation breakdowns. Pre-configured alerts fire on high failure rates, timeouts, compensation failures, and success rates that drop below 90%.

    The dashboard is useful. But the metric that actually saved us during a sustained load incident was dead simple: saga.timedout plotted alongside saga.completed. When timeouts exceeded completions, we knew we had a systemic problem, not just a few slow sagas. More on that story next time.


    Design Decisions and Trade-offs

    Eventual Consistency

    Sagas provide eventual consistency, not immediate consistency. Between the time stock is reserved and payment is processed, the system is in a “pending” state. That’s intentional — the trade-off buys us service independence and availability. For our use case, the consistency window is sub-second. For domains where that’s not acceptable (financial settlement, medical records), you’d need stronger guarantees.

    Idempotency

    Saga listeners must be idempotent. ITopic delivery is at-most-once in Hazelcast, so duplicate deliveries from the topic are not the main risk. The bigger one is the timeout detector triggering compensation for a saga that was still processing, just slowly. If the original flow completes after compensation starts, the system has to handle the overlap gracefully.

    SagaState handles this with updateOrAddStep, which replaces any existing record with the same step number instead of appending a new one. A step can’t be recorded twice.
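
    The replace-by-step-number behavior can be sketched as follows. The class StepListSketch and its nested Step type are hypothetical stand-ins for SagaState and SagaStepRecord; only the method name updateOrAddStep comes from the framework, and its real signature may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable holder illustrating replace-by-step-number.
final class StepListSketch {

    static final class Step {
        final int stepNumber;
        final String eventType;
        final String status;

        Step(int stepNumber, String eventType, String status) {
            this.stepNumber = stepNumber;
            this.eventType = eventType;
            this.status = status;
        }
    }

    private final List<Step> steps;

    StepListSketch(List<Step> steps) {
        this.steps = Collections.unmodifiableList(new ArrayList<>(steps));
    }

    // Returns a new instance; the step with the same number is replaced,
    // otherwise the incoming step is appended.
    StepListSketch updateOrAddStep(Step incoming) {
        List<Step> next = new ArrayList<>();
        boolean replaced = false;
        for (Step existing : steps) {
            if (existing.stepNumber == incoming.stepNumber) {
                next.add(incoming);
                replaced = true;
            } else {
                next.add(existing);
            }
        }
        if (!replaced) {
            next.add(incoming);
        }
        return new StepListSketch(next);
    }

    List<Step> steps() {
        return steps;
    }
}
```

    Recording step 1 twice leaves a single entry for step 1 holding the most recent status, and the original instance is never mutated, which is what makes a repeated update harmless.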

    Compensation Is Not Rollback

    This is worth being explicit about, because the terminology invites confusion.

    Compensation undoes the business effect, not the technical state. When we “refund a payment,” we don’t delete the payment record. We create a new PaymentRefundedEvent that changes the payment status to REFUNDED. The event history preserves the full story: payment was processed, then refunded. If an auditor comes asking, the trail is complete.

    This fits naturally with event sourcing. The compensation is just another event. Nothing is erased, nothing is pretended away. The system records what actually happened, including the part where something went wrong and was corrected.
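
    A small sketch of that idea: the current payment status is derived by replaying the event history, and a refund only appends to it. The event name PaymentProcessedEvent and the statuses NONE and PROCESSED are assumptions for illustration; only PaymentRefundedEvent and REFUNDED appear in the framework as described here.

```java
import java.util.List;

// Hypothetical replay of a payment's event history.
final class PaymentHistorySketch {

    // Derives the current status by applying events in order. Nothing is
    // removed from the history; later events override earlier status.
    static String currentStatus(List<String> eventTypes) {
        String status = "NONE";
        for (String type : eventTypes) {
            if (type.equals("PaymentProcessedEvent")) {
                status = "PROCESSED";
            } else if (type.equals("PaymentRefundedEvent")) {
                status = "REFUNDED";
            }
        }
        return status;
    }
}
```

    Replaying a history of PaymentProcessedEvent followed by PaymentRefundedEvent yields REFUNDED, and both events remain in the list for anyone who needs the audit trail.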


    Summary

    The saga pattern solves distributed transaction coordination without distributed locks:

    Choreographed sagas fit naturally with event sourcing — each service reacts to events independently, no coordinator required. Orchestrated sagas earn their place when flows grow complex, need synchronous responses, or require per-step control. Compensation provides rollback-like semantics through new events, preserving the full history. Timeout detection catches stuck sagas. Saga state tracking gives you a complete audit trail.

    The result is a system where four independent services coordinate a complex business transaction without any service knowing about the others — they only know about events. The cost is that you’re managing a distributed state machine instead of a database transaction. But the alternative — distributed locks held across network calls while a coordinator decides everyone’s fate — is worse.


    Next up: Saga Timeouts — When Distributed Things Go Wrong

    Previous: Materialized Views for Fast Queries

    Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.