  • Saga Pattern: Distributed Transactions Without 2PC

    Part 4 in the “Building Event-Driven Microservices with Hazelcast” series


    Introduction

    In the first three articles, we built an event sourcing framework with a Jet pipeline and materialized views. Each service is self-contained — it owns its events, its views, and its data. And one of the things that makes event-driven architecture powerful is that event publishing is fire-and-forget. The publisher doesn’t know or care who’s receiving the messages, or whether anyone acted on them properly.

    But someone cares. Probably your boss.

    That decoupling is a feature, not a bug — it’s exactly what gives you the independence to evolve services separately. If you wanted to add a loyalty program based on orders placed, you wouldn’t touch a single line of existing code. You’d build the new functionality and subscribe it to OrderCreated events. Done. No coordination, no release trains, no “can you add a field to your response for us.”

    But that same independence becomes a problem when a business operation requires coordination. Placing an order in our eCommerce system spans four services:

    1. Order Service creates the order
    2. Inventory Service reserves stock
    3. Payment Service charges the customer
    4. Order Service confirms the order

    If payment fails after stock is reserved, we need to release the stock. If the inventory service is down, we need to cancel the order. The fire-and-forget publisher doesn’t know any of this happened — and nobody else is keeping track unless we build something that does.

    Now — a real eCommerce system wouldn’t cancel an order just because the inventory service hiccupped. You’d create a backorder, or queue the reservation for retry, or do whatever it takes to keep the customer’s money. Nobody in the business is going to say “yeah, let’s give up on that sale because a container restarted.” But we’re building a framework to demonstrate microservice patterns, not to compete with Shopify. The cancel-and-compensate flow shows sagas at their most illustrative: forward steps, failure detection, compensation, and state tracking across services. That’s the machinery we want to examine. So we went with the version that best shows how the pattern works, not the version that best sells laptops.

    This is the distributed transaction problem, and sagas are the solution.


    Why Sagas? Because the Alternative Is Worse.

    We didn’t adopt event sourcing because it was fashionable. We adopted it because the alternative — mutable state scattered across services with no audit trail — was worse. Sagas are the same kind of choice.

    When a business operation touches one database, you wrap it in a transaction. ACID gives you atomicity: either all the changes commit or none of them do. Every developer learns this in their first database course, and it works.

    But our order fulfillment touches four services, each with its own data. A single database transaction can’t span them. So you reach for two-phase commit — 2PC — the textbook answer for distributed transactions. And that’s where the trouble starts.

    2PC works by appointing a coordinator that asks each participant “can you commit?” and then, once everyone says yes, tells them all to go ahead. The problem is what happens between those two phases. Every participant is holding locks — on inventory rows, on payment records, on order state — while waiting for the coordinator’s decision. In a monolith, that wait is microseconds. Across a network, it’s milliseconds at best, and if the coordinator is slow or the network partitions, those locks can be held for seconds. Or longer.

    Think about what that means for throughput. While one order is sitting in that limbo state between “prepare” and “commit,” no other transaction can touch those same inventory rows. At 50 orders per second, you’ve got 50 sets of distributed locks competing for the same resources, each one waiting on network round-trips to a coordinator that is itself a single point of failure. If the coordinator crashes mid-protocol, those locks stay held until it recovers or someone intervenes manually. The participants are stuck: they’ve promised to commit but haven’t been told to, and they can’t unilaterally release the locks without risking inconsistency.

    It’s not that 2PC is theoretically wrong. It works fine in environments where all participants are on the same local network, latency is sub-millisecond, and the coordinator never fails. That environment is called “a single database server.” Once you’ve distributed beyond that, you’re fighting physics.

    Sagas take a fundamentally different approach. Instead of one big atomic transaction with distributed locks, you execute a sequence of local transactions — each one fast, each one touching only one service’s data, each one committing immediately. If step 3 fails, you don’t roll back steps 1 and 2 by releasing locks. You compensate — you issue new transactions that undo the business effect of the earlier steps. Reserve stock, then charge payment, then… payment fails? Issue a stock release. The saga doesn’t pretend the work never happened. It acknowledges what happened and fixes it forward.

    The trade-off is that you give up the clean atomicity of “all or nothing” and accept eventual consistency — a window of time where stock is reserved but payment hasn’t been attempted yet, or payment has been charged but the order isn’t confirmed. In our system, that window is typically under a second. For the user, the order is “processing.” For the system, the saga is in flight. And if something fails, compensation events fire and the system converges to a consistent state — just not instantaneously.


    Choreography and Orchestration

    There are two ways to coordinate a saga. They represent genuinely different architectural philosophies, and neither is universally right.

    Choreography: Services React to Events

    In a choreographed saga, there’s no coordinator. Each service listens for events that concern it and reacts:

    Order Service publishes OrderCreated -->
      Inventory Service hears it, reserves stock, publishes StockReserved -->
        Payment Service hears it, charges customer, publishes PaymentProcessed -->
          Order Service hears it, confirms order
    

    The flow emerges from the interactions between services. No single component knows the full sequence. Each service knows only “when I see event X, I do Y and publish Z.”
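    To make the “when I see event X, I do Y and publish Z” idea concrete, here is a minimal sketch that uses a hypothetical in-memory EventBus as a stand-in for the real topics. It is not the Hazelcast API or the framework’s code; the class and method names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for an event bus; not the Hazelcast ITopic API.
class ChoreographyDemo {

    static final class EventBus {
        // Maps an event type to the handlers subscribed to it.
        private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
        // Records every published event type, in order.
        private final List<String> log = new ArrayList<>();

        void subscribe(String eventType, Consumer<String> handler) {
            subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
        }

        void publish(String eventType, String orderId) {
            log.add(eventType);
            for (Consumer<String> handler : subscribers.getOrDefault(eventType, List.of())) {
                handler.accept(orderId);
            }
        }

        List<String> log() {
            return List.copyOf(log);
        }
    }

    static List<String> run() {
        EventBus bus = new EventBus();
        // Each "service" knows only its own rule: on event X, do the work and publish Z.
        bus.subscribe("OrderCreated", id -> bus.publish("StockReserved", id));       // Inventory
        bus.subscribe("StockReserved", id -> bus.publish("PaymentProcessed", id));   // Payment
        bus.subscribe("PaymentProcessed", id -> bus.publish("OrderConfirmed", id));  // Order
        bus.publish("OrderCreated", "order-1");
        return bus.log();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

    No subscriber refers to any other subscriber. The four-step sequence exists only as the sum of three independent rules.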

    Orchestration: A Central Controller Directs the Flow

    In an orchestrated saga, one component — the orchestrator — knows the whole sequence and directs each step:

    Orchestrator --> "Order Service, create order"
    Orchestrator --> "Inventory Service, reserve stock"
    Orchestrator --> "Payment Service, charge customer"
    Orchestrator --> "Order Service, confirm order"
    

    The orchestrator is a state machine. It tracks where the saga is, what comes next, and what to undo if something fails.
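    A controller of this kind can be sketched in a few lines. This is a simplified, synchronous illustration under our own assumptions (the Step record, the run method, and the log format are hypothetical), not the framework’s orchestrator, which is covered in a later article.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical, synchronous orchestrator sketch; names are illustrative only.
class OrchestratorDemo {

    // A forward action plus the action that undoes its business effect.
    record Step(String name, BooleanSupplier action, Runnable compensation) {}

    enum Outcome { COMPLETED, COMPENSATED }

    static Outcome run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();   // tracks where the saga is
        for (Step step : steps) {
            if (step.action().getAsBoolean()) {
                log.add(step.name());
                completed.push(step);
            } else {
                log.add(step.name() + " FAILED");
                // Undo completed steps in reverse order (most recent first).
                while (!completed.isEmpty()) {
                    Step done = completed.pop();
                    done.compensation().run();
                    log.add("undo " + done.name());
                }
                return Outcome.COMPENSATED;
            }
        }
        return Outcome.COMPLETED;
    }

    static List<String> demo(boolean paymentSucceeds) {
        List<String> log = new ArrayList<>();
        List<Step> steps = List.of(
            new Step("create order", () -> true, () -> {}),
            new Step("reserve stock", () -> true, () -> {}),
            new Step("charge customer", () -> paymentSucceeds, () -> {}),
            new Step("confirm order", () -> true, () -> {}));
        run(steps, log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo(true));
        System.out.println(demo(false));
    }
}
```

    The whole flow, including what to undo, is readable in one place. That is the property the next section weighs against the extra coupling.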

    Choosing Between Them

    Choreography is the simpler choice when your saga is a linear chain — A triggers B triggers C triggers D — and the services are already publishing events for other reasons. If your system is event-driven (ours is), choreographed sagas are almost free. The services are already emitting events. The saga is just another consumer. No new infrastructure, no single point of failure, and each service stays fully independent.

    But choreography gets painful as the flow gets complex. Say step 2 needs to branch — charge a credit card or apply store credit, depending on the payment method. Now you’re encoding conditional logic across multiple listeners that can’t see each other. If someone asks “what does this saga actually do, end to end?” the answer is “go read three listener classes in three different services and piece it together.” For a four-step linear chain, fine. For a ten-step flow with branches and conditional skips, that’s a mess.

    That’s where orchestration earns its keep. The entire saga — forward steps, compensation steps, timeouts, retry logic — lives in a single definition file. You read it top to bottom. You can add per-step timeouts, per-step retries. The caller can wait for the result synchronously. The cost is coupling: the orchestrator needs to know about every service’s endpoint, and it becomes a component you have to keep running.

    A rough rule of thumb:

    • Start with choreography when the saga is a linear chain with fewer than about five steps, the services already publish events, and you don’t need a synchronous response.
    • Move to orchestration when you need branching or conditional logic, per-step timeout and retry control, a synchronous success/failure response for the caller, or when the saga has gotten complex enough that nobody can trace it across listeners without a whiteboard.

    We started with choreography for order fulfillment because it’s a clean four-step chain and our entire architecture is already built around events. Later, when we needed synchronous responses for certain API paths and wanted per-step timeout control, we added orchestration as a second option. Both patterns coexist in the framework, running against the same saga state store. We’ll cover the orchestration implementation and how we run both side by side in a later article.


    The Order Fulfillment Saga

    Here’s the full flow for the choreographed version:

    [Figure: Order Fulfillment Saga happy path. Four services emit OrderCreated, StockReserved, PaymentProcessed, and OrderConfirmed in sequence, with saga state transitioning from STARTED through IN_PROGRESS to COMPLETED.]

    Event Context Propagation

    Every saga event carries metadata that links the chain together:

    // These fields are present on every saga event
    String sagaId;         // Links all events in one saga instance
    String correlationId;  // Links to the original API request
    String sagaType;       // "OrderFulfillment"
    int stepNumber;        // Which step (0, 1, 2, 3)
    boolean isCompensating; // True for compensation events
    

    The sagaId is generated when the order is created and propagated through every subsequent event. That’s how the Payment Service knows which saga a StockReserved event belongs to, and how it gets the payment details (amount, currency, method) from the StockReserved event’s context fields.
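    One way to picture the propagation is as a small immutable value that each service copies forward. The sketch below is an assumption for illustration: the SagaContext record and its methods are hypothetical, and only the five field names come from the listing above.

```java
import java.util.UUID;

// Hypothetical carrier for the saga metadata fields; not the framework's event class.
class SagaContextDemo {

    record SagaContext(String sagaId, String correlationId, String sagaType,
                       int stepNumber, boolean isCompensating) {

        // Called once, when the order is created.
        static SagaContext start(String sagaType, String correlationId) {
            return new SagaContext(UUID.randomUUID().toString(), correlationId,
                                   sagaType, 0, false);
        }

        // Context for the next forward event: same IDs, next step number.
        SagaContext nextStep() {
            return new SagaContext(sagaId, correlationId, sagaType, stepNumber + 1, false);
        }

        // Context for a compensation event that undoes the current step.
        SagaContext compensating() {
            return new SagaContext(sagaId, correlationId, sagaType, stepNumber, true);
        }
    }

    public static void main(String[] args) {
        SagaContext created = SagaContext.start("OrderFulfillment", "req-42");
        SagaContext reserved = created.nextStep();
        System.out.println(created);
        System.out.println(reserved);
        System.out.println(reserved.compensating());
    }
}
```

    Because the IDs never change after start(), any service can tie an incoming event back to its saga without asking anyone else.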


    Implementing Saga Listeners

    Each service has a saga listener that subscribes to relevant events via Hazelcast ITopic. Here’s the Payment Service’s:

    @Component
    public class PaymentSagaListener {
    
        public PaymentSagaListener(
                @Qualifier("hazelcastClient") HazelcastInstance hazelcast,
                PaymentService paymentService,
                SagaStateStore sagaStateStore) {
    
            // Listen for StockReserved --> process payment
            ITopic<GenericRecord> stockReservedTopic = hazelcast.getTopic("StockReserved");
            stockReservedTopic.addMessageListener(message -> {
                GenericRecord event = message.getMessageObject();
                String sagaId = event.getString("sagaId");
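                String orderId = event.getString("orderId");
                // Assumption: the event also carries the customer ID under this field name
                String customerId = event.getString("customerId");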
                String amount = event.getString("paymentAmount");
                String currency = event.getString("paymentCurrency");
                String method = event.getString("paymentMethod");
    
                paymentService.processPaymentForOrder(
                    orderId, customerId, amount, currency, method,
                    sagaId, event.getString("correlationId")
                );
            });
    
            // Listen for PaymentRefundRequested --> process refund
            ITopic<GenericRecord> refundTopic = hazelcast.getTopic("PaymentRefundRequested");
            refundTopic.addMessageListener(message -> {
                GenericRecord event = message.getMessageObject();
                String paymentId = event.getString("paymentId");
                String sagaId = event.getString("sagaId");
    
                paymentService.refundPaymentForSaga(
                    paymentId, "Saga compensation", sagaId,
                    event.getString("correlationId")
                );
            });
        }
    }
    

    The listener uses @Qualifier("hazelcastClient"), so it connects to the shared cluster, not the embedded instance. That’s the dual-instance architecture from Part 2. Each listener is a plain @Component that Spring creates on startup; the ITopic subscription stays active for the life of the service. And the listener itself doesn’t contain business logic: it unpacks the event, delegates to the service layer, and gets out of the way.


    Compensation: Undoing Work

    When a step fails, we need to undo the work completed by previous steps. This is compensation — often described as the saga equivalent of a rollback, though it isn’t really a rollback at all. More on that in a minute.

    Compensation Flow: Payment Fails

    [Figure: Order Fulfillment Saga compensation flow. Payment fails at step 3, triggering reverse-order compensation events PaymentRefundRequested, StockReleased, and OrderCancelled, ending in the COMPENSATED state.]

    Compensation Flow: Stock Unavailable

    StockReservationFailed event
             |
             '---> Order Service
                   '-- Cancels order (status: CANCELLED)
                   '-- No stock release needed (nothing was reserved)
    
             Saga finalized as COMPENSATED
    

    The Compensation Registry

    A CompensationRegistry maps each forward event to its compensating event and responsible service:

    @Configuration
    public class ECommerceCompensationConfig {
    
        @Bean
        public CompensationRegistrar ecommerceCompensations(CompensationRegistry registry) {
            // Step 0: OrderCreated --> compensate with OrderCancelled
            registry.register("OrderCreated", "OrderCancelled", "order-service");
    
            // Step 1: StockReserved --> compensate with StockReleased
            registry.register("StockReserved", "StockReleased", "inventory-service");
    
            // Step 2: PaymentProcessed --> compensate with PaymentRefunded
            registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");
    
            // Step 3: OrderConfirmed --> no compensation (terminal success)
            return () -> {};
        }
    }
    

    This is the one place that documents the entire saga structure. Even though the execution is distributed across listeners, the registry makes the step-to-compensation mapping readable in a single file. That’s something people miss about choreography — it doesn’t mean the structure is undocumented, it means the structure is declared separately from the execution.
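    To show how such a mapping can drive compensation, here is a minimal sketch of a registry with a lookup that returns the compensations for the completed steps in reverse order. Only the register(forward, compensating, service) call shape mirrors the configuration above. The Registry class, the planFor method, and the Compensation record are hypothetical, not the framework’s implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry sketch; not the framework's CompensationRegistry.
class CompensationRegistryDemo {

    record Compensation(String compensatingEvent, String service) {}

    static final class Registry {
        private final Map<String, Compensation> byForwardEvent = new LinkedHashMap<>();

        void register(String forwardEvent, String compensatingEvent, String service) {
            byForwardEvent.put(forwardEvent, new Compensation(compensatingEvent, service));
        }

        // Compensations for the completed forward events, most recent first.
        List<Compensation> planFor(List<String> completedEvents) {
            List<Compensation> plan = new ArrayList<>();
            for (int i = completedEvents.size() - 1; i >= 0; i--) {
                Compensation c = byForwardEvent.get(completedEvents.get(i));
                if (c != null) {          // events without a registered compensation are skipped
                    plan.add(c);
                }
            }
            return plan;
        }
    }

    static List<Compensation> demo() {
        Registry registry = new Registry();
        registry.register("OrderCreated", "OrderCancelled", "order-service");
        registry.register("StockReserved", "StockReleased", "inventory-service");
        registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");
        // Payment failed, so only the first two steps completed.
        return registry.planFor(List.of("OrderCreated", "StockReserved"));
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

    With payment failing at the third step, the plan is to release the stock first and cancel the order second. No refund appears because PaymentProcessed never completed.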


    Saga State Tracking

    The SagaState class is an immutable state machine that tracks each saga instance:

    public class SagaState implements Serializable {
    
        private final String sagaId;
        private final String sagaType;         // "OrderFulfillment"
        private final String correlationId;
        private final SagaStatus status;       // STARTED --> IN_PROGRESS --> COMPLETED
        private final List<SagaStepRecord> steps;
        private final Instant startedAt;
        private final Instant deadline;        // Absolute timeout
    }
    

    Status transitions:

    STARTED --> IN_PROGRESS --> COMPLETED     (happy path)
    STARTED --> IN_PROGRESS --> COMPENSATING --> COMPENSATED  (failure + recovery)
    STARTED --> IN_PROGRESS --> TIMED_OUT --> COMPENSATING --> COMPENSATED  (timeout)
    STARTED --> IN_PROGRESS --> FAILED        (unrecoverable)
    

    The state lives in a HazelcastSagaStateStore backed by an IMap on the shared cluster. Every service can read and update saga state because they all connect to the same shared cluster via the client instance.
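    The status transitions listed above can also be written down as a lookup table, which makes illegal moves easy to reject. This is a sketch under our own assumptions: the enum constants are taken from the transition list, but the ALLOWED map and canTransition method are hypothetical and may not match how the framework enforces transitions.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical transition table built from the four paths listed in the article.
class SagaStatusDemo {

    enum SagaStatus {
        STARTED, IN_PROGRESS, COMPLETED, COMPENSATING, COMPENSATED, TIMED_OUT, FAILED
    }

    // Statuses missing from the map (COMPLETED, COMPENSATED, FAILED) are terminal.
    static final Map<SagaStatus, Set<SagaStatus>> ALLOWED = Map.of(
        SagaStatus.STARTED, EnumSet.of(SagaStatus.IN_PROGRESS),
        SagaStatus.IN_PROGRESS, EnumSet.of(SagaStatus.COMPLETED, SagaStatus.COMPENSATING,
                                           SagaStatus.TIMED_OUT, SagaStatus.FAILED),
        SagaStatus.TIMED_OUT, EnumSet.of(SagaStatus.COMPENSATING),
        SagaStatus.COMPENSATING, EnumSet.of(SagaStatus.COMPENSATED));

    static boolean canTransition(SagaStatus from, SagaStatus to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(SagaStatus.IN_PROGRESS, SagaStatus.TIMED_OUT));
        System.out.println(canTransition(SagaStatus.COMPLETED, SagaStatus.COMPENSATING));
    }
}
```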

    Each step gets recorded:

    public class SagaStepRecord implements Serializable {
    
        private final int stepNumber;
        private final String eventType;
        private final StepStatus status;    // PENDING, COMPLETED, FAILED, COMPENSATED
        private final Instant completedAt;
        private final String failureReason;
    }
    

    Timeout Handling

    Sagas can get stuck. A service might be down, a message might be lost, or a listener might throw an unhandled exception. Without timeout detection, a stuck saga hangs forever — stock reserved but never charged, or charged but never confirmed.

    The SagaTimeoutDetector is a scheduled service that runs in each service instance:

    @Component
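    public class SagaTimeoutDetector {
    
        // sagaStateStore, compensator, applicationEventPublisher, sagaMetrics, and the
        // maxBatchSize / autoCompensate settings are injected fields, omitted here for brevity.
    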
        @Scheduled(fixedDelayString = "${saga.timeout.check-interval:5000}")
        public void detectTimeouts() {
            List<SagaState> timedOut = sagaStateStore.findTimedOutSagas(
                maxBatchSize
            );
    
            for (SagaState saga : timedOut) {
                sagaStateStore.markTimedOut(saga.getSagaId());
    
                if (autoCompensate) {
                    compensator.compensate(saga);
                }
    
                applicationEventPublisher.publishEvent(
                    new SagaTimedOutEvent(saga)
                );
    
                sagaMetrics.recordSagaTimedOut(saga.getSagaType());
            }
        }
    }
    

    Every five seconds, it looks for sagas that have blown past their deadline. When it finds one, it marks it timed out, triggers compensation for whatever steps already completed, publishes a Spring event for logging and alerting, and records the metric.
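    The “look for sagas that have blown past their deadline” step is essentially a filtered, bounded scan. Here is a minimal in-memory sketch of that query. The Saga record and findTimedOut method are hypothetical, and the real store presumably queries the shared IMap instead of a list. Treating only STARTED and IN_PROGRESS sagas as candidates is also our assumption.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory version of a timed-out-saga query; not the framework's store.
class TimeoutScanDemo {

    enum Status { STARTED, IN_PROGRESS, COMPLETED, COMPENSATING, COMPENSATED, TIMED_OUT, FAILED }

    record Saga(String sagaId, Status status, Instant deadline) {}

    // In-flight sagas whose deadline has passed, oldest deadline first, capped at maxBatchSize.
    static List<Saga> findTimedOut(List<Saga> all, Instant now, int maxBatchSize) {
        return all.stream()
            .filter(s -> s.status() == Status.STARTED || s.status() == Status.IN_PROGRESS)
            .filter(s -> s.deadline().isBefore(now))
            .sorted(Comparator.comparing(Saga::deadline))
            .limit(maxBatchSize)
            .toList();
    }

    static List<String> demo(int maxBatchSize) {
        Instant now = Instant.parse("2024-01-01T00:01:00Z");
        List<Saga> sagas = List.of(
            new Saga("a", Status.IN_PROGRESS, Instant.parse("2024-01-01T00:00:30Z")),
            new Saga("b", Status.IN_PROGRESS, Instant.parse("2024-01-01T00:02:00Z")),
            new Saga("c", Status.COMPLETED, Instant.parse("2024-01-01T00:00:10Z")),
            new Saga("d", Status.STARTED, Instant.parse("2024-01-01T00:00:20Z")));
        return findTimedOut(sagas, now, maxBatchSize).stream().map(Saga::sagaId).toList();
    }

    public static void main(String[] args) {
        System.out.println(demo(100));
    }
}
```

    The batch cap matters under load: if thousands of sagas time out at once, each check handles a bounded slice and leaves the rest for the next run.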

    Timeout behavior is configurable per service:

    saga:
      timeout:
        enabled: true
        check-interval: 5000          # Check every 5 seconds
        default-deadline: 30000       # 30-second default timeout
        auto-compensate: true         # Auto-trigger compensation
        max-batch-size: 100           # Process up to 100 timeouts per check
        saga-types:
          OrderFulfillment: 60000     # 60 seconds for order fulfillment
    

    The Order Fulfillment saga gets 60 seconds — longer than the 30-second default — because it spans four services and includes payment processing, which can be slow. Choosing the right timeout value is harder than it sounds, and we have quite a bit more to say about that in the next article.
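    The per-type override boils down to a lookup with a fallback. Here is a small sketch, assuming a hypothetical TimeoutConfig record that holds the default-deadline and saga-types values from the YAML above; the framework’s actual properties class may look different.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical holder for the timeout settings; not the framework's properties class.
class DeadlineDemo {

    record TimeoutConfig(Duration defaultDeadline, Map<String, Duration> sagaTypes) {

        // Absolute deadline: start time plus the type-specific timeout, or the default.
        Instant deadlineFor(String sagaType, Instant startedAt) {
            return startedAt.plus(sagaTypes.getOrDefault(sagaType, defaultDeadline));
        }
    }

    static long secondsUntilDeadline(String sagaType) {
        TimeoutConfig config = new TimeoutConfig(
            Duration.ofMillis(30_000),
            Map.of("OrderFulfillment", Duration.ofMillis(60_000)));
        Instant startedAt = Instant.parse("2024-01-01T00:00:00Z");
        return Duration.between(startedAt, config.deadlineFor(sagaType, startedAt)).toSeconds();
    }

    public static void main(String[] args) {
        System.out.println(secondsUntilDeadline("OrderFulfillment"));
        System.out.println(secondsUntilDeadline("SomeOtherSaga"));
    }
}
```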


    Monitoring Sagas

    The SagaMetrics class exposes counters and timers to Prometheus:

    Metric                             What It Tells You
    ---------------------------------  --------------------------------
    saga.started                       How many sagas are being created
    saga.completed                     How many succeed end-to-end
    saga.compensated                   How many required rollback
    saga.timedout                      How many exceeded their deadline
    saga.duration (p95)                How long sagas are taking
    saga.compensation.duration (p95)   How long compensation takes

    The pre-provisioned Saga Dashboard in Grafana visualizes active saga counts, throughput rates by saga type, duration percentiles, success rates, timeout detection rates, and compensation breakdowns. Pre-configured alerts fire on high failure rates, timeouts, compensation failures, and success rate drops below 90%.

    The dashboard is useful. But the metric that actually saved us during a sustained load incident was dead simple: saga.timedout plotted alongside saga.completed. When timeouts exceeded completions, we knew we had a systemic problem, not just a few slow sagas. More on that story next time.


    Design Decisions and Trade-offs

    Eventual Consistency

    Sagas provide eventual consistency, not immediate consistency. Between the time stock is reserved and payment is processed, the system is in a “pending” state. That’s intentional — the trade-off buys us service independence and availability. For our use case, the consistency window is sub-second. For domains where that’s not acceptable (financial settlement, medical records), you’d need stronger guarantees.

    Idempotency

    Saga listeners must be idempotent. ITopic delivery is at-most-once in Hazelcast, but the timeout detector might trigger compensation for a saga that was actually processing — just slowly. If the original flow completes after compensation starts, the system has to handle the overlap gracefully.

    SagaState handles this with updateOrAddStep, which replaces by step number. A step can’t be recorded twice.
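    A replace-by-step-number update on an immutable state object might look roughly like this. It is a sketch under our own assumptions: the method name updateOrAddStep comes from the text above, but the simplified records and the logic shown are hypothetical, not the framework’s source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified versions of the state classes; not the framework's code.
class IdempotentStepDemo {

    record StepRecord(int stepNumber, String eventType, String status) {}

    // Immutable: every update returns a new instance.
    record SagaState(String sagaId, List<StepRecord> steps) {

        SagaState {
            steps = List.copyOf(steps);   // defensive, unmodifiable copy
        }

        // Replaces the record with the same step number, or appends if none exists.
        SagaState updateOrAddStep(StepRecord step) {
            List<StepRecord> updated = new ArrayList<>();
            boolean replaced = false;
            for (StepRecord existing : steps) {
                if (existing.stepNumber() == step.stepNumber()) {
                    updated.add(step);
                    replaced = true;
                } else {
                    updated.add(existing);
                }
            }
            if (!replaced) {
                updated.add(step);
            }
            return new SagaState(sagaId, updated);
        }
    }

    static SagaState demo() {
        StepRecord reserved = new StepRecord(1, "StockReserved", "COMPLETED");
        return new SagaState("saga-1", List.of())
            .updateOrAddStep(reserved)
            .updateOrAddStep(reserved)   // duplicate delivery: no second entry
            .updateOrAddStep(new StepRecord(1, "StockReserved", "COMPENSATED"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    Applying the same step twice leaves one entry, and a later update for the same step number overwrites the earlier one instead of piling up.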

    Compensation Is Not Rollback

    This is worth being explicit about, because the terminology invites confusion.

    Compensation undoes the business effect, not the technical state. When we “refund a payment,” we don’t delete the payment record. We create a new PaymentRefundedEvent that changes the payment status to REFUNDED. The event history preserves the full story: payment was processed, then refunded. If an auditor comes asking, the trail is complete.

    This fits naturally with event sourcing. The compensation is just another event. Nothing is erased, nothing is pretended away. The system records what actually happened, including the part where something went wrong and was corrected.
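    A tiny sketch makes the distinction visible: the current status is derived by replaying the history, and the refund is one more entry in it. The PaymentEvent record, the PaymentStatus enum, and the statusOf method are hypothetical, and the event type strings are shortened for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event-sourced payment view; not the framework's classes.
class PaymentHistoryDemo {

    enum PaymentStatus { NONE, PROCESSED, REFUNDED }

    record PaymentEvent(String type, String paymentId) {}

    // Current status is derived by replaying the history; nothing is deleted.
    static PaymentStatus statusOf(List<PaymentEvent> history) {
        PaymentStatus status = PaymentStatus.NONE;
        for (PaymentEvent e : history) {
            switch (e.type()) {
                case "PaymentProcessed" -> status = PaymentStatus.PROCESSED;
                case "PaymentRefunded" -> status = PaymentStatus.REFUNDED;
                default -> { }   // unrelated events do not change the status
            }
        }
        return status;
    }

    static List<PaymentEvent> compensatedHistory() {
        List<PaymentEvent> history = new ArrayList<>();
        history.add(new PaymentEvent("PaymentProcessed", "pay-1"));
        // Compensation appends a new event instead of removing the old one.
        history.add(new PaymentEvent("PaymentRefunded", "pay-1"));
        return history;
    }

    public static void main(String[] args) {
        List<PaymentEvent> history = compensatedHistory();
        System.out.println(history.size() + " events, status " + statusOf(history));
    }
}
```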


    Summary

    The saga pattern solves distributed transaction coordination without distributed locks:

    • Choreographed sagas fit naturally with event sourcing: each service reacts to events independently, no coordinator required.
    • Orchestrated sagas earn their place when flows grow complex, need synchronous responses, or require per-step control.
    • Compensation provides rollback-like semantics through new events, preserving the full history.
    • Timeout detection catches stuck sagas.
    • Saga state tracking gives you a complete audit trail.

    The result is a system where four independent services coordinate a complex business transaction without any service knowing about the others — they only know about events. The cost is that you’re managing a distributed state machine instead of a database transaction. But the alternative — distributed locks held across network calls while a coordinator decides everyone’s fate — is worse.


    Next up: Saga Timeouts — When Distributed Things Go Wrong

    Previous: Materialized Views for Fast Queries

    Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.
  • Launching a Claude Code Project: Design Before You Build

    I used Claude’s desktop interface for iterative design, then handed off to Claude Code for implementation.


    After deciding to revive my Hazelcast Microservices Framework (MSF) project, and to let Claude do much of the heavy lifting, I had to figure out how to actually do it. I had no playbook. Nobody does, really; we’re all making this up as we go.

    I wanted to be transparent about my use of Claude, and at the same time I think the development process is interesting enough to be worthy of discussion. (Heck, maybe it’s more interesting than the framework blog posts I set out to write.) So I expect to end up with a dual series of blog posts: the framework posts — started by Claude, co-edited together, and given a final polish by me — interleaved with my observations on how the collaboration effort worked.

    This first “behind the scenes” post covers the design phase: going from a vague idea to a set of design documents and an implementation plan, all before writing a single line of code.

    Starting the Conversation

    Here was my original prompt to Claude:

    I want to use Claude Code to help me finish a demonstration project I started some time ago to show how to implement microservices using Hazelcast. (The main value of Hazelcast is to create materialized views of domain objects to maintain in-memory current state.) If it’s more effective, we can restart with a blank sheet rather than modify the existing project. I’d really like to iterate over the design several times before any coding starts — is that best done in Claude Code, or using this desktop interface? Ideally, creating various specifications or design documents before any coding starts would be perfect, if Claude can use these various documents as a guide to the coding process. How do we start?

    Claude immediately suggested splitting the work across two interfaces: use the desktop/web interface for design discussions and document creation, then move to Claude Code for implementation. Made sense to me — the conversational interface is better for back-and-forth design iteration, while Claude Code excels at multi-file code generation with direct access to the project directory.

    This turned out to be excellent advice. The design phase involved a lot of “what about this?” and “actually, let’s reorganize that” — the kind of exploratory conversation that works much better in a chat interface than in a code-focused tool. I tried doing some design work in Claude Code early on and it was noticeably worse — like trying to brainstorm on a whiteboard that keeps trying to compile your diagrams.

    The Design Phase: A Roadmap in Nine Documents

    What followed was an extended design conversation that produced nine documents over the course of a single session. I’m not going to walk through every one in detail — you can follow the links if you’re curious — but a few of them are worth talking about because of what they reveal about the collaboration process.

    Getting Started: Template and Domain

    Claude’s first move was to produce a comprehensive design document template covering everything from executive summary to demonstration scenarios. We never actually completed it — the conversation quickly moved in a more specific direction — but it served its purpose as a structural starting point. The architectural equivalent of a napkin sketch: useful for getting the conversation going, not meant to survive contact with reality.

    Before we could fill in any template, though, we needed to pick a domain for the demonstration. Claude laid out a comparison between eCommerce and Financial Services, and we settled on a hybrid approach: start with eCommerce (universally understood, clear event flows, and I had existing code to reference) but design the framework to be domain-agnostic so other domains could be plugged in later. We also simplified from four services down to three: Account, Inventory, and Order. (A fourth service, Payment, showed up later when we built out the saga patterns. Scope creep, but the useful kind.)

    That decision led to the eCommerce design document — a detailed Phase 1 design covering all three services, their APIs, events, and materialized views. Three view patterns came out of it: denormalized views (joining customer, product, and order data), aggregation views (pre-computing order statistics), and real-time status views (current inventory levels). If you’ve read the previous posts in this series, you’ll recognize these as exactly the kind of thing that makes Event Sourcing + CQRS worth the effort.

    Where I Pushed Back

    The conversation then turned to longer-term goals. I had ideas for observability dashboards, microbenchmarking, pluggable implementations, saga patterns, and more — far beyond what could fit in a Phase 1. Claude organized all of this into a phased requirements document spanning five phases.

    We iterated over this several times, adding and reorganizing. The most significant change I made was moving Event Sourcing from Phase 2 to Phase 1. Claude had initially positioned it as an advanced feature, but I saw it as the fundamental organizing principle of the entire framework — events are the source of truth, not database rows. Once I explained my existing Hazelcast Jet pipeline architecture (where handleEvent() writes to a PendingEvents map, which triggers a Jet pipeline that persists to the EventStore, updates materialized views, and publishes to the event bus), Claude immediately agreed and restructured the phases accordingly.

    This was one of the more interesting moments in the collaboration. Claude had made a reasonable default assumption about complexity ordering, but I had domain-specific knowledge about how the architecture should actually work. The back-and-forth was natural — I explained my reasoning, Claude incorporated it, and the result was better for it. If I’d just accepted the initial phasing without pushing back, the entire project would have been organized around a less coherent architecture. And honestly, I almost did just accept it. It looked reasonable. Sometimes the most important contribution you make is going “wait, actually…” when the first answer seems fine.

    Other additions during this iteration:

    • Vector Store integration (Phase 3, optional) for product similarity search
    • An MCP Server (Phase 3) to let AI assistants query and operate the system
    • Open source mandate — everything in Phases 1-2 must run on Hazelcast Community Edition
    • Blog post series structure — features developed in blog-post-sized chunks

    Architecture, Code Review, and the Rewrite Decision

    The next few documents came quickly. The Event Sourcing discussion led to a dedicated architecture document detailing the Jet pipeline design — based heavily on my existing implementation, but now formally documented with all six pipeline stages, the EventStore design, and how event replay would work.

    Then I uploaded several key source files from the original project for Claude to review: the EventSourcingController, DomainObject, SourcedEvent (later renamed to DomainEvent), EventStore, and EventSourcingPipeline. Claude produced a thorough code review comparing the existing code against the design documents. The verdict was encouraging — the core implementation was solid and matched the Phase 1 design almost perfectly. Claude recommended incremental enhancement: add correlation IDs, framework abstractions, observability, and tests on top of what was already there.

    I went the other way. After thinking about the package naming, dependency versions, and scope of changes needed, I decided on a clean reimplementation using the existing code as a blueprint. This let us start with the right project structure, package names (com.theyawns.framework.*), and dependency versions (Spring Boot 3.2.x, Hazelcast 5.6.0) from the beginning rather than refactoring them in later. Sometimes — as I’d noted in the previous post — the right move is to stop patching the old cabinets and start fresh.

    I won’t pretend this was a purely rational decision. Part of it was just wanting that clean-slate feeling — new project, new structure, no legacy cruft staring at me from the imports. Developers love a greenfield. We can’t help it.

    The Implementation Plan

    Once the architecture was validated and we’d agreed on the approach, Claude created a detailed Phase 1 implementation plan — a three-week, day-by-day schedule with code templates, success criteria, and task checklists:

    • Week 1: Framework core — Maven multi-module setup, core abstractions, event sourcing controller, Jet pipeline
    • Week 2: Three eCommerce services — Account, Inventory, Order with REST APIs and materialized views
    • Week 3: Integration, Docker Compose, documentation, demo scenarios

    We made a few tweaks (updating Hazelcast from 5.4.0 to 5.6.0, for instance), and then it was time to move to code.

    The Handoff to Claude Code

    Claude provided specific instructions for transitioning to Claude Code, including a context block to paste when starting the session:

    I'm building a Hazelcast-based event sourcing microservices framework.
    
    Project location: hazelcast-microservices-framework/
    Current state: Design documents complete, ready for implementation
    
    Key decisions:
    - Clean reimplementation (no existing code to port)
    - Spring Boot 3.2.x + Hazelcast 5.6.0 Community Edition
    - Package: com.theyawns.framework.*
    - Three services: Account, Inventory, Order (eCommerce domain)
    - Event sourcing with Hazelcast Jet pipeline
    - REST APIs only
    
    Implementation plan: docs/implementation/phase1-implementation-plan.md
    
    Starting with Day 1: Maven project setup + core abstractions
    
    Please read the implementation plan and let's begin.

    That’s the whole point of the “design first” approach: you’re not asking the AI to guess at your architecture. You’re handing it a blueprint. The more detailed the blueprint, the less time you spend arguing about load-bearing walls later.

    Documents 7-9: Claude Code Configuration

    Before making the jump, I asked Claude about setup suggestions for Claude Code. This produced three more documents:

    CLAUDE.md (originally called .clinerules — I’m still not sure where that name came from) is the main configuration file that Claude Code reads automatically. It defines code standards, patterns, pitfalls to avoid, and documentation requirements. This file evolved a lot over the course of the project; looking at the commit history gives a good sense of how the “rules” grew and adapted as we ran into new situations. (More on that in a future post — it turned out to be one of the more interesting aspects of the whole process.)

    claude-code-agents.md defined eight specialized agent personas — Framework Developer, Service Developer, Test Writer, Documentation Writer, Pipeline Specialist, and others — each with specific rules, code patterns, and checklists. The idea was to switch between personas depending on the task at hand (e.g., “Switch to Test Writer agent. Write comprehensive tests for EventSourcingController.”). Whether this actually helped or was just a placebo is something I’m still not sure about, honestly.

    A docs organization guide rounded out the set, providing a recommended directory structure for keeping all the documentation organized as the project grew.

    What Came Next

    The resulting project grew well beyond the original three-week Phase 1 plan. At 150 commits, it now includes four microservices (Payment was added for saga demonstrations), an API Gateway, an MCP server for AI integration, choreographed and orchestrated saga patterns, PostgreSQL persistence, Grafana dashboards, and more. The three-week plan took considerably longer than three weeks. So it goes.

    But all of that implementation work — and the interesting stories about how human-AI collaboration played out during coding — is material for future posts.

    What I’d Do Differently (And What I’d Do Again)

    If you’re thinking about using AI for a non-trivial coding project, here’s what I took away from the design phase.

    Use the right tool for each phase. The conversational interface is great for the messy, exploratory work of figuring out what you’re actually building. Claude Code is great for building it.

    Iterate on design before you write code. We went through multiple rounds of revision on the requirements and architecture documents. Each round caught issues or surfaced priorities (like Event Sourcing belonging in Phase 1) that would have been much more expensive to discover during implementation. Measure twice, cut once. The carpenter’s rule exists for a reason.

    Bring your domain knowledge — and don’t be shy about pushing back. Claude made strong default recommendations, but the most valuable moments came when I disagreed based on my understanding of Hazelcast and the architecture I wanted. The AI is a powerful collaborator, but it doesn’t know what you know. If something feels wrong, say so. That’s where the real value of the collaboration happens.

    And document everything. I mean it. The design documents weren’t just planning artifacts — they became living reference material that Claude Code used throughout implementation. The CLAUDE.md file in particular became a continuously evolving guide that shaped code quality across the entire project. Every hour spent on documentation saved multiples in “no, that’s not what I meant” corrections later. I’ve never been great about documentation discipline, so having an AI that actually reads and follows the docs was a surprisingly effective motivator to keep them current.


    The Hazelcast Microservices Framework is open source under the Apache 2.0 license. You can find it at github.com/myawnhc/hazelcast-microservices-framework.

    Next up: what happened when we actually started coding. Spoiler: the plan did not survive intact.

  • Hazelcast Microservices Framework: Event Sourcing Demo

    How a side project connecting Event Sourcing to Hazelcast sat unfinished for years — and why I decided to bring it back with an AI collaborator.


    In my previous post, I shared some of my thinking about Event-Driven Microservices — the coupling problems, the mental shift toward thinking in events, and the patterns (Event Sourcing, CQRS, materialized views) that make it all work. That post was conceptual. This one is personal.

    I’ve been playing around with design concepts in this area for some time. While I was an employee of Hazelcast, I frequently worked with customers and prospects to show how Hazelcast Jet (an event stream processing engine built into the Hazelcast platform) could be used to build event processing solutions that would scale while continuing to provide low latency. These conversations were always framed around stream processing, though. Even when the intended use case involved microservices, we didn’t explicitly get into the Event Sourcing pattern. As someone coming from a database-centric background, I found the concept of events as the source of truth a bit much.

    The Light Bulb Moment

    It was a light bulb moment when I realized that Hazelcast Jet could fit naturally into an Event Sourcing architecture — and that Hazelcast IMDG (the in-memory data grid, or caching layer) could concurrently maintain materialized views representing the current state of domain objects.

    Think about it: Event Sourcing needs an event log and a processing pipeline. Hazelcast Jet is a processing pipeline. CQRS needs a fast read-side store that’s kept in sync with the event stream. Hazelcast IMDG is a fast read-side store. Event Sourcing + CQRS maps beautifully onto Jet + IMDG (even though that acronym is officially retired — it’s all just “Hazelcast” now).

    And from there, I really wanted to demonstrate this. The original Microservices Framework project began.

    Version 1: The Proof of Concept

    The first version was focused on proving the core idea worked. Could I wire up a Hazelcast Jet pipeline to process domain events, persist them to an event store, and update materialized views — all in a way that was generic enough to work across different services?

    The answer was yes. The central pattern that emerged was straightforward: a service’s handleEvent() method writes incoming events to a PendingEvents map, which triggers a Jet pipeline that persists events to the EventStore, updates materialized views, and publishes to an event bus for other services to consume. It worked, and it was fast.
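
    To make that flow concrete, here is a minimal sketch in plain Java. It is not the framework’s code: ordinary in-memory collections stand in for the PendingEvents map, the EventStore, and the materialized view, a direct method call stands in for the Jet pipeline that the map write would trigger, and the Event record, its fields, and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for the flow; plain collections replace the Hazelcast structures.
final class PipelineSketch {

    // Illustrative event shape: which domain object changed, and by how much.
    record Event(long sequence, String key, int delta) {}

    private final Map<Long, Event> pendingEvents = new LinkedHashMap<>(); // stands in for the PendingEvents map
    private final List<Event> eventStore = new ArrayList<>();             // append-only event log
    private final Map<String, Integer> view = new HashMap<>();            // materialized view: current state per key
    private final List<Consumer<Event>> subscribers = new ArrayList<>();  // stands in for the event bus
    private long nextSequence = 0;

    void subscribe(Consumer<Event> subscriber) {
        subscribers.add(subscriber);
    }

    // Service-facing entry point: record the event as pending, then let the "pipeline" run.
    void handleEvent(String key, int delta) {
        Event event = new Event(nextSequence++, key, delta);
        pendingEvents.put(event.sequence(), event);
        runPipeline(event.sequence());
    }

    // Assumption: a real pipeline reacts to the map write asynchronously; here it is a direct call.
    private void runPipeline(long sequence) {
        Event event = pendingEvents.remove(sequence);
        eventStore.add(event);                                // 1. persist to the event store
        view.merge(event.key(), event.delta(), Integer::sum); // 2. update the materialized view
        subscribers.forEach(s -> s.accept(event));            // 3. publish for other services
    }

    int currentState(String key) { return view.getOrDefault(key, 0); }
    int storedEventCount() { return eventStore.size(); }
    int pendingCount() { return pendingEvents.size(); }

    public static void main(String[] args) {
        PipelineSketch service = new PipelineSketch();
        List<Event> received = new ArrayList<>();
        service.subscribe(received::add);

        service.handleEvent("widget", 10);
        service.handleEvent("widget", -3);

        System.out.println("view state:    " + service.currentState("widget"));
        System.out.println("stored events: " + service.storedEventCount());
        System.out.println("published:     " + received.size());
    }
}
```

    The point of the sketch is the ordering inside runPipeline: the event is stored before the view changes, and the view changes before anyone else hears about it.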

    Now, the central components of the architecture — the domain object, event class, controller, and pipeline — have survived relatively intact through multiple iterations of the implementation. The bones were good. But a lot of the specific implementation choices I made around those bones haven’t aged all that well.

    You know how it goes with side projects. Technical debt accumulates quietly, one “I’ll fix this later” at a time, until you’re looking at a codebase where you know you’d make different choices if you were starting over — but the sunk cost of time already invested keeps you from actually doing it. It’s the software equivalent of a kitchen renovation where you keep patching the old cabinets because ripping them out feels like too big a project for a weekend.

    That version of the framework is still hanging around on GitHub, although I decided not to link to it here as I may take it down at any time. (Upcoming posts will link to the improved version, so embedding links to the original will inevitably lead to someone grabbing the wrong one.)

    I got it to a working state, but there was a long list of things I wanted to add. Saga patterns for coordinating multi-service transactions. Observability dashboards. Comprehensive tests. Documentation that went beyond “read the code.” Each of these was a meaningful chunk of work, and progress slowed to a crawl.

    The Stall

    Let’s be honest about what happened: the project stalled. Not dramatically; it was never really abandoned. It just… stopped moving. Every few months, when I had some extra time, I’d open the codebase and make a few inconsequential changes while thinking about the more ambitious refactorings and features I’d get to when time permitted.

    If you’ve ever maintained a passion project alongside a day job, you know this feeling. The ideas don’t go away — they sit in the back of your mind, periodically surfacing with a pang of “I should really get back to that.” But the activation energy to restart is high, especially when the next step isn’t a fun new feature but the grind of scaffolding, configuration, and test coverage. So you close the laptop and tell yourself next month will be different. (It won’t be.)

    Enter AI-Assisted Development

    In early 2025, I started using Claude for various coding tasks and was genuinely surprised by the results. This wasn’t autocomplete on steroids — I could describe an architectural pattern and get back code that understood the why, not just the what. I could say “this needs to work like an event journal with replay capability” and get something that actually accounted for ordering guarantees and idempotency.

    That’s when the thought crystallized: what if I could use this to break through the stall?

    Here’s the thing — the stuff that had been blocking me wasn’t the hard design work. I knew what the architecture should look like. The bottleneck was the sheer volume of implementation grind: scaffolding new services, writing comprehensive tests, wiring up Docker configurations, producing documentation. Exactly the kind of work where you need focused hours, and a side project never has enough of those.

    Now, I want to be clear about what I mean here, because “AI wrote my code” carries a lot of baggage. This wasn’t about handing off the project and checking back in when it was done. It was about having a collaborator who could take high-level design direction and turn it into working code at a pace that made the project viable again. I’d provide the domain expertise, the architectural decisions, and the quality bar. The AI would provide the throughput.

    Making the Decision

    I decided to move forward with a clean reimplementation rather than trying to evolve the existing codebase. The core patterns from the original work — the Jet pipeline architecture, the event store design, the materialized view update strategy — were proven and would carry forward. But the project structure, package naming, dependency versions, and framework abstractions would start fresh. Sometimes the best way to fix a kitchen is to actually rip out the cabinets.

    The plan was to use Claude’s desktop interface for iterative design discussions (requirements, architecture, implementation planning) and then hand off to Claude Code for the actual coding. Design first, then build — with comprehensive documentation at every step so the AI would have rich context to work from.

    What happened next — the design phase, the handoff to Claude Code, and the surprises along the way — is the subject of the next post.

    Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.
  • Event-Driven Microservices: Avoiding Distributed Monoliths

    You’ve heard the pitch for microservices. Small, independent services. Teams that can ship without waiting on six other teams to finish their sprint. No more three-month release cycles because somebody touched a shared library. It sounds great — and honestly, the core idea is great. But here’s the thing: a lot of teams adopt microservices and end up with something worse than the monolith they started with.

    I’ve spent the most recent part of my career working with distributed systems, and I’ve seen some of the ways monolith-to-microservice transitions can go awry. A team takes their monolith, draws some boxes around the major modules, splits them into separate services, deploys them independently, and declares victory. Six months later they’re debugging cascading failures at 2 AM and wondering why everything is harder than it used to be.

    What went wrong? They broke the monolith apart without actually decoupling it. And a distributed monolith — where you have all the operational complexity of microservices with none of the benefits — is arguably the worst of both worlds.

    The Coupling Problem

    Let’s be specific about what tight coupling looks like in a microservices architecture, because it’s not always obvious.

    Synchronous request-response everywhere. Service A calls Service B, which calls Service C, which calls Service D. If any one of those services is slow or down, the whole chain stalls. You haven’t built a resilient distributed system — you’ve built a monolith with network hops. And network hops are the worst kind of function calls, because now you get to deal with latency, partial failure, and timeout tuning on top of everything else.

    Shared databases. Multiple services reading from and writing to the same tables. This is the one that sneaks up on people, because the database feels like shared infrastructure rather than a coupling point. But the moment you need to change a schema, you’re coordinating across every service that touches those tables. You’re right back to “deploy everything together or deploy nothing” — which is exactly what microservices were supposed to fix.

    Data format dependencies. Service A produces a message with a certain structure. Services B, C, and D all parse that structure. Now Service A needs to add a field or change a type. Congratulations, you need buy-in from three other teams before you can ship. That’s not independent deployment — that’s a distributed approval process.

    Temporal coupling. Services that have to be running simultaneously to function. If the downstream service isn’t up right now, the upstream service can’t do its job. Your services aren’t really independent if they can only work when everyone else is awake. (Kind of like a group project where one person has to be physically present for anyone else to make progress. We’ve all been in that group project.)

    If any of this sounds familiar, you’re not alone. And the good news is that these problems are well-understood, and there are well-established patterns for solving them.

    Thinking in Events

    Here’s the mental shift that makes the difference: stop thinking about services calling each other, and start thinking about services reacting to things that happen.

    This is event-driven architecture, and at its core it’s about making your software reflect how the real world actually works. The real world doesn’t operate on synchronous request-response. Things happen — a customer places an order, a sensor reads a temperature, a payment clears — and other parts of the system respond to those events on their own terms, at their own pace.

    When you build systems this way, something interesting happens to those coupling problems:

    Synchronous chains disappear. Service A publishes an event. It doesn’t know or care who’s listening. Services B, C, and D each pick up the event and do their thing independently. If Service C is having a bad day, Services A, B, and D don’t notice — they keep right on working.

    Data ownership becomes clear. Each service owns its data, publishes events about what changed, and subscribes to the events it cares about from others. No shared databases, no schema coordination nightmares.

    Temporal coupling goes away. If a service is down when an event is published, that event waits in the stream until the service recovers and processes it. The system degrades gracefully instead of falling over.
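
    The last of those points is easiest to see in code. What follows is a rough sketch, assuming an in-memory list as the event stream and a per-consumer read offset; the class and method names are made up for illustration, and a real system would use a durable log rather than a List.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory event stream; a real system would use a durable log or journal.
final class EventStreamSketch {
    private final List<String> log = new ArrayList<>();           // append-only stream of events
    private final Map<String, Integer> offsets = new HashMap<>(); // next unread position per consumer

    // Publishing never checks who is subscribed or whether they are running.
    void publish(String event) {
        log.add(event);
    }

    // A consumer pulls whatever it has not yet seen, whenever it happens to be up.
    List<String> poll(String consumer) {
        int from = offsets.getOrDefault(consumer, 0);
        List<String> unread = new ArrayList<>(log.subList(from, log.size()));
        offsets.put(consumer, log.size());
        return unread;
    }

    public static void main(String[] args) {
        EventStreamSketch stream = new EventStreamSketch();
        stream.publish("OrderCreated:1");
        System.out.println("inventory sees " + stream.poll("inventory"));

        // The payment consumer is down while two more events are published.
        stream.publish("OrderCreated:2");
        stream.publish("OrderCreated:3");

        // On recovery it catches up on everything it missed, in order.
        System.out.println("payment sees   " + stream.poll("payment"));
        System.out.println("inventory sees " + stream.poll("inventory"));
    }
}
```

    Notice that publish has no failure path tied to a consumer. Whether the payment consumer is running only affects when it reads, not whether the publisher succeeds.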

    Now, this isn’t magic — you’ve traded one set of challenges for a different set. Event-driven systems have their own complexities: eventual consistency, event ordering, debugging asynchronous flows. We’ll get into all of that. But at least these are the right problems to have — problems that come from genuinely decoupled services rather than from a distributed monolith pretending to be something it’s not.

    Patterns That Make It Work

    If you start exploring event-driven microservices, you’ll quickly run into a set of well-known patterns that have emerged to address the practical challenges. Chris Richardson’s microservices.io is an excellent catalog of these — I’d recommend bookmarking it.

    Two patterns in particular are going to be central to what we explore in this blog, and I’ll admit it took me a while to appreciate how well they fit together:

    Event Sourcing — instead of storing the current state of your data and updating it in place, you store the sequence of events that led to the current state. Every state change is captured as an immutable event in an append-only log. This gives you a complete, auditable history of everything that happened in your system — not just “the account balance is $500” but “here’s every deposit, withdrawal, and transfer that got it there.”

    If you come from a database background (guilty), this feels deeply wrong at first. You mean I don’t just UPDATE the row? I keep every change? Forever? But once you get past the initial discomfort, the power of it becomes obvious. You can reconstruct any past state. You can answer questions you didn’t think to ask when the data was created. You have a complete audit trail for free.
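
    Here is a small sketch of what “reconstruct any past state” means in practice. The event names and amounts are invented for illustration (whole dollars, negative for money leaving the account); the only idea taken from the text is that the balance is derived from the log rather than stored.

```java
import java.util.List;

// Hypothetical account events; amounts are whole dollars, negative for money leaving the account.
final class ReplaySketch {
    record AccountEvent(String type, int amount) {}

    // State is never stored; it is derived by folding over the first `upTo` events.
    static int balanceAfter(List<AccountEvent> log, int upTo) {
        int balance = 0;
        for (AccountEvent event : log.subList(0, upTo)) {
            balance += event.amount();
        }
        return balance;
    }

    public static void main(String[] args) {
        List<AccountEvent> log = List.of(
                new AccountEvent("Deposited", 400),
                new AccountEvent("Withdrew", -150),
                new AccountEvent("TransferReceived", 250));

        System.out.println("balance now:            " + balanceAfter(log, log.size()));
        System.out.println("balance after 2 events: " + balanceAfter(log, 2));
    }
}
```

    Asking for the balance as of the second event is the same function call with a smaller upTo, which is the whole appeal: history is a parameter, not a separate audit table.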

    The catch is also obvious — if you need the current state, do you really have to replay every event from the beginning of time? For a system that’s been running for years, that’s not just slow, it’s unworkable.

    CQRS (Command Query Responsibility Segregation) — and this is where it gets interesting. You separate the write path (commands that produce events) from the read path (queries that serve up current state). The write side stores events. The read side maintains materialized views — pre-computed projections of whatever the read side needs, kept up to date by consuming the event stream.

    See what happens when you put these two together? Event Sourcing gives you the complete, immutable history. CQRS and materialized views give you fast reads without replaying the entire event log every time someone wants to check a balance. Each pattern solves the other’s biggest problem. It’s one of those combinations where the whole is genuinely greater than the sum of the parts — and as we’ll see in later posts, it maps onto certain technology stacks almost embarrassingly well.
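
    A minimal sketch of the combination, under some simplifying assumptions: the names are hypothetical, everything lives in one process, and the view is updated synchronously, whereas a real read side would usually consume the event stream asynchronously and lag slightly behind.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical CQRS split: the write side appends events, the read side serves a precomputed view.
final class CqrsSketch {
    record BalanceChanged(String accountId, int amount) {}

    private final List<BalanceChanged> eventLog = new ArrayList<>();  // write side: full history
    private final Map<String, Integer> balanceView = new HashMap<>(); // read side: current balances

    // Command path: append the event, then project it onto the view.
    void apply(String accountId, int amount) {
        BalanceChanged event = new BalanceChanged(accountId, amount);
        eventLog.add(event);
        project(event);
    }

    // The projection does constant work per event, however long the log has grown.
    private void project(BalanceChanged event) {
        balanceView.merge(event.accountId(), event.amount(), Integer::sum);
    }

    // Query path: a single lookup, with no replay.
    int balance(String accountId) {
        return balanceView.getOrDefault(accountId, 0);
    }

    // The view is disposable: it can always be rebuilt from the log.
    void rebuildView() {
        balanceView.clear();
        eventLog.forEach(this::project);
    }

    public static void main(String[] args) {
        CqrsSketch accounts = new CqrsSketch();
        accounts.apply("acct-1", 400);
        accounts.apply("acct-1", -150);
        accounts.apply("acct-2", 20);

        System.out.println("acct-1 balance: " + accounts.balance("acct-1"));
        accounts.rebuildView();
        System.out.println("after rebuild:  " + accounts.balance("acct-1"));
    }
}
```

    The rebuildView method is the safety net that Event Sourcing provides: if a projection turns out to be wrong, or you want a new one, you replay the log once instead of migrating data.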

    What’s Ahead

    This blog is going to be a hands-on exploration of these ideas — patterns first, then concrete implementations. I’m genuinely excited about this, because I think there’s a gap between the theoretical literature on event-driven architecture (which is excellent) and the practical “here’s how you actually build one” content (which is thinner than you’d expect). In the posts to come, we’ll dig into:

    • Resilience through decoupling — how event-driven systems degrade gracefully instead of cascading failures
    • Auditability and replay — the power of an event log as a source of truth, not just for debugging but for compliance, analytics, and the ability to answer questions you didn’t think to ask yet
    • Independent scalability — scaling the services under load without scaling everything, because your order processing pipeline doesn’t need to drag your user profile service along for the ride
    • Evolvability — adding new consumers of existing events without touching the producers, so your analytics team can tap into a data stream without filing a ticket with the team that owns it

    We’ll look at the patterns in general terms — what problem each one solves, what trade-offs it introduces, how to think about whether it’s the right fit — and then we’ll get into specific, working implementations that you can pull apart, run, and adapt to your own projects.

    If you’re a developer who’s building or maintaining a microservices architecture and found it harder than expected — or if you’re designing a new system and want to avoid the common pitfalls — this series is for you. The patterns are universal; the implementations will be specific. Let’s see where it takes us.

    Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.