Claude Code Memory: Context, CLAUDE.md, and When to Clear
Part 3 of a series on what I learned shipping BaseballScorer. Part 1 was the arc; Part 2 was the release workflow and the skills that mechanize it. This one is about memory — and more broadly, about the question every Claude Code user ends up arguing about: what goes in the context window, what goes in CLAUDE.md, what goes in persistent memory, and when to clear the whole thing and start fresh.

In late June, Apple’s upload API lied to me. I ran my TestFlight release lane, the build uploaded, and then fastlane reported a failure — a 500 “internal server error” from something called ASSET_SPI. The natural move is to retry the upload. I did, through Apple’s Transporter app, and Apple rejected the retry as a duplicate: the build was already there. The 500 hadn’t been the upload failing. It was Apple’s status check failing after the upload had already succeeded. The error was, not to put too fine a point on it, a lie — and figuring that out cost me a chunk of an evening.

Here’s the part that matters for this post. Two weeks and three releases later, that lesson was still operative. Every subsequent release, the assistant flagged it unprompted: if the lane reports an ASSET_SPI 500, don’t re-upload — verify whether the build actually landed first. I never re-explained it. I never re-derived it. A war story from June was still standing guard in July, across dozens of fresh sessions, each of which started — as every Claude session starts — knowing absolutely nothing about me or my project.

That’s what a memory system buys you. But “use memory” is not actually the interesting advice, because every AI-assisted developer I talk to is wrestling with a more tangled set of questions: How often should I clear my context? Does compaction make the model dumber? Should CLAUDE.md be lean or loaded? What’s the actual difference between telling Claude something in the prompt, putting it in CLAUDE.md, and saving it to memory? People have strong opinions about all of these, usually derived from one bad experience and generalized into doctrine.

So this post is my attempt at a working mental model — the one that’s held up across three-plus months and five TestFlight-and-App-Store releases of BaseballScorer. It’s not doctrine. But it’s been load-tested.

The memory hierarchy

If you’ve been an engineer for more than about a week, you already know this shape: registers, cache, RAM, disk. Small-fast-expensive at the top, big-slow-cheap at the bottom, and the whole game is putting each piece of data in the right tier.

Working with Claude Code has exactly this structure. Four tiers:

Tier 1: The conversation context. This is working memory — everything said and done in the current session, including the contents of every file Claude has read. It’s the most powerful tier, because everything in it directly shapes what the model does next. It’s also finite, expensive, and it decays (more on compaction in a minute). Crucially, influence cuts both ways: stale or wrong material in context doesn’t sit there neutrally. It competes with the truth.

Tier 2: CLAUDE.md. Standing orders. This file is loaded at the start of every session, which makes it the most expensive durable real estate you own — every line you put here is a line Claude reads before every single task, forever. It’s also checked into the repo, which turns out to matter more than it first appears.

Tier 3: Persistent memory. The judgment journal. In my setup this is a directory of small markdown files plus an index — the index is loaded every session (like CLAUDE.md, but for accumulated lessons rather than standing orders), and the detail files are pulled in only when relevant. This is where the ASSET_SPI story lives.

Tier 4: The repo itself. Ground truth. The code, the docs/ directory, the git history, the test suite. Effectively unlimited, fully durable, shared with every collaborator — and, critically, verifiable. Claude can read it fresh anytime and trust what it finds, because it’s not a note about the code; it is the code.
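The tier 3 layout, an index that is always read plus detail files opened only on a match, can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not the actual memory format: the `INDEX.md` name, the `filename | keywords` line format, and the plain substring matching are all hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def load_index(memory_dir: Path) -> dict[str, list[str]]:
    """Read the always-loaded index: one 'filename | keyword, keyword' entry per line."""
    index = {}
    for line in (memory_dir / "INDEX.md").read_text().splitlines():
        if "|" not in line:
            continue  # skip headings and blank lines
        filename, keywords = line.split("|", 1)
        # Drop empty keywords so a trailing comma cannot match every task.
        index[filename.strip()] = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    return index


def relevant_details(memory_dir: Path, task: str) -> dict[str, str]:
    """Open only the detail files whose keywords appear in the task description."""
    task_lower = task.lower()
    return {
        filename: (memory_dir / filename).read_text()
        for filename, keywords in load_index(memory_dir).items()
        if any(keyword in task_lower for keyword in keywords)
    }
```

The point of the split is cost: the index is paid for on every session, so it stays tiny, while the detail files cost nothing until a task actually touches their subject.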

And the rule that organizes everything — if you take one sentence from this post, take this one:

Push every fact down to the cheapest tier that preserves it, and treat the conversation as disposable.

If losing your context window right now would hurt, something important is living in the wrong tier. The conversation is where work happens, not where knowledge lives. The moment something in a session turns out to be durably true — a gotcha, a decision, a preference — it should flow downward: into memory, into CLAUDE.md, into a doc in the repo, wherever it belongs. What remains in the conversation should be only the work in progress.

Baseball version, since this is nominally a baseball-app series: the conversation is what’s in the scorer’s head during the play. CLAUDE.md is the ground-rules card taped inside the scorebook. Memory is the scorebook itself. The repo is the rulebook and the league’s official records. Nobody tries to keep the whole season in their head, and nobody should have to reread the rulebook to remember that the ballpark has a short porch in right.

With the model in place, let’s take the contested questions one at a time.

When should you clear the context?

Liberally, and specifically: between unrelated tasks.

The instinct to preserve a long-running conversation comes from a reasonable place — the model feels smarter mid-session, because it has all that context. And it genuinely is, while the context is relevant. The problem is what happens when you pivot. Finish a gnarly print-layout investigation, then start a networking feature in the same session, and all that layout reasoning is still sitting in working memory. It isn’t neutral filler. It’s noise with authority — hundreds of lines of intermediate hypotheses, half of which were wrong (that’s what investigation is), all still whispering to the model while it tries to think about something else.

Old context doesn’t just waste space. Wrong-but-confident material in context is precisely the raw ingredient hallucinations are made of. The model has no typographic marker distinguishing “conclusion we verified” from “hypothesis we abandoned twenty minutes ago”; both are just tokens it once said.

The discipline that makes clearing cheap is the push-down rule. When the print investigation concluded, the conclusion — “the scorecard print is height-bound, not width-bound; future size complaints should target row heights, not column widths” — went into memory. Two sentences. The eight hundred lines of measurement and dead ends that produced those two sentences got thrown away with the session, unmourned. Next time print size comes up, the two sentences come back and the dead ends don’t. That’s not lost information; that’s distilled information.

If clearing your context feels scary, that fear is diagnostic. It means knowledge is trapped in tier 1 that belongs in tier 3 or 4. Fix the filing, and the fear goes away.

Does compaction hurt accuracy?

Some. Here’s the mechanism, because knowing why tells you what to do about it.

When a session runs long, the harness compacts it: older conversation gets replaced by a summary. Summarization is lossy in a very particular way — it preserves narrative and drops precision. “We fixed the auto-advance bug and merged to main” survives compaction beautifully. The exact tag name, the specific line number, the precise flag that carried the fix — those are exactly the details a summary rounds off.

So the practical rule: after a compaction, trust the story, re-verify the specifics. If post-compaction work depends on an exact value — a version number, a build setting, a function signature — the move is to look it up fresh from the repo (tier 4), not to trust the summary’s recollection of it. Ground truth is one file-read away, and unlike the summary, it can’t have rounded anything off.
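"Trust the story, re-verify the specifics" amounts to diffing what the summary claims against what the file says. A minimal Python sketch, assuming the exact values live in a `KEY = value` settings file; the file format and the keys in the usage note are illustrative, not a description of this project's configuration.

```python
from __future__ import annotations


def parse_settings(config_text: str) -> dict[str, str]:
    """Parse 'KEY = value' lines, ignoring blank lines and // comments."""
    settings = {}
    for line in config_text.splitlines():
        line = line.split("//", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def stale_claims(summary_claims: dict[str, str], config_text: str) -> dict[str, tuple[str, str | None]]:
    """Return each claim that disagrees with the file, as key -> (claimed, actual)."""
    actual = parse_settings(config_text)
    return {
        key: (claimed, actual.get(key))
        for key, claimed in summary_claims.items()
        if actual.get(key) != claimed
    }
```

Given a summary that remembers `CURRENT_PROJECT_VERSION` as `28` and a file that says `29`, `stale_claims` returns `{"CURRENT_PROJECT_VERSION": ("28", "29")}`. An empty result means the summary's specifics happen to still be right.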

I can offer this series itself as evidence. These posts have been written across many sessions with the same assistant, through multiple compactions, spanning weeks of feature work in between. The continuity you’re reading — callbacks to Part 2’s war stories, the running motifs — survived not because the context window is heroic but because everything load-bearing lives in files: the draft posts themselves, a memory note tracking the series plan, the repo’s docs. The conversations were disposable, so losing detail from them cost nothing. The hierarchy is what makes compaction survivable, the same way it makes clearing safe. They’re the same insurance policy.

How much belongs in CLAUDE.md?

Less than you’re putting there, probably — but the reason matters more than the rule.

Every line of CLAUDE.md is read at the start of every session, before every task, for the life of the project. That’s its superpower and its cost. The budget question for any candidate line is: does this change Claude’s behavior often enough to justify being read every single time?

Things that clear that bar, from my actual file: the exact build and test commands (with the environment-variable gotcha that makes them work on my machine); the instruction to read the workflow doc before non-trivial work; the warning to never edit the Xcode project file directly while Xcode is open, because that way lies corruption; a note that a particular category of tooling is unreliable for builds so use the command line instead. Every one of those redirects behavior on a large fraction of tasks. They’ve each paid their rent many times over.

Things that don’t clear the bar: architecture narratives, feature history, aspirational coding standards nobody consults, and anything that reads like documentation. The tell is exactly that — if a section reads like documentation, it is documentation, and it belongs in docs/ with a pointer. My CLAUDE.md doesn’t contain my branching and release policy; it contains one line saying “read docs/workflow.md before starting non-trivial work.” The policy lives in tier 4, where it’s versioned, diffable, and readable by humans too. CLAUDE.md just makes sure Claude knows the pointer exists.

So in the great “edit down vs. fill up” debate: edit down, but not out of minimalist aesthetics — out of budget discipline. It’s the most expensive real estate you own. Spend it on behavior, link to everything else.

Memory vs. CLAUDE.md vs. the prompt

This one has the cleanest answer of the bunch, and it comes down to scope and authorship.

CLAUDE.md is checked into the repo. That makes it true-for-anyone: any collaborator, any future contributor, any other agent that clones the project gets the same standing orders. It describes how to work in this codebase. It’s also curated deliberately — you edit it the way you edit code, on purpose, in commits.

Memory is specific to a collaboration. Mine holds things that would be presumptuous or meaningless in a checked-in file: my preferences (I want a high bar for what earns a point release; I’m skeptical of elaborate persona prompts), corrections I’ve issued and why, the current state of in-flight work (“build 31 is on TestFlight awaiting tester assignment”), lessons that encode judgment rather than procedure. It accrues conversationally — “remember this” mid-session — rather than being edited like a source file. If CLAUDE.md is the ground-rules card, memory is the relationship.

The prompt is for this task only. Anything you find yourself typing into prompts repeatedly is a filing error — it’s a durable fact living in the most ephemeral tier, at the cost of your typing it forever. Promote it: repo-truths into CLAUDE.md, collaboration-truths into memory.
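The scope-and-authorship rule is mechanical enough to write down as a decision function. This is only a sketch of the reasoning above; the `Fact` fields and the tier labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    recorded_in_repo: bool          # code, docs, or git history already says this
    outlives_this_task: bool        # still true after the current task is done
    true_for_any_contributor: bool  # about the codebase, not about this collaboration


def tier_for(fact: Fact) -> str:
    """Pick the tier a fact belongs in, checking the cheapest homes first."""
    if fact.recorded_in_repo:
        return "repo"        # never duplicate ground truth
    if not fact.outlives_this_task:
        return "prompt"      # ephemeral: say it once and let it go
    if fact.true_for_any_contributor:
        return "CLAUDE.md"   # standing orders for anyone who clones the project
    return "memory"          # durable, but specific to this collaboration
```

A preference such as "high bar for point releases" is durable and personal, so it lands in memory. A build command that every contributor needs lands in CLAUDE.md.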

The taxonomy of what earns a memory slot, from three months of practice — three categories carry nearly all the value:
1. Corrections, saved with the why. Not “don’t edit project.pbxproj directly” but “don’t edit it while Xcode is open, because external edits can corrupt Xcode’s in-memory state — ask me to make the change in the Xcode UI instead.” The why is what lets the lesson generalize instead of becoming a cargo-cult rule.
2. Validated approaches. When something works and we confirm it worked, that’s as valuable as a correction. The best example from this project: Siri integration silently failed with one Apple API pattern and worked with another (Part 2 readers will remember the AppEnum saga). The memory doesn’t just say which one won; it says what the failure looked like, so the next occurrence gets recognized in minutes instead of hours.
3. Project state that isn’t in the code. What shipped in which build, what’s awaiting whose decision, what the tester feedback said. Git knows what changed; it doesn’t know what we’re waiting on.
And one anti-category: never save what the repo already records. A memory that duplicates the code is a stale copy waiting to mislead. If Claude can look it up in tier 4, it should — which brings us to the sharpest knife in the drawer.

Memory is not live state

Here’s the discipline that separates a memory system that compounds from one that slowly poisons you: a memory is a point-in-time observation, not a fact about the present.

Code moves. Files get renamed, functions get refactored, flags get removed. A memory that says “the fix is the flag on line 600 of such-and-such service” was true the day it was written and gets falser every week. The rule we run: when a memory names a file, a function, a line, a setting — verify it against the current repo before acting on it. The memory’s job is to point; the repo’s job is to be true.
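"The memory points, the repo is true" can be enforced with a check that runs before a memory is acted on. A minimal Python sketch, assuming each memory that names code can be reduced to a file path and a symbol; `MemoryPointer` and `check_pointer` are hypothetical names, and a plain substring search stands in for real symbol lookup.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class MemoryPointer:
    note: str     # the lesson itself
    path: str     # file the memory names, relative to the repo root
    symbol: str   # function, flag, or setting the memory names


def check_pointer(repo_root: Path, pointer: MemoryPointer) -> str:
    """Classify a memory's pointer as 'ok', 'symbol-missing', or 'file-missing'."""
    target = repo_root / pointer.path
    if not target.is_file():
        return "file-missing"     # renamed or deleted since the note was saved
    if pointer.symbol not in target.read_text():
        return "symbol-missing"   # refactored away; the lesson needs re-deriving
    return "ok"
```

Anything other than `"ok"` means the lesson may still be valid but its coordinates are not, which is the cue to re-read the repo rather than act on the note.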

This is the same principle as the compaction rule, and it’s worth saying why: stale information is worse than no information, because it arrives wearing the costume of authority. A model with no memory of your build system will go read the config and get it right. A model with a confident eight-week-old memory of your build system may not think to check. The failure mode of memory isn’t forgetting — it’s remembering wrong, fluently.

Two small hygiene habits fall out of this. First, absolute dates: a memory that says “last Tuesday” is gibberish in a month, so relative time gets converted to real dates at save time. Second, aggressive pruning: when a memory turns out to be wrong or obsolete, it gets deleted, not annotated. Memory is a working set, not an archive — the archive is git.
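The absolute-dates habit is a small conversion done once, at save time. A minimal Python sketch covering only a few phrases; `resolve_relative_date` is a hypothetical helper, and it reads "last Tuesday" as the most recent Tuesday strictly before today, which is one of two common readings.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_relative_date(phrase: str, today: date) -> date:
    """Turn 'today', 'yesterday', or 'last <weekday>' into a calendar date."""
    phrase = phrase.strip().lower()
    if phrase == "today":
        return today
    if phrase == "yesterday":
        return today - timedelta(days=1)
    if phrase.startswith("last "):
        target = WEEKDAYS.index(phrase[len("last "):])    # ValueError on unknown names
        days_back = (today.weekday() - target) % 7 or 7   # never resolve to today itself
        return today - timedelta(days=days_back)
    raise ValueError(f"unsupported phrase: {phrase!r}")


# Saved on Wednesday, July 8, 2026, "last Tuesday" is stored as a real date.
print(resolve_relative_date("last Tuesday", date(2026, 7, 8)).isoformat())  # 2026-07-07
```

Failing loudly on phrases it does not understand is deliberate: a memory with an unresolved "a while back" is better rejected at save time than trusted later.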

The payoff: judgment that compounds

Part 2 argued that skills turn workflows into things that happen the same way every time. Memory does the same thing one level up: it makes judgment repeatable. Every war story costs you once and then pays dividends forever — but only if the distillation is good. Three examples from just the past two weeks of BaseballScorer work, because recency is the point:

The ASSET_SPI lie you already know. One bad evening in June; every release since has carried the antidote in its pocket.

The print investigation ended with a two-sentence memory — “height-bound, not width-bound; target row heights” — that converts every future “can the print be bigger?” request from an afternoon of measurement into a thirty-second answer.

Best of all, the batting-around bug. A live game exposed a display bug: when a team bats around, a player can reach base twice in one inning, and any code that matched events to players without also checking sequence mixed the two trips together. We fixed the two places it bit us. But the memory doesn’t record the fix — the commit records the fix. The memory records the pattern: any player-keyed scan over inning events breaks under batting-around unless it’s sequence-bounded. That’s a lesson about a whole class of latent bugs, some of which probably exist in code we haven’t stressed yet. When one surfaces next April, the diagnosis is pre-loaded.

That’s the compounding: fixes accumulate in the repo, but pattern recognition accumulates in memory. One is what happened; the other is what to watch for.
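The batting-around pattern is easier to see in miniature. This is a Python sketch under stated assumptions, not BaseballScorer's actual model: the `Event` shape, the player name, and both function names are hypothetical, and an inning is flattened to a list of events ordered by sequence number.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Event:
    seq: int      # position of the event within the inning
    player: str
    kind: str     # "plate_appearance" or "advance"
    detail: str


def events_for_player(events: list[Event], player: str) -> list[Event]:
    """Player-keyed scan: merges both trips when a team bats around."""
    return [e for e in events if e.player == player]


def events_for_trip(events: list[Event], player: str, start_seq: int) -> list[Event]:
    """Sequence-bounded scan: one trip, ending where the player's next plate appearance begins."""
    later_starts = [
        e.seq for e in events
        if e.player == player and e.kind == "plate_appearance" and e.seq > start_seq
    ]
    end_seq = min(later_starts, default=float("inf"))
    return [e for e in events if e.player == player and start_seq <= e.seq < end_seq]


inning = [
    Event(1, "Ruiz", "plate_appearance", "single"),
    Event(3, "Ruiz", "advance", "to third"),
    Event(5, "Ruiz", "advance", "scores"),
    Event(12, "Ruiz", "plate_appearance", "walk"),  # the team has batted around
]

print(len(events_for_player(inning, "Ruiz")))   # 4: both trips mixed together
print(len(events_for_trip(inning, "Ruiz", 1)))  # 3: the first trip only
```

The naive scan is correct in every inning where nobody bats twice, which is why this class of bug stays latent until a live game exposes it.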

Tutorial mode, briefly

The prescriptions, in the spirit of the previous posts:
- Treat the conversation as disposable, and act accordingly. Distill conclusions downward the moment they’re conclusions. Then clear without fear, especially between unrelated tasks.
- After compaction, trust the narrative and re-verify the numbers. Exact values should come from the repo, not from a summary’s memory of them.
- Budget CLAUDE.md like the expensive real estate it is. Behavior-changing lines only; anything that reads like documentation moves to docs/ and leaves a pointer.
- Scope decides the tier. True for anyone in the repo → CLAUDE.md. True for this collaboration → memory. True for this task → the prompt. Typing it repeatedly → you’ve filed it wrong.
- Save the why with every correction, and save validations, not just failures. Both halves of the feedback signal matter.
- Verify memories before acting on them. Point-in-time observations, not live state. Stale-but-confident is the failure mode.
- Prune as aggressively as you save. Wrong memories don’t age into harmlessness; they age into ambushes.

What’s next

The final post in this series is the one I most wish had existed when I started: moving from standalone Claude Code in a terminal to the Xcode-integrated version — what’s different, what’s missing, what to do instead. If this post was about where knowledge should live, that one is about where the assistant lives, and it turns out the answer changes more than you’d expect.

That’s where we’ll leave things for today.

Part of an ongoing series at Nodes and Edges. The app is on the App Store, and the companion scoring guide lives at scoring.theyawns.com.
July 8, 2026
Claude Code Custom Skills & fastlane for iOS Releases
Part 2 of a 4-post series on what I learned shipping BaseballScorer. Part 1 was the arc — first commit to App Store in eighteen days. This one is the machinery underneath: the release workflow, and the handful of custom Claude Code skills I actually use.

Here’s a confession to start with, because it sets up everything else in this post: on my pre-retirement Java projects, I had eight specialized Claude agents. I had config-manager and debugging-helper and documentation-writer and framework-developer and performance-optimizer and pipeline-specialist and service-developer and test-writer. Each had its own persona prompt. Each was going to be the expert in its lane. I built a little org chart of robots and felt very clever about it.

In hindsight: overkill. Almost all of it.

On BaseballScorer I have five skills — bug-fix, release, commit, testflight-upload, and security-review — and I’d argue four of them earn their keep and one is borderline. That’s the whole roster. No personas. No “you are a senior iOS architect with twenty years of experience” preamble. The main agent is already a senior iOS architect with twenty years of experience, or near enough; telling it to pretend to be one is theater.

So if you came here for “here are the twelve agents you need to ship an app,” I’m going to disappoint you on purpose. The thesis of this post is that the highest-leverage Claude Code artifacts on a real project aren’t clever — they’re boring. They encode the multi-step, error-prone, do-it-the-same-way-every-time workflows that you’d otherwise wing each Friday and get subtly wrong. A good skill isn’t a personality. It’s a checklist with teeth.

Let me show you what I mean.

What earns a skill

Here’s the test I landed on, after the Java over-engineering taught me what not to do: a workflow earns a skill when it’s multi-step, painful to do by hand, and — this is the one people skip — dangerous to do inconsistently.

That third criterion is where the value actually lives. A one-step task doesn’t need a skill; you just ask. A multi-step task you do once a year doesn’t need a skill; you look it up. But a multi-step task where doing the steps in the wrong order, or skipping one, quietly corrupts something — that’s where you want the steps welded together so neither you nor Claude can freelance them at 11pm.

Releasing a build is the canonical example. So let’s start there.

fastlane: one place that talks to Apple, and only one

Quick detour for anyone who hasn’t met it — and if you’re new to iOS, you probably haven’t: fastlane is an open-source toolkit that automates the tedious parts of shipping an app. Building the archive, signing it, uploading to TestFlight, pushing screenshots and the App Store description, submitting for review — all the steps you’d otherwise do by hand-clicking through Xcode and the App Store Connect website. You write down what you want once, in a file called a Fastfile, as a named recipe (fastlane calls these “lanes”), and then fastlane ios beta runs the whole recipe the same way every time. Think of it as the difference between following a checklist taped to the wall and pressing a single button that does the checklist for you. Until I started this project I didn’t know it existed either; now I’d no sooner ship without it than score a game without a pencil.

With that out of the way, the single most important rule in my release process is this: exactly one thing is allowed to talk to App Store Connect, and that thing is fastlane, driven from a config file in my repo. I do not log into the App Store Connect website and edit the description. I do not tweak the “What’s New” text in the browser because it’s faster. Everything goes through docs/app-store-metadata.md → fastlane → Apple.

I learned this the way you learn most worthwhile rules — by getting burned. Early on, before fastlane owned the metadata, I added a line to my App Store description in the web UI: “no ads, no paywall.” Felt good. Forgot about it. A few weeks later a routine fastlane push regenerated the listing from a doc in my repo — a doc that didn’t have that line — and silently overwrote my edit. No warning, no diff, no “are you sure.” The web edit and the repo doc were two sources of truth, and when two sources of truth disagree, one of them loses, usually the one you forgot you had.

The fix isn’t “remember not to edit the website.” The fix is to make the repo the only source of truth and let the automation be the only writer. Now if I want to change the description, I change the markdown, and fastlane is the courier. There’s exactly one path, so there’s nothing to get out of sync with.

This is a theme, so I’ll name it now and you’ll see it three more times before we’re done: when something bites you because two things can both do the job, the fix is usually to make sure only one thing can.

The beta lane, and the lesson hiding in its control flow

The skill I lean on most is testflight-upload, which runs my fastlane beta lane. On the surface it’s mundane — it bumps the build number, archives, uploads to TestFlight, and tags the release in git. But there’s a design decision baked into the order of those steps that I want to pull out, because it’s the kind of thing that’s invisible when it works and infuriating when it’s done the other way.

My workflow doc has a rule: tag after the upload succeeds, never before. A failed upload should not burn a version tag. That’s easy to say in a doc and easy to violate in practice — you tag, then upload, then the upload dies, and now you’ve got a tag v1.4-b28 pointing at a build that never made it to Apple. Next time you’ll either reuse the tag (don’t) or skip it (now your tags lie about what shipped).

The trick is that in the beta lane, that rule isn’t a comment reminding me to be careful. It’s control flow. The archive and upload_to_testflight calls come first; the commit_version_bump, add_git_tag, and push_git_tags calls come after. If the upload throws, the lane halts — and execution never reaches the tagging code. You cannot burn a tag on a failed upload because the code that creates the tag is downstream of the code that can fail. The “be careful” rule got promoted from a human responsibility to a structural guarantee.

That’s the move I keep coming back to with skills. Anywhere you find yourself writing “remember to X,” ask whether you can instead arrange things so that not doing X is impossible. A reminder is a liability you carry forever. A structural guarantee you build once.
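The tag-on-success guarantee is nothing more than ordering plus an exception. A minimal Python stand-in for the idea, not the actual Fastfile: `run_beta_lane`, `UploadError`, and the callables passed in are hypothetical names.

```python
class UploadError(Exception):
    """Raised when the upload step fails."""


def run_beta_lane(upload, tag_release, log):
    """Run the steps in order; an exception from upload() skips everything after it."""
    log.append("archive")
    upload()          # may raise UploadError
    log.append("upload")
    tag_release()     # only reachable when upload() returned normally
    log.append("tag")


tags = []
log = []


def failing_upload():
    raise UploadError("upload rejected")


try:
    run_beta_lane(failing_upload, lambda: tags.append("v1.4-b28"), log)
except UploadError:
    pass

print(tags)  # []: the failed upload never reached the tagging step
```

No flag, no check, no reminder: the tag cannot be created on a failed upload because the line that creates it is never executed.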

The lane has a couple of other guards in the same spirit. Before it does anything, it checks that you’re on main with a clean working tree (ensure_git_branch, ensure_git_status_clean) — because releasing from a feature branch with uncommitted experiments is a great way to ship something you didn’t mean to. And it auto-generates the TestFlight changelog from git commit messages since the last v* tag, excluding merge commits. That last bit is small but it means my changelog can’t drift from my actual history, because it is my actual history. One source of truth again. You’ll keep seeing it.
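The changelog rule (commit messages since the last v* tag, merge commits excluded) can be modeled without git at all. A minimal Python sketch over an in-memory, newest-first, linear history; the `Commit` shape and the sample messages are hypothetical, and this is not how fastlane implements it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Commit:
    message: str
    is_merge: bool = False
    tags: list[str] = field(default_factory=list)


def changelog_since_last_tag(history: list[Commit], pattern: str = "v*") -> list[str]:
    """Collect non-merge messages, newest first, stopping at the latest tag matching pattern."""
    lines = []
    for commit in history:  # history is ordered newest first
        if any(fnmatch(tag, pattern) for tag in commit.tags):
            break           # this commit and everything older shipped in an earlier build
        if not commit.is_merge:
            lines.append(commit.message)
    return lines
```

Because the changelog is computed from history rather than written alongside it, there is no second copy that could fall out of date.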

The locale crash, or: how an em-dash took down my release

Now for a war story, because abstract principles are easy to nod at and forget.

The first time I ran the beta lane on this machine, it crashed. Not with a useful error — with this:
```
[!] invalid byte sequence in US-ASCII (ArgumentError)
```
followed, a few lines later, by fastlane helpfully informing me that it “requires your locale to be set to UTF-8.” The proximate cause: macOS shells default to a US-ASCII locale, and fastlane’s build step parses xcodebuild’s output as it streams by. The first non-ASCII byte in that stream (and there’s always one eventually) makes the parser fall over.

And here’s the part that’s almost too on the nose: the non-ASCII byte that took down my release was, as often as not, an em-dash. In my own App Store metadata. Which I write full of em-dashes, because — well, you’ve read this far, you’ve noticed. My prose style was crashing my deployment pipeline. There’s a metaphor in there about the cost of having a voice, but I’ll leave it alone.
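The failure is easy to reproduce outside fastlane. This is a Python analogue of the mechanism, not fastlane's actual Ruby parser: `first_undecodable_byte` is a hypothetical helper, and the sample string is borrowed from the metadata story earlier in the post.

```python
from __future__ import annotations


def first_undecodable_byte(raw: bytes, encoding: str) -> int | None:
    """Return the offset of the first byte that `encoding` cannot decode, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as error:
        return error.start
    return None


# "\u2014" is an em-dash; in UTF-8 it is three bytes, none of them valid ASCII.
stream = "No ads \u2014 no paywall".encode("utf-8")

print(first_undecodable_byte(stream, "ascii"))  # 7
print(first_undecodable_byte(stream, "utf-8"))  # None
```

Same bytes, two outcomes: the data was never the problem, only the encoding the reader assumed.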

The first fix was the obvious one: set the locale on the command line every time.
```
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 /opt/homebrew/bin/fastlane ios beta
```
That works. But look at what it is — it’s a “remember to X.” Every release, forever, I’d have to remember to prefix the command with the magic words, or watch it die on the first smart quote. That’s exactly the kind of carried liability I just spent a section telling you to eliminate.

So the real fix went into the top of the Fastfile itself:
```
# Force a UTF-8 locale unless the calling shell already provides one.
ENV["LANG"] = "en_US.UTF-8" unless ENV["LANG"]&.include?("UTF-8")
ENV["LC_ALL"] = "en_US.UTF-8" unless ENV["LC_ALL"]&.include?("UTF-8")
```
Now the trap is disarmed permanently. The lane sets its own locale before it does anything else, so it doesn’t matter what shell I run it from or whether I remembered the incantation. The gotcha can’t recur because the tool defends itself. Same pattern as the tag-on-success thing: take a rule that lived in my head and move it into a place where it’s enforced by code.

If there’s one transferable habit from this whole post, it’s that one. When you hit an environment gotcha, the fix is not to remember it. The fix is to make it impossible to hit again, in the most permanent place you can put the fix.

The bug-fix skill: branch off the buggy tag

The other skill that genuinely changed how I work is bug-fix, and it’s worth explaining because it encodes a habit that I’m told is less common than I’d assumed from my pre-Claude career.

When a bug ships in, say, build v1.3-b26, the fix does not start from current main. It starts from the tag — git checkout -b bugfix/short-name v1.3-b26. You branch from the code that actually shipped the bug.

Why bother? Two reasons, both about honesty. First, the skill makes you write a failing reproducer test before the fix — a test named test_bugfix_<shortDescription> that demonstrates the bug. And a reproducer test is only trustworthy if it reproduces the bug on the code that shipped it. If you write your test against current main, where the symptom may have already shifted or been accidentally masked by other changes, you might write a test that passes for the wrong reason and convince yourself you’ve fixed something you haven’t. Branching from the tag guarantees the test fails for the real reason before it passes for the real reason.

Second, it gives you a clean merge path forward. The fix and its test travel together from the tag up to main, and the reproducer stays in the suite forever as a tripwire against regressions. I’ve got a couple of recent ones from the 1.4 cycle — an error that was getting credited to the wrong team in the box score, and a runner-advancement display that rendered a hanging “advanced to ” with no destination — and in both cases the value wasn’t just the fix. It was that the test which proves the fix is now a permanent member of a 365-test suite that runs before every release. The bug can come back, but it can’t come back quietly.

The discipline of test-first matters even when — especially when — Claude is the one writing the test. It keeps both of us honest about whether we’re fixing the actual bug or just papering over the symptom that happened to be visible. It’s very easy to make a symptom disappear. It’s harder, and more valuable, to prove you understood it.

When the automation breaks (and it will)

I want to close the practical part with the least glamorous lesson, because it’s the one nobody puts in their “ship with Claude!” thread: every piece of automation needs a documented recovery procedure, and that procedure belongs right next to the automation, written while you’re calm.

Two examples from this project, both real, both having cost me an evening.

The beta lane bumps the build number across every target — app, tests, screenshots — before it archives. If it crashes mid-lane (say, on a locale issue before I’d pinned the fix), it’s already dirtied the project file, and the next run’s clean-tree guard refuses to proceed. The first time this happened I flailed. Now there’s a known dance: revert the app target’s build number in Xcode’s UI (not by hand-editing the project file while Xcode’s open — that way lies corruption), commit the leftover diff with a “cleanup from failed run” note, and re-run. The lane re-bumps everything to the next number. Skipping a build number is fine, by the way — Apple only requires that build numbers go up, not that they’re contiguous. That fact alone would’ve saved me twenty minutes of panic if I’d known it.

The second one is sneakier and I love it as a cautionary tale. On one upload, fastlane reported a flat-out failure: an ASSET_SPI 500, “internal server error,” during the post-upload status check. So I did the natural thing and retried the upload through Transporter — which Apple promptly rejected, because the build was already there. The 500 wasn’t the upload failing. It was Apple’s status-check endpoint failing after the upload had already succeeded. The error message was, not to put too fine a point on it, a lie. The only reason I figured it out is that the duplicate-rejection error (bundle version already used) told the truth that the 500 had obscured.

The lesson there isn’t about fastlane specifically. It’s: don’t trust an error message about a remote system’s state — verify the actual state. Apple told me the upload failed. Apple was wrong. The build was sitting in App Store Connect the whole time. When a distributed system reports a failure, it’s reporting that one call failed, which is not the same as the operation having failed, and the gap between those two things is where you lose evenings if you take the error at face value.

(If that distinction sounds familiar, it’s the same reason “the network is unreliable” is the first hard lesson in distributed systems. A failed acknowledgment doesn’t tell you the work didn’t happen. It tells you that you didn’t hear that it happened. Apple’s 500 was a lost ack, nothing more.)
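"Verify the actual state" can be written as a wrapper around the call that might lie. A minimal Python sketch, assuming there is some way to ask the remote side whether a build exists; `upload_build`, `upload`, and `build_exists` are hypothetical callables, not App Store Connect or fastlane APIs.

```python
def upload_build(build_number, upload, build_exists):
    """Upload once; on a reported failure, check the remote state before deciding what happened."""
    try:
        upload(build_number)
    except Exception:
        if build_exists(build_number):
            return "landed-despite-error"   # the failure was a lost acknowledgment
        raise                               # the build is not there; the failure is real
    return "uploaded"


remote = set()


def flaky_upload(build_number):
    remote.add(build_number)                # the work succeeds...
    raise RuntimeError("500 internal server error")  # ...but the status check fails


print(upload_build(31, flaky_upload, lambda n: n in remote))  # landed-despite-error
```

The retry decision is made from the remote state, never from the error text, so a lost acknowledgment cannot trigger a duplicate upload.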

All of this — the recovery dances, the “skip a build number, it’s fine,” the “the 500 is a liar” — lives in a doc in my repo and a couple of memory notes Claude carries between sessions. Which is the natural segue to where we’re headed next.

The actual point

Strip away the war stories and here’s what the five skills and the one config file have in common: none of them make Claude smarter. The model was already plenty smart. What they do is make the process repeatable and the hard-won lessons durable. The locale fix, the tag-on-success ordering, the branch-off-the-tag habit, the single source of truth for metadata — every one of those is a place where a mistake I made once got promoted into something I can’t easily make again.

That’s the unsexy truth about being productive with an AI coding assistant on a real, shipping project. The leverage isn’t in elaborate prompts or a cast of specialized agents with backstories. It’s in noticing which boring workflows are error-prone, encoding them so they happen the same way every time, and turning each war story into a guardrail before you have to fight the same war twice. A skill is just where a hard-won lesson goes to become a habit.

Which raises an obvious question: how does any of that survive across months of work, when each Claude session starts fresh and remembers nothing? How is the lesson from June still there in September? That’s the persistent-memory system, and it’s the subject of the next post: the same idea as this one, lifted up one level, from “make this workflow repeatable” to “make this project’s accumulated judgment repeatable.” That’s where we’ll leave things for today.

Part of an ongoing series at Nodes and Edges. If you’re curious about the app itself, it’s on the App Store, and the companion scoring guide lives at scoring.theyawns.com.
July 3, 2026
Claude Code for iOS: Shipping a Real App in 18 Days
Part 1 of a 4-post series on what I learned shipping BaseballScorer — from first commit to a usable App Store release in under three weeks, plus everything that’s come after.

I have files on my laptop dated January 7, 2009. They’re the start of an iOS baseball scoring app, written in Objective-C, abandoned partway through the lineup management screens after several other apps beat me to the App Store. I shipped 1.0 of BaseballScorer on April 15, 2026 — about seventeen years and three months later. The gap between those dates isn’t a story about Swift vs. Objective-C. It’s a story about productivity floors.

The 2009 version stalled because building a real iOS app — even one whose design I’d been sketching since the Apple Newton — was a part-time hobbyist’s nightmare. The 2026 version shipped because Claude Code took “build the version of this app I actually want, even though the market is crowded with perfectly good alternatives” from a fantasy into a practical project. The first commit landed on March 28, 2026 — the regular season was about to start. The 1.0 release went live on the App Store eighteen days later, and 1.0 wasn’t a hollow milestone-for-the-sake-of-shipping. It was a genuinely usable scoring app — one you could take to a ballgame and actually score a game with. The two months between 1.0 and 1.3 have been a steady cadence of upgrades, increasingly guided by feedback from real users — both App Store downloaders and folks on the public TestFlight link — rather than by my own backlog. There’s still plenty more in the pipeline.

This is the first of four posts where I try to be honest about what that looked like. Not “Claude wrote my app for me” — that’s not what happened — but a frank account of what I brought, what Claude brought, what went well, what I regret, and what I’d do differently. The next three posts will go deep on (2) the release workflow and the custom Claude Code skills I actually use, (3) the persistent memory system that lets a single Claude conversation feel coherent across months, and (4) the specific differences between running standalone Claude Code in a terminal and the Xcode-integrated version — which is the post I most wish I’d had when I started. This one is the arc.

Why ship into a crowded market?

If you search “baseball scoring app” in the App Store right now you’ll find plenty of decent options. I know, because I checked, repeatedly, every time I asked myself whether this was a sensible use of my time. The honest answer is: no, not by any normal definition of “sensible.” I’m not going to dethrone anyone. Most baseball-scoring app users are loyal to whatever they learned first, and they should be — the existing options work fine.

The reason I built it anyway is the same reason you might build your own task tracker even though Todoist exists. I had a specific mental model of how scoring a baseball game on an iPad should feel, and none of the existing apps matched it. Some were too “this is a database, please fill it in.” Others tried to be too clever about inferring plays and left me fighting them when I wanted to record something unusual. My design philosophy, which I’ll come back to in a minute, is “the app trusts you.” Everything is optional. Nothing blocks you from moving forward. You can be sloppy and still end up with a usable scorecard, because in the bleachers, sometimes you have to be sloppy.

The other “why now” factor: I’d recently transitioned mostly into retirement, but I was a computer nerd before anyone paid me to be one, so “stop doing tech because no one’s paying me” was never going to be the deal. A project I genuinely wanted to use was the right shape for that phase of life. Side projects without bosses tend to either die fast or finish well, and this one was going to do one of the two.

Here’s the part that’s relevant to the Claude Code angle: that specificity is exactly the kind of thing that used to make “build it yourself” infeasible. Not because the design was hard — most of the design was twenty-plus years old, sitting in my head since the Newton days. It was infeasible because the cost of translating a clear design into working SwiftUI + SwiftData code, with reasonable test coverage and a clean release process, exceeded what I could spend on a side project. Claude Code dropped that cost enough that “build my own version of an app that already exists” went from “fun fantasy” to “actually happening on weekends.”

If you have a personal-version-of-an-existing-app project that you’ve been sitting on, this is the part of the post where I tell you to just start it. You don’t need a market opportunity. You need a productivity floor low enough that doing it for yourself is a reasonable trade for your time.

What came from where

Almost every Claude Code post I’ve read leaves the credit question vague. Mine won’t. Here’s the honest division on BaseballScorer:

From me:

The first two bullets below trace back to a Newton-era bitmap mockup I made decades ago. The rest emerged during this project, mostly from the iPad form factor making certain choices obvious.
- The basic layout — a line score across the top, and then the rest of the screen is the main scoring area with tap targets for the fielding sequence, inning summary down the left, previous at-bats across the top
- The idea of tapping bases to drive baserunner actions
- The “the app trusts you” philosophy — every field optional, an incomplete at-bat never blocks progress, casual scoring is the default. Getting distracted or interrupted and missing a play shouldn’t make it impossible to continue.
- The decision to make portrait orientation the scoring view and landscape the scorecard grid (an iPad-driven call — the Newton mockup had no equivalent)
- The K vs. Kc distinction (swinging strikeout vs. called/looking)
- iPad-primary with iPhone as an adaptive secondary

From Claude, almost entirely:
- The color system for ball / strike / foul / hit-by-pitch. I’d envisioned the buttons monochrome with inapplicable ones dimmed. Claude proposed a color encoding and I liked it immediately. It’s now one of my favorite things about the app.
- Flipping the button set between “pitch results” and “in-play outcomes” depending on the moment in the at-bat. My original design had every button visible all the time with the inapplicable ones grayed out. The flip is better. I didn’t see it.
- Most of the SwiftUI idiom. My only prior iOS App Store release was a collectible-card-game companion app, written in Objective-C years ago — nothing to do with baseball, nothing to do with Swift. BaseballScorer is my first Swift project and my first SwiftUI project. Claude carried me through the language and framework. I had strong opinions about what the UI should do. Claude knew how to make SwiftUI actually do it.

Heavily collaborative:
- The data model. I had an event-sourcing mental model from a separate project, and Claude knew SwiftData’s quirks. We arrived at the current Game → Inning → AtBat → PlayEvent structure together. (We also made an architectural decision there I now regret — more on that below.)
- The release workflow and the custom skills. I brought the discipline; Claude wrote most of the actual fastlane glue and the skill definitions.
- The test discipline. 365 unit tests, zero failing, as of v1.3-b26. I insisted on the failing-reproducer-test-first habit for bug fixes; Claude wrote most of the tests.
This is, I think, the actually-honest shape of a productive human/AI collaboration on a real codebase. It’s not “Claude built it.” It’s not “I built it with Claude as a fancy autocomplete.” It’s a real division of labor where one side brings vision and judgment and the other side brings language fluency and willingness to write the boring parts, and they meet in the middle on the interesting parts.

The structural mistake (and the screenshot that proved it was real)

In early April 2026, between TestFlight builds 6 and 7, a tester I’d never met sent me a screenshot via the public TestFlight link. He was trying to catch up to a live NYY-at-TB game using my MLB-feed catch-up path. The screenshot showed four distinct symptoms in one frame:
1. Three outs filled in on the indicator, but the half-inning hadn’t ended and the active-batter card was still up
2. The active batter card showed Goldschmidt (a Yankees player) while TB was supposed to be batting
3. The runner-action prompt offered “Stay on 3rd” — but the diamond showed no runner on 3B
4. The at-bat history rendered out of chronological order (1st → 5th → 3rd instead of 1st → 3rd → 5th)
Each of those symptoms looked like a different bug. They were not. They were four faces of the same structural problem.

Some time before that, I had written, mostly for my own future reference, a document called docs/architecture-retrospective.md: the kind of “what would I do differently” file you write after a long debugging session, more for catharsis than for action. It listed five “structural pain points,” places where the data model wasn’t wrong exactly, but was generating recurring bug classes rather than one-off bugs. The five pain points it called out:
1. AtBat is doing too much (it’s a historical record and a container for events and a lookup point for rendering)
2. Player identity in events is fragile (SwiftData persistent identifiers can be temporary until the next save — found this out the hard way)
3. State has two implementations (live view-model state vs. reconstructed-from-history state) that drift
4. Catch-up from MLB feed and manual scoring are parallel implementations that diverge subtly
5. Substitution semantics (pinch hitters, pinch runners, defensive substitutions) are tangled across three different storage locations
The retrospective predicted that these pain points would generate exactly the bug classes that the tester’s screenshot demonstrated. Reading the report, I could point at each symptom and say which structural pain it came from. That’s a useful diagnostic moment and a horrible feeling at the same time. The doc had explicitly listed “the same bug class keeps recurring” as a triggering criterion for pulling refactor work forward. The screenshot tripped it.

I pulled three refactors that were scheduled for 1.1 and 1.2 into the 1.0 release, shipped them across builds 10–12, and the entire class of “catch-up shows impossible state” bugs disappeared as a side effect of the refactors rather than as a targeted patch.

The lesson — and this is one of the few times I’m going to be tutorial-mode prescriptive in this post — is write the retrospective doc before you need it. Not as planning. Not as a refactor commitment. As a catch-basin for “this keeps biting me” intuitions, with explicit triggering criteria for when intuition becomes action. Mine sits in the repo at docs/architecture-retrospective.md. When the trigger fires, you don’t have to re-derive the analysis under pressure. You just open the doc and execute the plan you wrote when your head was clear.

I would not have written that doc without Claude. Not because it required AI to write — it didn’t — but because the conversational format of working with Claude generates these documents as a natural side effect of bug-fix sessions. “Tell me what we’re actually fighting here” turns into a doc that I can keep, not a Slack thread that scrolls into oblivion.

The honest regret: two paths that should have been one

Here’s the architectural decision I’d take back if I could.

BaseballScorer can score a game two ways. You can score it by hand, pitch by pitch — the original use case, the one I designed for. Or you can let the app pull from the MLB Stats API and “catch up” to a live game, populating the scorecard from the feed so you can join in mid-game without having to manually backfill the first three innings.

These two paths share almost no code. Manual scoring goes through ScoringViewModel.recordResult / recordPitch / placeRunner and friends. Catch-up goes through MLBAutoFillService.populateFromFeed, which directly mutates the SwiftData models. By the time I noticed this was a problem, both paths had grown enough complexity that unifying them wasn’t a quick refactor.

The cost shows up most clearly in runner advancement. On the manual path, the user has full control — they can move every runner exactly where they need to be. On the catch-up path, if the MLB feed doesn’t surface a runner movement (or we miss one during ingestion), it’s just gone, with no equivalent corrective UI. Two paths, two test surfaces, two places to fix every bug, and a class of “catch-up does X but manual does Y” inconsistencies that I’ve patched at least a dozen times.

If I were starting over, I’d build a typed event log first, and force both paths to produce events that feed a single applier. Both the manual UI and the feed parser would emit the same runnerMovement events; one code path would consume them. The retrospective doc lays this out as a future refactor — possibly worth doing if 1.4’s “Live Game Assistance” theme makes the divergence painful enough — but it would have been trivial to design in on day one and is genuinely hard to refactor in now.
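
To make the "typed event log, single applier" shape concrete, here is a rough sketch. It is written in Java (the language used for code elsewhere on this blog) rather than the app's Swift, and every name in it (ScoringEvent, RunnerMovement, FeedRow, GameState, Producers) is hypothetical. It is not the app's model, only the structure I wish I had started with: both producers emit events, and exactly one method turns events into state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical event types; the app's real Swift model differs. */
sealed interface ScoringEvent {
    /** A runner moves between bases: 0 = batter's box, 1-3 = bases, 4 = scored. */
    record RunnerMovement(String runnerId, int fromBase, int toBase) implements ScoringEvent {}
    record OutRecorded(String playerId) implements ScoringEvent {}
}

/** Hypothetical shape of one parsed feed entry. */
record FeedRow(String runnerId, int startBase, int endBase) {}

/** The single applier: the only code that turns events into game state. */
final class GameState {
    private final Map<Integer, String> runnersByBase = new HashMap<>();
    private int outs = 0;

    void apply(ScoringEvent event) {
        if (event instanceof ScoringEvent.RunnerMovement move) {
            runnersByBase.remove(move.fromBase());
            // Only bases 1-3 hold a runner; 4 means the runner scored.
            if (move.toBase() >= 1 && move.toBase() <= 3) {
                runnersByBase.put(move.toBase(), move.runnerId());
            }
        } else if (event instanceof ScoringEvent.OutRecorded) {
            outs++;
        }
    }

    String runnerOn(int base) { return runnersByBase.get(base); }
    int outs() { return outs; }
}

/** Both producers only emit events; neither mutates state directly. */
final class Producers {
    static List<ScoringEvent> fromManualTap(String runnerId, int fromBase, int toBase) {
        return List.of(new ScoringEvent.RunnerMovement(runnerId, fromBase, toBase));
    }

    static List<ScoringEvent> fromFeed(List<FeedRow> rows) {
        return rows.stream()
                .<ScoringEvent>map(r -> new ScoringEvent.RunnerMovement(
                        r.runnerId(), r.startBase(), r.endBase()))
                .toList();
    }
}
```

The payoff is that a missed runner movement on the feed path becomes a missing event, which the same corrective UI that emits manual events could supply, instead of a mutation that never happened.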

The general lesson, if you want one: when you have two code paths that produce “the same kind of state” through different mechanisms, ask very hard whether they can share a layer. The answer is almost always yes, and almost always you’ll only see how to do it once you’ve already built both.

A few things I would tell you to do

Tutorial mode, briefly, because abstract advice gets nodded at and forgotten:
- Keep custom skills minimal. On my prior Java projects I had eight specialized agents — config-manager, debugging-helper, documentation-writer, framework-developer, performance-optimizer, pipeline-specialist, service-developer, test-writer — each with its own persona prompt. In hindsight: overkill. On BaseballScorer I have five skills (bug-fix, release, commit, testflight-upload, security-review), each tied to a specific recurring multi-step workflow that’s actually painful to do by hand. That’s the right number. If you find yourself writing a skill for “the documentation persona,” that’s a sign your main agent is fine and you’re inventing problems.
- Write the failing test first for bug fixes. Even when Claude is going to write the test for you. The discipline keeps you honest about whether you’re actually fixing the bug or just papering over a symptom. My bug-fix skill enforces this by convention — it won’t write a fix until there’s a test file with test_bugfix_<shortDescription> in it.
- Branch off the buggy build’s tag, not main. When a bug ships in v1.2-b23, the fix branch starts from that tag, not from current main. This guarantees the reproducer test actually reproduces the bug in question, and gives you a clean cherry-pick path back to main once the fix is verified. I thought this was standard practice from my pre-Claude career; I’m told it’s less common than I assumed.
- Make App Store metadata source-of-truth in your repo, not in App Store Connect. I learned this one the hard way. I added some marketing copy (“no ads, no paywall”) directly in App Store Connect and forgot about it. A subsequent fastlane push regenerated the metadata from a doc in my repo and overwrote my edits with no warning. Now docs/app-store-metadata.md is the only thing I touch, and fastlane is the only thing that talks to App Store Connect.
- Write the retrospective doc before you need it. I already preached this one above. I’ll say it again because it’s the highest-ROI habit I’ve adopted on this project.

What’s next

If you’re a baseball scorer — or curious enough about scoring to want to learn — the app is on the App Store, and the companion scoring guide lives at scoring.theyawns.com. The guide is about 20,000 words of “here’s how baseball scoring actually works,” from “what is a 6-4-3?” to the Manager Challenge notation we added in 1.3. If you’re wondering whether to bother learning to score: the app makes it about as low-stakes as it can be, and the guide tries to do the same.

The next post in this series gets into the release workflow — the actual fastlane glue, the custom skills, the gotchas I hit, the time fastlane silently crashed on a non-ASCII byte and I had to learn more about shell locales than I wanted to. The post after that is on the persistent memory system that lets Claude keep coherent context across months of work without me re-explaining the codebase every session. And the final post is the one that’s most specifically for iOS developers: a side-by-side guide to moving from standalone (terminal) Claude Code to the Xcode-integrated version, including the commands and modes that aren’t there and what to do instead. All three will be more concrete and more tutorial-shaped than this one.

That’s where we’ll leave things for today.

Part of an ongoing series at Nodes and Edges. Earlier post in a related vein: Baseball Invented Event Sourcing 150 Years Ago.
June 24, 2026

MCP Server for Microservices: AI-Powered Debugging

Part 6 in the “Building Event-Driven Microservices with Hazelcast” series

Introduction

Over the first five articles, we built an event sourcing framework, a Jet pipeline, materialized views, a choreographed saga pattern, and vector similarity search. That’s a lot of infrastructure. It also means that investigating a problem — say, a failed saga — involves chaining together five or six curl commands across four different services, reading JSON output with your eyes, extracting IDs by hand, and constructing the next request.

Which is fine. It’s what we’ve always done. But there’s a better option now.

The Model Context Protocol (MCP) is an open standard that lets AI assistants — Claude, ChatGPT, Copilot, whoever — call tools exposed by external servers. Instead of the assistant guessing at curl commands or asking you to copy-paste output, it directly queries your materialized views, submits events, inspects saga state, and runs demo scenarios.

In this article, we build an MCP server that bridges AI assistants to our eCommerce microservices. And yes, there is something a little meta about using Claude to build a framework and then building a bridge so Claude can operate the framework. We’re going with it.

Why Give an AI Access to Your Microservices?

Consider a typical debugging session. A saga has failed, and you want to know why:

# Step 1: Find failed sagas
curl "http://localhost:8083/api/sagas?status=FAILED"

# Step 2: Copy a saga ID from the JSON output
curl http://localhost:8083/api/sagas/saga-a7f3e2

# Step 3: Check the order that triggered it
curl http://localhost:8083/api/orders/ord-12345

# Step 4: Check the event history
curl http://localhost:8083/api/orders/ord-12345/events

# Step 5: Check if stock was released as part of compensation
curl http://localhost:8082/api/products/prod-67890

Five commands. Each one requires reading JSON output, finding the right ID, and constructing the next request. You’re doing the orchestration in your head, and — let’s be honest — that’s exactly the kind of tedious mechanical chaining that humans are bad at and computers are good at.

With MCP, the same investigation is a single sentence:

“Why did the most recent saga fail?”

The AI calls list_sagas(status="FAILED"), then inspect_saga(sagaId="saga-a7f3e2"), then get_event_history(aggregateId="ord-12345", aggregateType="Order"), interprets all the responses, and gives you a summary:

“Saga saga-a7f3e2 failed at the payment step. Order ord-12345 had a total of $15,000, which exceeded the $10,000 payment limit. Compensation ran successfully: stock for product prod-67890 was released.”

Three tool calls, zero curl commands, and a root-cause analysis. From one question.

What Is MCP?

MCP (Model Context Protocol) is an open specification by Anthropic that defines a standard interface between AI assistants and external tools. Think of it as a contract:

MCP protocol sequence: the AI assistant sends tools/list and tools/call to the MCP server, which returns tool definitions and JSON results over JSON-RPC

The protocol uses JSON-RPC 2.0 over one of two transports:

Transport	How It Works	Best For
stdio	AI assistant launches the server as a subprocess; communicates via stdin/stdout	Local development with Claude Code or Claude Desktop
SSE (HTTP)	Server runs as a web service; AI connects over HTTP with Server-Sent Events	Docker, remote deployment, multi-user
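
To make the wire format less abstract, here is roughly what a tools/call request looks like, built as a plain string in Java. The envelope fields (jsonrpc, id, method, params with name and arguments) follow the MCP specification as we understand it; the ToolCallRequest helper itself is a hypothetical name for illustration, it handles only a single string argument, and it does no JSON escaping, so treat it as a sketch rather than something to ship.

```java
/** Hypothetical helper that shows the shape of an MCP tools/call request. */
final class ToolCallRequest {
    /** Builds a JSON-RPC 2.0 "tools/call" request body for a tool taking one string argument. */
    static String build(int id, String toolName, String argName, String argValue) {
        // No escaping is done here; a real client would use a JSON library.
        return """
                {"jsonrpc":"2.0","id":%d,"method":"tools/call",\
                "params":{"name":"%s","arguments":{"%s":"%s"}}}"""
                .formatted(id, toolName, argName, argValue);
    }
}
```

Calling ToolCallRequest.build(1, "list_sagas", "status", "FAILED") produces the request an assistant would send to ask for failed sagas; the server's reply comes back as a JSON-RPC result carrying the same id.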

The AI assistant doesn’t need to know anything about Hazelcast, Jet pipelines, or event sourcing. It sees ten tools with descriptions and parameters. The MCP server handles the translation between “query the customer view” and “GET http://account-service:8081/api/customers.”

Designing Tools Around Event Sourcing

The hardest part of building an MCP server isn’t the protocol — it’s deciding what tools to expose. Too many and the AI gets confused about which one to use. Too few and it can’t do useful work. We went back and forth on this and started with seven, organized around the three concerns of an event-sourced system. Three more got added later for dead letter queue recovery, which we’ll get to in a moment.

Queries (Read Current State)

Tool	What It Does
query_view	Read materialized views — current state of customers, products, orders, payments
get_event_history	Read the event log — how an entity reached its current state

These map to the read side of CQRS. Views give you the “what,” event history gives you the “why.”

Commands (Produce New Events)

Tool	What It Does
submit_event	Create customers, products, orders; cancel orders; reserve stock; process payments; refund payments
run_demo	Execute multi-step scenarios (happy path, payment failure, saga timeout, sample data)

Each command produces domain events that flow through the Jet pipeline. run_demo chains multiple commands together to set up investigation targets — a failed payment saga, a timeout scenario, a happy path to compare against.

Observability (Inspect the System)

Tool	What It Does
inspect_saga	View a saga’s status, steps completed, timing, and failure reason
list_sagas	Browse sagas filtered by status
get_metrics	Aggregated system metrics — saga counts, event throughput, active gauges

Dead Letter Queue (Investigate and Replay Failures)

Tool	What It Does
list_dlq_entries	List failed events that landed in the dead letter queue, with a pending-count summary for quick triage
inspect_dlq_entry	View a single DLQ entry: event data, failure reason, saga context, replay count
replay_dlq_entry	Republish a DLQ entry’s event for reprocessing — after the cause is fixed

We hadn’t built the DLQ machinery yet when the MCP server first shipped, so these three were added later. The investigation workflow — list, inspect, then decide to replay or not — turned out to map cleanly onto how a human operator works through a queue of failed events. Asking the AI to walk that with you, one entry at a time, is dramatically less tedious than the curl version.

Ten tools, four categories, no overlap. The AI handles any reasonable question about the system, and tool selection stays reliable — you’d never call get_metrics when you meant query_view, or list_dlq_entries when you meant list_sagas. The shape of the tool decides which question it answers.

Architecture: A Pure REST Proxy

The MCP server sits between the AI assistant and the microservices:

MCP server architecture: an AI assistant connects via the MCP protocol to a Spring Boot MCP server on port 8085, which proxies REST calls to the Account, Inventory, Order, and Payment services

We made a deliberate choice here: the MCP server has no Hazelcast dependency. It doesn’t join any cluster, doesn’t read IMaps, doesn’t run Jet jobs. It’s a thin REST proxy that translates MCP tool calls into HTTP requests against the existing service APIs.

Why go to the trouble of keeping them separate? Because coupling the MCP server to Hazelcast would mean classpath conflicts with the services, a dependency on the data layer that makes testing painful, and another component that needs Hazelcast configuration. As a pure proxy, the server needs maybe 128-256 MB of heap, has no classpath conflicts, and you can test every tool by mocking REST responses without running a single service.

Implementation

The ServiceClient

All HTTP communication goes through one class:

@Component
public class ServiceClient implements ServiceClientOperations {

    private final McpServerProperties properties;
    private final RestClient restClient;

    public Map<String, Object> getEntity(String viewName, String id) {
        String url = resolveUrl(viewName) + "/" + id;
        String json = restClient.get().uri(url).retrieve().body(String.class);
        return parseMap(json);
    }

    String resolveUrl(String viewName) {
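        // Locale.ROOT (java.util.Locale) keeps the lookup independent of the JVM's default locale.
        return switch (viewName.toLowerCase(Locale.ROOT)) {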
            case "customer" -> properties.getAccountUrl() + "/api/customers";
            case "product"  -> properties.getInventoryUrl() + "/api/products";
            case "order"    -> properties.getOrderUrl() + "/api/orders";
            case "payment"  -> properties.getPaymentUrl() + "/api/payments";
            default -> throw new IllegalArgumentException("Unknown view: " + viewName);
        };
    }
}

That resolveUrl switch is the only place that knows which service owns which view. Every tool delegates to ServiceClient rather than making HTTP calls directly.

The ServiceClientOperations interface exists because, in our setup, Mockito’s inline mock maker on Java 25 could not mock concrete classes. We hit this wall across the framework, and the solution every time was to extract an interface so tests can mock it. It’s a slightly annoying pattern, but it works.
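
Once the interface exists, a tool can be exercised without any running service. The sketch below uses a hand-written fake instead of Mockito to keep it self-contained. ViewOperations is a trimmed, hypothetical stand-in for ServiceClientOperations, and ViewLookup is a simplified stand-in for a tool's logic; neither is the project's actual code.

```java
import java.util.List;
import java.util.Map;

/** Trimmed, hypothetical stand-in for the article's ServiceClientOperations. */
interface ViewOperations {
    Map<String, Object> getEntity(String viewName, String id);
    List<Map<String, Object>> listEntities(String viewName, int limit);
}

/** Simplified tool logic: look up one entity by key, or list up to `limit` entities. */
final class ViewLookup {
    private final ViewOperations client;

    ViewLookup(ViewOperations client) {
        this.client = client;
    }

    Object query(String viewName, String key, Integer limit) {
        if (key != null && !key.isBlank()) {
            return client.getEntity(viewName, key);
        }
        // A missing or non-positive limit falls back to the default of 10.
        int effectiveLimit = (limit != null && limit > 0) ? limit : 10;
        return client.listEntities(viewName, effectiveLimit);
    }
}

/** Hand-written fake: returns canned data and records the limit it was asked for. */
final class FakeViewOperations implements ViewOperations {
    int lastLimit = -1;

    @Override
    public Map<String, Object> getEntity(String viewName, String id) {
        return Map.of("view", viewName, "id", id);
    }

    @Override
    public List<Map<String, Object>> listEntities(String viewName, int limit) {
        lastLimit = limit;
        return List.of(Map.of("view", viewName, "id", "cust-123"));
    }
}
```

A test then constructs ViewLookup with the fake and checks two things: a call with a key returns the single entity, and a call without a limit reaches the client with the default of 10.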

A Tool Implementation

Each tool is a Spring @Service with a @Tool-annotated method. Here’s QueryViewTool:

@Service
public class QueryViewTool {

    private final ServiceClientOperations serviceClient;

    @Tool(description = "Query a materialized view. "
            + "Available views: customer, product, order, payment. "
            + "Provide a key to get a specific entity, or omit to list entities.")
    public String queryView(
            @ToolParam(description = "View to query: customer, product, order, or payment")
            String viewName,
            @ToolParam(description = "Optional: specific entity ID", required = false)
            String key,
            @ToolParam(description = "Max results when listing (default: 10)", required = false)
            Integer limit) {

        if (key != null && !key.isBlank()) {
            return toJson(serviceClient.getEntity(viewName, key));
        } else {
            int effectiveLimit = (limit != null && limit > 0) ? limit : 10;
            List<Map<String, Object>> results = serviceClient.listEntities(viewName, effectiveLimit);
            return toJson(Map.of(
                    "view", viewName,
                    "count", results.size(),
                    "entities", results
            ));
        }
    }
}

That @Tool description is doing real work. The AI reads it to decide which tool to call and what parameters to provide. If you’re vague — “query data” instead of “Query a materialized view. Available views: customer, product, order, payment” — the AI picks the wrong tool or provides wrong parameters. We learned this the hard way. Be specific. Name the available views. Explain what happens with versus without a key.

The optional parameters with defaults matter too. When the AI omits key, the tool lists entities. When it omits limit, you get 10. This lets a single tool handle “show me all customers” and “look up customer cust-123” without the AI having to supply every parameter on every call.

Tool Registration

All ten tools get registered in one place:

@Configuration
public class McpToolConfig {

    @Bean
    public ToolCallbackProvider mcpTools(QueryViewTool queryView,
                                         SubmitEventTool submitEvent,
                                         GetEventHistoryTool getEventHistory,
                                         InspectSagaTool inspectSaga,
                                         ListSagasTool listSagas,
                                         GetMetricsTool getMetrics,
                                         RunDemoTool runDemo,
                                         ListDlqEntriesTool listDlqEntries,
                                         InspectDlqEntryTool inspectDlqEntry,
                                         ReplayDlqEntryTool replayDlqEntry) {
        return MethodToolCallbackProvider.builder()
                .toolObjects(queryView, submitEvent, getEventHistory,
                        inspectSaga, listSagas, getMetrics, runDemo,
                        listDlqEntries, inspectDlqEntry, replayDlqEntry)
                .build();
    }
}

Spring AI’s MethodToolCallbackProvider scans each object for @Tool methods and registers them with the MCP server. When the AI calls tools/list, it gets back all ten tool definitions with their descriptions and parameter schemas.
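
We won't reproduce Spring AI's internals here, but the general idea of annotation scanning can be sketched with plain reflection. Everything below is hypothetical: DemoTool is a made-up annotation standing in for @Tool, and DemoToolScanner is not how MethodToolCallbackProvider is implemented, only an illustration of how a registry can discover annotated methods and their descriptions.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

/** Made-up annotation standing in for @Tool; it must be retained at runtime to be scannable. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface DemoTool {
    String description();
}

/** Example tool object with one annotated method and one ordinary method. */
final class DemoQueryTool {
    @DemoTool(description = "Query a materialized view.")
    public String queryView(String viewName) {
        return "{}";
    }

    public String notATool() {
        return "";
    }
}

final class DemoToolScanner {
    /** Returns method name -> description for every @DemoTool method on the given objects. */
    static Map<String, String> scan(Object... toolObjects) {
        Map<String, String> tools = new TreeMap<>();
        for (Object tool : toolObjects) {
            for (Method method : tool.getClass().getDeclaredMethods()) {
                DemoTool annotation = method.getAnnotation(DemoTool.class);
                if (annotation != null) {
                    tools.put(method.getName(), annotation.description());
                }
            }
        }
        return tools;
    }
}
```

Scanning a DemoQueryTool instance yields a single entry for queryView. The unannotated method is ignored, which is why the description text on each annotated method is the only thing the AI ever sees about a tool.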

The Event Dispatch Pattern

SubmitEventTool deserves a closer look because it maps a single tool to seven different service endpoints:

Map<String, Object> dispatch(String eventType, Map<String, Object> payload) {
    return switch (eventType) {
        case "CreateCustomer"  -> serviceClient.createEntity("customer", payload);
        case "CreateProduct"   -> serviceClient.createEntity("product", payload);
        case "CreateOrder"     -> serviceClient.createEntity("order", payload);
        case "CancelOrder"     -> {
            String orderId = requireField(payload, "orderId");
            yield serviceClient.performAction("order", orderId, "cancel", payload, true);
        }
        case "ReserveStock"    -> {
            String productId = requireField(payload, "productId");
            yield serviceClient.performAction("product", productId, "stock/reserve", payload, false);
        }
        case "ProcessPayment"  -> serviceClient.createEntity("payment", payload);
        case "RefundPayment"   -> {
            String paymentId = requireField(payload, "paymentId");
            yield serviceClient.performAction("payment", paymentId, "refund", payload, false);
        }
        default -> throw new IllegalArgumentException("Unknown event type: " + eventType);
    };
}

The alternative would be seven separate tools: create_customer, create_product, and so on. We went with a single submit_event tool with an eventType discriminator for three reasons: it mirrors the event sourcing model (the system is event-driven, so the tool should feel event-driven), it keeps the total tool count at ten instead of sixteen, and the AI handles the dispatch naturally. When you say “create a customer named Alice,” it maps that to eventType="CreateCustomer" without difficulty.
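The dispatch code leans on a requireField helper that isn't shown. A minimal sketch of what such a helper might look like follows, assuming it should fail fast with an IllegalArgumentException when the field is missing or blank. The class name PayloadFields and the exact error message are assumptions, not the project's code.

```java
import java.util.Map;

final class PayloadFields {
    // Returns the named field as a String, or fails fast with a clear message.
    // Assumption: a missing, null, or blank value counts as "not provided".
    static String requireField(Map<String, Object> payload, String field) {
        Object value = payload == null ? null : payload.get(field);
        if (value == null || value.toString().isBlank()) {
            throw new IllegalArgumentException("Missing required field: " + field);
        }
        return value.toString();
    }
}
```

Failing before the service call matters here because the ID becomes part of the downstream URL; a clear message also gives the AI something it can act on when it retries.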

The Demo Tool

RunDemoTool is the most complex tool because each scenario chains multiple service calls:

private Map<String, Object> runHappyPath() {
    // Step 1: Create customer
    Map<String, Object> customer = serviceClient.createEntity("customer", Map.of(
            "name", "Demo Customer",
            "email", "demo-" + shortId() + "@example.com",
            "address", "123 Demo Street"
    ));

    // Step 2: Create product
    Map<String, Object> product = serviceClient.createEntity("product", Map.of(
            "sku", "DEMO-" + shortId(),
            "name", "Demo Widget",
            "price", "29.99",
            "quantityOnHand", 100
    ));

    // Step 3: Create order (uses IDs from previous steps)
    String customerId = extractId(customer, "customerId");
    String productId = extractId(product, "productId");
    Map<String, Object> order = serviceClient.createEntity("order", Map.of(
            "customerId", customerId,
            "customerName", "Demo Customer",
            "lineItems", List.of(Map.of(
                    "productId", productId,
                    "productName", "Demo Widget",
                    "quantity", 2,
                    "unitPrice", 29.99
            ))
    ));

    return Map.of("scenario", "happy_path", "steps", List.of(...));
}

Each scenario uses shortId() — a UUID fragment — so you can run the same scenario multiple times without naming collisions. The payment_failure scenario creates a $16,500 order that exceeds the $10,000 payment limit, triggering saga compensation. The saga_timeout scenario creates an order with minimal stock, designed to hit the deadline. These are pre-built investigation targets — the AI equivalent of a test fixture.
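For reference, a UUID-fragment helper can be as small as the sketch below. The eight-character length and the DemoIds class name are assumptions; the project's shortId() may slice the UUID differently.

```java
import java.util.UUID;

final class DemoIds {
    // First 8 hex characters of a random UUID: short enough to read,
    // and unlikely to collide across repeated demo runs.
    static String shortId() {
        return UUID.randomUUID().toString().substring(0, 8);
    }
}
```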

Stdio vs. SSE: Two Transport Modes

Default: stdio (Local Development)

# application.properties
spring.main.web-application-type=none
spring.ai.mcp.server.name=ecommerce-mcp-server

The AI assistant launches the server as a subprocess and communicates via stdin/stdout using JSON-RPC:

[Diagram: stdio transport. Claude Code spawns the MCP server as a java -jar subprocess and communicates over stdin and stdout using JSON-RPC 2.0.]

No network port needed. This is the default for local development with Claude Code or Claude Desktop.
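To give a feel for what travels over that pipe, here is a sketch of the tools/list request a client sends once the initialize handshake is done. Over stdio, each JSON-RPC message is written as a single line. The ToolsListRequest class is a hypothetical helper for illustration; in practice the client library builds these messages for you.

```java
final class ToolsListRequest {
    // Builds a JSON-RPC 2.0 request asking the server to list its tools.
    // The id is chosen by the client and echoed back in the response.
    static String build(int id) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id + ",\"method\":\"tools/list\"}";
    }
}
```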

Docker: SSE/HTTP (Networked Deployment)

# application-docker.properties
spring.main.web-application-type=servlet
spring.ai.mcp.server.stdio=false
server.port=8085

In Docker, the MCP server runs as a web service with Server-Sent Events on port 8085:

mcp-server:
  build: ../mcp-server
  ports:
    - "8085:8085"
  environment:
    - SPRING_PROFILES_ACTIVE=docker
    - MCP_SERVICES_ACCOUNT_URL=http://account-service:8081
    - MCP_SERVICES_INVENTORY_URL=http://inventory-service:8082
    - MCP_SERVICES_ORDER_URL=http://order-service:8083
    - MCP_SERVICES_PAYMENT_URL=http://payment-service:8084

The profile switch is the only difference between the two modes. Same tool code, same behavior.

Testing

Each tool has unit tests that mock ServiceClientOperations:

@ExtendWith(MockitoExtension.class)
class QueryViewToolTest {

    @Mock
    private ServiceClientOperations serviceClient;

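    private QueryViewTool queryViewTool;

    // Parses the tool's JSON string result back into a map for assertions
    private final ObjectMapper objectMapper = new ObjectMapper();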

    @BeforeEach
    void setUp() {
        queryViewTool = new QueryViewTool(serviceClient);
    }

    @Test
    void shouldQueryByKey() throws JsonProcessingException {
        when(serviceClient.getEntity("customer", "c1"))
                .thenReturn(Map.of("customerId", "c1", "name", "Alice"));

        String result = queryViewTool.queryView("customer", "c1", null);

        verify(serviceClient).getEntity("customer", "c1");
        Map<String, Object> parsed = objectMapper.readValue(result, new TypeReference<>() {});
        assertNotNull(parsed.get("customerId"));
    }
}

Eleven test classes cover all ten tools plus the ServiceClient. Add another six for the security layer (more on that below) and one integration suite, and the mcp-server module sits at 143 tests total.

Integration tests use Spring Boot’s ApplicationContextRunner to verify bean wiring without starting the MCP stdio transport (which would block in a test environment):

@DisplayName("MCP Tool Integration")
class McpToolIntegrationTest {

    private final ApplicationContextRunner contextRunner = new ApplicationContextRunner()
            .withConfiguration(AutoConfigurations.of(McpToolConfig.class))
            .withUserConfiguration(TestServiceClientConfig.class)
            .withBean(McpServerProperties.class);

    @Test
    void shouldCreateAllToolBeans() {
        contextRunner.run(context -> {
            assertThat(context).hasSingleBean(QueryViewTool.class);
            assertThat(context).hasSingleBean(SubmitEventTool.class);
            // ... all 10 tools
        });
    }

    @Test
    void shouldRegisterToolCallbackProvider() {
        contextRunner.run(context -> {
            ToolCallbackProvider provider = context.getBean(ToolCallbackProvider.class);
            assertThat(provider.getToolCallbacks()).hasSize(10);
        });
    }
}

Configuration

Beyond the transport settings shown earlier, the MCP server needs just four configuration properties, one URL per backend service:

mcp.services.account-url=http://localhost:8081
mcp.services.inventory-url=http://localhost:8082
mcp.services.order-url=http://localhost:8083
mcp.services.payment-url=http://localhost:8084

In Docker, these are overridden by environment variables pointing to container hostnames. That’s it. No Hazelcast configuration, no cluster membership, no pipeline setup.

Setting Up the AI Assistant

Claude Code

Add to ~/.claude/claude_code_config.json:

{
  "mcpServers": {
    "hazelcast-ecommerce": {
      "command": "java",
      "args": ["-jar", "mcp-server/target/mcp-server-1.0.0-SNAPSHOT.jar"]
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "hazelcast-ecommerce": {
      "command": "java",
      "args": ["-jar", "/absolute/path/to/mcp-server-1.0.0-SNAPSHOT.jar"]
    }
  }
}

Docker (SSE)

Point the AI assistant to the SSE endpoint: http://localhost:8085/mcp

In all three cases, the AI discovers the ten tools via tools/list and starts using them immediately.

A Real Conversation

Here’s what a debugging session looks like with MCP connected:

You: “Something’s wrong with recent orders. Can you check?”

Assistant: calls list_sagas(status="FAILED")

calls get_metrics()

“I found 2 failed sagas out of 142 total. Your overall success rate is 94.4%. Let me investigate the failures.”

calls inspect_saga(sagaId="saga-fail-1")

calls inspect_saga(sagaId="saga-fail-2")

“Both failures are payment rejections:

saga-fail-1 (Order ORD-088): Total $15,000 exceeded the $10,000 payment limit. Stock was released via compensation.
saga-fail-2 (Order ORD-091): Total $16,500, same payment limit issue. Stock also released.

These aren’t bugs — the payment service is correctly rejecting orders above the limit. You might want to either raise the limit or add client-side validation to prevent orders over $10,000 from being submitted.”

Four tool calls, zero curl commands, and a root-cause analysis with a recommendation. From one question. I’ll be honest: the first time I watched the AI chain together the right sequence of calls and arrive at a correct diagnosis, it felt a little eerie. Like watching someone drive your car better than you do.

Authentication and Tool Authorization

The first version of this server had no authentication, which is fine for local development and obviously not fine for anything else. So we’ve added API key authentication and role-based tool access — disabled by default to preserve backward compatibility, and enabled with a single property when you need it.

mcp:
  security:
    enabled: true
    api-keys:
      viewer-key-12345: VIEWER
      operator-key-67890: OPERATOR
      admin-key-99999: ADMIN

In HTTP/SSE mode the key arrives in the X-API-Key request header. In stdio mode it’s read from the MCP_API_KEY environment variable. Either way, the server resolves the key to a role, and a ToolAuthorizer checks whether the role is permitted to invoke the tool the AI just asked for.

Three roles are defined:

VIEWER — Read-only. Can call query_view, get_event_history, inspect_saga, list_sagas, get_metrics, list_dlq_entries, and inspect_dlq_entry. Cannot modify state.
OPERATOR — Read plus write. Adds submit_event, run_demo, and replay_dlq_entry.
ADMIN — Same as OPERATOR today, reserved for future admin-only tools.

run_demo is a good example of why the role split matters — it’s the kind of tool you absolutely do not want firing in production, and the default VIEWER key keeps that off the table. The viewer can do everything an SRE wants to do during an incident — query, inspect, look at metrics — but it can’t accidentally place an order.
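The key-to-role-to-tool check can be sketched in a few lines of plain Java. This is an illustration of the idea under stated assumptions, not the project's ToolAuthorizer: the class name ToolAuthorizerSketch, the Role enum, and the isAllowed signature are all hypothetical. Only the tool names and the role split come from the description above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

enum Role { VIEWER, OPERATOR, ADMIN }

final class ToolAuthorizerSketch {
    // Tools every role may call.
    private static final Set<String> READ_TOOLS = Set.of(
            "query_view", "get_event_history", "inspect_saga", "list_sagas",
            "get_metrics", "list_dlq_entries", "inspect_dlq_entry");
    // Tools that change state: OPERATOR and ADMIN only.
    private static final Set<String> WRITE_TOOLS = Set.of(
            "submit_event", "run_demo", "replay_dlq_entry");

    private final Map<String, Role> apiKeys;

    ToolAuthorizerSketch(Map<String, Role> apiKeys) {
        this.apiKeys = Map.copyOf(apiKeys);
    }

    // Unknown or missing keys resolve to no role at all.
    Optional<Role> resolve(String apiKey) {
        return Optional.ofNullable(apiKey).map(apiKeys::get);
    }

    // Deny by default: no role means no access, and unknown tools are refused.
    boolean isAllowed(String apiKey, String toolName) {
        return resolve(apiKey)
                .map(role -> READ_TOOLS.contains(toolName)
                        || (role != Role.VIEWER && WRITE_TOOLS.contains(toolName)))
                .orElse(false);
    }
}
```

With the keys from the YAML above, isAllowed("viewer-key-12345", "run_demo") comes back false while the same call with the operator key comes back true. The deny-by-default shape is the design choice worth copying: a tool added later stays locked until someone lists it.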

One layer is still missing: the MCP server authenticates its callers, but it doesn’t forward caller identity to the downstream microservices. For a real production deployment you’d want both. We’ll come back to that.

Where This Goes Next

A few directions we haven’t explored yet.

MCP supports streaming responses, which we’d want for large result sets — listing thousands of events as a single JSON blob isn’t great. MCP also has resources, read-only data endpoints that the AI can reference as context without explicitly calling a tool. The materialized views are a natural fit for that.

OAuth forwarding is the gap mentioned above — the MCP server’s caller identity needs to propagate down to the backend services if we want end-to-end auth in production. The plumbing exists in Spring Security; we just haven’t wired it up.

And with the MCP server as a foundation, you could build specialized AI agents — an operations agent that monitors sagas and flags anomalies, a demo agent that walks users through the system, a testing agent that creates targeted test data and verifies compensation paths. We haven’t built any of these yet, but the tool layer is there.

The MCP server adds a natural-language interface to everything we’ve built so far. Ten tools, a thin REST proxy, two transport modes, role-based authorization, 143 tests. It doesn’t add new capabilities to the data layer — it makes the existing capabilities accessible through conversation. And that turns out to matter more than it sounds like it should. The investigation that took five curl commands now takes one sentence. The demo that required a script and documentation now requires “show me the happy path.” The system that was only inspectable by people who knew the API endpoints is now inspectable by anyone who can ask a question.

That’s where we’ll leave things for today.

Next up: Circuit Breakers and Retry for Saga Resilience

Previous: Vector Similarity Search with Hazelcast

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 8, 2026

On the Vector Store I Didn’t Ask For

A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series

AI has been instrumental in bringing this project to fruition — I’m not making any secret of that. The first three posts in this series describe work that was largely pre-existing demo code: domain objects, the Jet pipeline, the materialized view machinery. Claude polished what was already there and helped me write about it. Honest work, but mostly cleanup.

The saga post (post 4) marked a shift: that’s where the demo’s functionality moved into genuinely new territory. And because Hazelcast had recently added a VectorCollection data structure and vector search capability (still in beta at the time), I was eager to incorporate it. So I asked Claude to design and implement something. I should have kept a close eye on the work at every stage; instead I took more of an “I’ll review everything when you’re done” approach.

I was in for a surprise.

What came back was a working vector search implementation. What did not come back was anything built on Hazelcast’s VectorCollection. Claude had built one from scratch — an IMap<String, float[]> for the embeddings, brute-force cosine similarity at query time. No HNSW indexing, no clever data structure, just compute the distance to every vector and sort the results. It worked. The “similar products” endpoint returned plausibly similar products.

This is exactly the thing creating so much fear and doomsaying around AI in the industry. If a coding assistant can reproduce the functionality of an Enterprise software feature — Enterprise edition, additional license cost — in a few hours, is all enterprise software an endangered species?

Not quite. Brute-force cosine similarity is O(n) per query — fine for a demo catalog, fine for a small product line, but not the same animal as Hazelcast’s Enterprise VectorCollection, which uses HNSW indexing to stay sub-millisecond at millions of vectors. That’s real engineering, and it took the Hazelcast team a lot longer than a few hours.
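For a sense of how little code the brute-force approach needs, here is a minimal sketch. It uses a plain java.util.Map as a stand-in for the IMap<String, float[]>, and the BruteForceSimilarity class and its method names are hypothetical, not the framework's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BruteForceSimilarity {
    // Cosine similarity of two equal-length vectors; 0.0 if either has zero magnitude.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores every stored vector against the query, then keeps the k best keys.
    static List<String> topK(Map<String, float[]> embeddings, float[] query, int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : embeddings.entrySet()) {
            scored.add(Map.entry(e.getKey(), cosine(e.getValue(), query)));
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            result.add(scored.get(i).getKey());
        }
        return result;
    }
}
```

Every query touches every vector, which is exactly why this is fine for a demo catalog and the wrong tool at millions of entries.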

What’s more interesting is that I ended up with both. The accidental implementation became the Community Edition fallback in the framework. The Enterprise implementation took over once I corrected course and built what I’d originally asked for. So the framework now has a VectorStoreService interface with two backends — Enterprise gets HNSW, Community gets brute force, and both work. The Community story is no longer “vector search doesn’t work without a license”; it’s “vector search works fine for modest workloads without a license, and scales seriously if you upgrade.”

I’m not sure I’d have ended up there if Claude had built what I asked for the first time.

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

June 1, 2026

Launching a Claude Code Project: Design Before You Build

I used Claude’s desktop interface for iterative design, then handed off to Claude Code for implementation.

After deciding to revive my Hazelcast Microservices Framework (MSF) project, and to do so using Claude AI to do much of the heavy lifting, it came down to figuring out how to actually do this. I had no playbook for it. Nobody does, really — we’re all making this up as we go.

I wanted to be transparent about my use of Claude, and at the same time I think the development process is interesting enough to be worthy of discussion. (Heck, maybe it’s more interesting than the framework blog posts I set out to write.) So I expect to end up with a dual series of blog posts: the framework posts — started by Claude, co-edited together, and given a final polish by me — interleaved with my observations on how the collaboration effort worked.

This first “behind the scenes” post covers the design phase: going from a vague idea to a set of design documents and an implementation plan, all before writing a single line of code.

Starting the Conversation

Here was my original prompt to Claude:

I want to use Claude Code to help me finish a demonstration project I started some time ago to show how to implement microservices using Hazelcast. (The main value of Hazelcast is to create materialized views of domain objects to maintain in-memory current state.) If it’s more effective, we can restart with a blank sheet rather than modify the existing project. I’d really like to iterate over the design several times before any coding starts — is that best done in Claude Code, or using this desktop interface? Ideally, creating various specifications or design documents before any coding starts would be perfect, if Claude can use these various documents as a guide to the coding process. How do we start?

Claude immediately suggested splitting the work across two interfaces: use the desktop/web interface for design discussions and document creation, then move to Claude Code for implementation. Made sense to me — the conversational interface is better for back-and-forth design iteration, while Claude Code excels at multi-file code generation with direct access to the project directory.

This turned out to be excellent advice. The design phase involved a lot of “what about this?” and “actually, let’s reorganize that” — the kind of exploratory conversation that works much better in a chat interface than in a code-focused tool. I tried doing some design work in Claude Code early on and it was noticeably worse — like trying to brainstorm on a whiteboard that keeps trying to compile your diagrams.

The Design Phase: A Roadmap in Nine Documents

What followed was an extended design conversation that produced nine documents over the course of a single session. I’m not going to walk through every one in detail — you can follow the links if you’re curious — but a few of them are worth talking about because of what they reveal about the collaboration process.

Getting Started: Template and Domain

Claude’s first move was to produce a comprehensive design document template covering everything from executive summary to demonstration scenarios. We never actually completed it — the conversation quickly moved in a more specific direction — but it served its purpose as a structural starting point. The architectural equivalent of a napkin sketch: useful for getting the conversation going, not meant to survive contact with reality.

Before we could fill in any template, though, we needed to pick a domain for the demonstration. Claude laid out a comparison between eCommerce and Financial Services, and we settled on a hybrid approach: start with eCommerce (universally understood, clear event flows, and I had existing code to reference) but design the framework to be domain-agnostic so other domains could be plugged in later. We also simplified from four services down to three: Account, Inventory, and Order. (A fourth service, Payment, showed up later when we built out the saga patterns. Scope creep, but the useful kind.)

That decision led to the eCommerce design document — a detailed Phase 1 design covering all three services, their APIs, events, and materialized views. Three view patterns came out of it: denormalized views (joining customer, product, and order data), aggregation views (pre-computing order statistics), and real-time status views (current inventory levels). If you’ve read the previous posts in this series, you’ll recognize these as exactly the kind of thing that makes Event Sourcing + CQRS worth the effort.

Where I Pushed Back

The conversation then turned to longer-term goals. I had ideas for observability dashboards, microbenchmarking, pluggable implementations, saga patterns, and more — far beyond what could fit in a Phase 1. Claude organized all of this into a phased requirements document spanning five phases.

We iterated over this several times, adding and reorganizing. The most significant change I made was moving Event Sourcing from Phase 2 to Phase 1. Claude had initially positioned it as an advanced feature, but I saw it as the fundamental organizing principle of the entire framework — events are the source of truth, not database rows. Once I explained my existing Hazelcast Jet pipeline architecture (where handleEvent() writes to a PendingEvents map, which triggers a Jet pipeline that persists to the EventStore, updates materialized views, and publishes to the event bus), Claude immediately agreed and restructured the phases accordingly.

This was one of the more interesting moments in the collaboration. Claude had made a reasonable default assumption about complexity ordering, but I had domain-specific knowledge about how the architecture should actually work. The back-and-forth was natural — I explained my reasoning, Claude incorporated it, and the result was better for it. If I’d just accepted the initial phasing without pushing back, the entire project would have been organized around a less coherent architecture. And honestly, I almost did just accept it. It looked reasonable. Sometimes the most important contribution you make is going “wait, actually…” when the first answer seems fine.

Other additions during this iteration:
- Vector Store integration (Phase 3, optional) for product similarity search
- An MCP Server (Phase 3) to let AI assistants query and operate the system
- Open source mandate: everything in Phases 1-2 must run on Hazelcast Community Edition
- Blog post series structure: features developed in blog-post-sized chunks

Architecture, Code Review, and the Rewrite Decision

The next few documents came quickly. The Event Sourcing discussion led to a dedicated architecture document detailing the Jet pipeline design — based heavily on my existing implementation, but now formally documented with all six pipeline stages, the EventStore design, and how event replay would work.

Then I uploaded several key source files from the original project for Claude to review: the EventSourcingController, DomainObject, SourcedEvent (later renamed to DomainEvent), EventStore, and EventSourcingPipeline. Claude produced a thorough code review comparing the existing code against the design documents. The verdict was encouraging — the core implementation was solid and matched the Phase 1 design almost perfectly. Claude recommended incremental enhancement: add correlation IDs, framework abstractions, observability, and tests on top of what was already there.

I went the other way. After thinking about the package naming, dependency versions, and scope of changes needed, I decided on a clean reimplementation using the existing code as a blueprint. This let us start with the right project structure, package names (com.theyawns.framework.*), and dependency versions (Spring Boot 3.2.x, Hazelcast 5.6.0) from the beginning rather than refactoring them in later. Sometimes — as I’d noted in the previous post — the right move is to stop patching the old cabinets and start fresh.

I won’t pretend this was a purely rational decision. Part of it was just wanting that clean-slate feeling — new project, new structure, no legacy cruft staring at me from the imports. Developers love a greenfield. We can’t help it.

The Implementation Plan

Once the architecture was validated and we’d agreed on the approach, Claude created a detailed Phase 1 implementation plan — a three-week, day-by-day schedule with code templates, success criteria, and task checklists:
- Week 1: Framework core — Maven multi-module setup, core abstractions, event sourcing controller, Jet pipeline
- Week 2: Three eCommerce services — Account, Inventory, Order with REST APIs and materialized views
- Week 3: Integration, Docker Compose, documentation, demo scenarios

We made a few tweaks (updating Hazelcast from 5.4.0 to 5.6.0, for instance), and then it was time to move to code.

The Handoff to Claude Code

Claude provided specific instructions for transitioning to Claude Code, including a context block to paste when starting the session:
```
I'm building a Hazelcast-based event sourcing microservices framework.

Project location: hazelcast-microservices-framework/
Current state: Design documents complete, ready for implementation

Key decisions:
- Clean reimplementation (no existing code to port)
- Spring Boot 3.2.x + Hazelcast 5.6.0 Community Edition
- Package: com.theyawns.framework.*
- Three services: Account, Inventory, Order (eCommerce domain)
- Event sourcing with Hazelcast Jet pipeline
- REST APIs only

Implementation plan: docs/implementation/phase1-implementation-plan.md

Starting with Day 1: Maven project setup + core abstractions

Please read the implementation plan and let's begin.
```

The whole point of the “design first” approach: you’re not asking the AI to guess at your architecture. You’re handing it a blueprint. The more detailed the blueprint, the less time you spend arguing about load-bearing walls later.

Documents 7-9: Claude Code Configuration

Before making the jump, I asked Claude about setup suggestions for Claude Code. This produced three more documents:

CLAUDE.md (originally called .clinerules — I’m still not sure where that name came from) is the main configuration file that Claude Code reads automatically. It defines code standards, patterns, pitfalls to avoid, and documentation requirements. This file evolved a lot over the course of the project; looking at the commit history gives a good sense of how the “rules” grew and adapted as we ran into new situations. (More on that in a future post — it turned out to be one of the more interesting aspects of the whole process.)

claude-code-agents.md defined eight specialized agent personas — Framework Developer, Service Developer, Test Writer, Documentation Writer, Pipeline Specialist, and others — each with specific rules, code patterns, and checklists. The idea was to switch between personas depending on the task at hand (e.g., “Switch to Test Writer agent. Write comprehensive tests for EventSourcingController.”). Whether this actually helped or was just a placebo is something I’m still not sure about, honestly.

A docs organization guide rounded out the set, providing a recommended directory structure for keeping all the documentation organized as the project grew.

What Came Next

The resulting project grew well beyond the original three-week Phase 1 plan. At 150 commits, it now includes four microservices (Payment was added for saga demonstrations), an API Gateway, an MCP server for AI integration, choreographed and orchestrated saga patterns, PostgreSQL persistence, Grafana dashboards, and more. The three-week plan took considerably longer than three weeks. So it goes.

But all of that implementation work — and the interesting stories about how human-AI collaboration played out during coding — is material for future posts.

What I’d Do Differently (And What I’d Do Again)

If you’re thinking about using AI for a non-trivial coding project, here’s what I took away from the design phase.

Use the right tool for each phase. The conversational interface is great for the messy, exploratory work of figuring out what you’re actually building. Claude Code is great for building it.

Iterate on design before you write code. We went through multiple rounds of revision on the requirements and architecture documents. Each round caught issues or surfaced priorities (like Event Sourcing belonging in Phase 1) that would have been much more expensive to discover during implementation. Measure twice, cut once. The carpenter’s rule exists for a reason.

Bring your domain knowledge — and don’t be shy about pushing back. Claude made strong default recommendations, but the most valuable moments came when I disagreed based on my understanding of Hazelcast and the architecture I wanted. The AI is a powerful collaborator, but it doesn’t know what you know. If something feels wrong, say so. That’s where the real value of the collaboration happens.

And document everything. I mean it. The design documents weren’t just planning artifacts — they became living reference material that Claude Code used throughout implementation. The CLAUDE.md file in particular became a continuously evolving guide that shaped code quality across the entire project. Every hour spent on documentation saved multiples in “no, that’s not what I meant” corrections later. I’ve never been great about documentation discipline, so having an AI that actually reads and follows the docs was a surprisingly effective motivator to keep them current.

The Hazelcast Microservices Framework is open source under the Apache 2.0 license. You can find it at github.com/myawnhc/hazelcast-microservices-framework.

Next up: what happened when we actually started coding. Spoiler: the plan did not survive intact.

March 6, 2026

Hazelcast Microservices Framework: Event Sourcing Demo

How a side project connecting Event Sourcing to Hazelcast sat unfinished for years — and why I decided to bring it back with an AI collaborator.

In my previous post, I shared some of my thinking about Event-Driven Microservices — the coupling problems, the mental shift toward thinking in events, and the patterns (Event Sourcing, CQRS, materialized views) that make it all work. That post was conceptual. This one is personal.

I’ve been playing around with design concepts in this area for some time. While I was an employee of Hazelcast, I frequently worked with customers and prospects to show how Hazelcast Jet — an event stream processing engine built into the Hazelcast platform — could be used to build event processing solutions that would scale while continuing to provide low latency. These conversations were always framed around stream processing, though. Even when the intended use case was around microservices, we didn’t explicitly get into the Event Sourcing pattern. As someone coming from a background that was database-centric, the concept of events as the source of truth was a bit much for me.

The Light Bulb Moment

It was a light bulb moment when I realized that Hazelcast Jet could fit naturally into an Event Sourcing architecture — and that Hazelcast IMDG (the in-memory data grid, or caching layer) could concurrently maintain materialized views representing the current state of domain objects.

Think about it: Event Sourcing needs an event log and a processing pipeline. Hazelcast Jet is a processing pipeline. CQRS needs a fast read-side store that’s kept in sync with the event stream. Hazelcast IMDG is a fast read-side store. Event Sourcing + CQRS maps beautifully onto Jet + IMDG (even though that acronym is officially retired — it’s all just “Hazelcast” now).

And from there, I really wanted to demonstrate this. The original Microservices Framework project began.

Version 1: The Proof of Concept

The first version was focused on proving the core idea worked. Could I wire up a Hazelcast Jet pipeline to process domain events, persist them to an event store, and update materialized views — all in a way that was generic enough to work across different services?

The answer was yes. The central pattern that emerged was straightforward: a service’s handleEvent() method writes incoming events to a PendingEvents map, which triggers a Jet pipeline that persists events to the EventStore, updates materialized views, and publishes to an event bus for other services to consume. It worked, and it was fast.
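The shape of that flow is easier to see in code. The sketch below uses plain Java collections as stand-ins for the Hazelcast pieces: a queue for the PendingEvents map, a list for the EventStore, a map for the materialized view, and a list of callbacks for the event bus. All names are hypothetical, and it runs synchronously, whereas the real Jet pipeline processes events asynchronously.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

final class EventFlowSketch {
    // Hypothetical event shape: the entity it targets and the state it carries.
    static final class Event {
        final String entityKey;
        final String newState;

        Event(String entityKey, String newState) {
            this.entityKey = entityKey;
            this.newState = newState;
        }
    }

    private final Queue<Event> pendingEvents = new ArrayDeque<>();          // stand-in for PendingEvents
    private final List<Event> eventStore = new ArrayList<>();               // append-only log
    private final Map<String, String> materializedView = new HashMap<>();   // current state by key
    private final List<Consumer<Event>> subscribers = new ArrayList<>();    // stand-in for the event bus

    void subscribe(Consumer<Event> subscriber) {
        subscribers.add(subscriber);
    }

    // The service only records the event; the pipeline does the rest.
    void handleEvent(Event event) {
        pendingEvents.add(event);
        drainPipeline();
    }

    // Persist first, then update the view, then publish to other services.
    private void drainPipeline() {
        Event event;
        while ((event = pendingEvents.poll()) != null) {
            eventStore.add(event);
            materializedView.put(event.entityKey, event.newState);
            for (Consumer<Event> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    int storedEventCount() {
        return eventStore.size();
    }

    String currentState(String entityKey) {
        return materializedView.get(entityKey);
    }
}
```

The ordering inside drainPipeline is the part that carries over to the real thing: the event log is written before the view changes, so the view can always be rebuilt from the log.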

Now, the central components of the architecture — the domain object, event class, controller, and pipeline — have survived relatively intact through multiple iterations of the implementation. The bones were good. But a lot of the specific implementation choices I made around those bones haven’t aged all that well.

You know how it goes with side projects. Technical debt accumulates quietly, one “I’ll fix this later” at a time, until you’re looking at a codebase where you know you’d make different choices if you were starting over — but the sunk cost of time already invested keeps you from actually doing it. It’s the software equivalent of a kitchen renovation where you keep patching the old cabinets because ripping them out feels like too big a project for a weekend.

That version of the framework is still hanging around on GitHub, although I decided not to link to it here as I may take it down at any time. (Upcoming posts will link to the improved version, so embedding links to the original will inevitably lead to someone grabbing the wrong one.)

I got it to a working state, but there was a long list of things I wanted to add. Saga patterns for coordinating multi-service transactions. Observability dashboards. Comprehensive tests. Documentation that went beyond “read the code.” Each of these was a meaningful chunk of work, and progress slowed to a crawl.

The Stall

Let’s be honest about what happened: the project stalled. It wasn’t dramatic, and it was never really abandoned. It just… stopped moving. Every few months, when I had some extra time, I’d open the codebase and make a few minor, inconsequential changes while thinking about the more ambitious refactorings and features that I’d get to when time permitted.

If you’ve ever maintained a passion project alongside a day job, you know this feeling. The ideas don’t go away — they sit in the back of your mind, periodically surfacing with a pang of “I should really get back to that.” But the activation energy to restart is high, especially when the next step isn’t a fun new feature but the grind of scaffolding, configuration, and test coverage. So you close the laptop and tell yourself next month will be different. (It won’t be.)

Enter AI-Assisted Development

In early 2025, I started using Claude for various coding tasks and was genuinely surprised by the results. This wasn’t autocomplete on steroids — I could describe an architectural pattern and get back code that understood the why, not just the what. I could say “this needs to work like an event journal with replay capability” and get something that actually accounted for ordering guarantees and idempotency.

That’s when the thought crystallized: what if I could use this to break through the stall?

Here’s the thing — the stuff that had been blocking me wasn’t the hard design work. I knew what the architecture should look like. The bottleneck was the sheer volume of implementation grind: scaffolding new services, writing comprehensive tests, wiring up Docker configurations, producing documentation. Exactly the kind of work where you need focused hours, and a side project never has enough of those.

Now, I want to be clear about what I mean here, because “AI wrote my code” carries a lot of baggage. This wasn’t about handing off the project and checking back in when it was done. It was about having a collaborator who could take high-level design direction and turn it into working code at a pace that made the project viable again. I’d provide the domain expertise, the architectural decisions, and the quality bar. The AI would provide the throughput.

Making the Decision

I decided to move forward with a clean reimplementation rather than trying to evolve the existing codebase. The core patterns from the original work — the Jet pipeline architecture, the event store design, the materialized view update strategy — were proven and would carry forward. But the project structure, package naming, dependency versions, and framework abstractions would start fresh. Sometimes the best way to fix a kitchen is to actually rip out the cabinets.

The plan was to use Claude’s desktop interface for iterative design discussions (requirements, architecture, implementation planning) and then hand off to Claude Code for the actual coding. Design first, then build — with comprehensive documentation at every step so the AI would have rich context to work from.

What happened next — the design phase, the handoff to Claude Code, and the surprises along the way — is the subject of the next post.

Code: github.com/myawnhc/hazelcast-microservices-framework — clone it, docker-compose up, and the framework boots locally with sample data.

February 26, 2026

Tag: Claude Code
