A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series
AI has been instrumental in bringing this project to fruition — I’m not making any secret of that. The first three posts in this series describe work that was largely pre-existing demo code: domain objects, the Jet pipeline, the materialized view machinery. Claude polished what was already there and helped me write about it. Honest work, but mostly cleanup.
The saga post (post 4) marked a shift — that’s where the demo’s functionality moved into genuinely new territory. And because Hazelcast had recently added a VectorCollection data structure and vector search capability — still in beta at the time — I was eager to incorporate it. So I asked Claude to design and implement something. I should have kept a close eye at every stage; instead I took more of an “I’ll review everything when you’re done” approach.
I was in for a surprise.
What came back was a working vector search implementation. What did not come back was anything built on Hazelcast’s VectorCollection. Claude had built one from scratch — an IMap<String, float[]> for the embeddings, brute-force cosine similarity at query time. No HNSW indexing, no clever data structure, just compute the distance to every vector and sort the results. It worked. The “similar products” endpoint returned plausibly similar products.
This is exactly the thing creating so much fear and doomsaying around AI in the industry. If a coding assistant can reproduce the functionality of an Enterprise software feature — Enterprise edition, additional license cost — in a few hours, is all enterprise software an endangered species?
Not quite. Brute-force cosine similarity is O(n) per query — fine for a demo catalog, fine for a small product line, but not the same animal as Hazelcast’s Enterprise VectorCollection, which uses HNSW indexing to stay sub-millisecond at millions of vectors. That’s real engineering, and it took the Hazelcast team a lot longer than a few hours.
What’s more interesting is that I ended up with both. The accidental implementation became the Community Edition fallback in the framework. The Enterprise implementation took over once I corrected course and built what I’d originally asked for. So the framework now has a VectorStoreService interface with two backends — Enterprise gets HNSW, Community gets brute force, and both work. The Community story is no longer “vector search doesn’t work without a license”; it’s “vector search works fine for modest workloads without a license, and scales seriously if you upgrade.”
I’m not sure I’d have ended up there if Claude had built what I asked for the first time.
Part 5 in the “Building Event-Driven Microservices with Hazelcast” series
So far we’ve built an event sourcing framework, a Jet pipeline, materialized views, and a saga pattern for distributed transactions. All of that gives us a solid eCommerce backend where orders flow through services, stock gets reserved, payments get processed, and everything recovers gracefully when something goes wrong.
Now we’re going to add something different: “Find me products similar to this one.”
You’ve seen this everywhere. Netflix’s “Because you watched…” Spotify’s discovery playlists. Amazon’s “Customers who bought this also bought…” These features run on vector embeddings — numerical representations of items positioned in high-dimensional space so that similar items cluster together. It sounds exotic, but by the end of this post we’ll have it working in our framework with a real embedding model running locally, no API keys required.
Why Not Just Use Full-Text Search?
If you’ve been doing this for a while, your reflex for “find similar products” is probably full-text search. Elasticsearch, Solr, maybe Postgres full-text indexes. And those are genuinely good tools for what they do — if someone types “gaming laptop,” full-text search finds documents containing the words “gaming” and “laptop.”
But try searching for “portable computer for games.” Or “high-performance notebook for esports.” Semantically identical. Zero shared keywords. Full-text search won’t connect them because it’s matching tokens, not meaning.
Embedding-based similarity works at a different level entirely. A trained model — we’re using all-MiniLM-L6-v2 — has learned from millions of text pairs that “gaming laptop” and “portable computer for games” mean the same thing. It places them near each other in vector space regardless of whether the words overlap. The model doesn’t care about your vocabulary. It cares about your intent.
In production you’d probably combine both approaches: full-text for keyword lookups and structured queries, vector similarity for the “more like this” discovery path. But for recommendations and product discovery, embeddings are the right tool.
What Are Vector Embeddings?
A vector embedding is a fixed-size array of floats that captures an item’s semantic characteristics. Items with similar meaning end up with vectors pointing in similar directions:
“Gaming Laptop” and “Gaming Desktop” are nearby in 384-dimensional space. “Running Shoes” is off in a different neighborhood. You measure similarity by computing the cosine of the angle between two vectors — vectors pointing the same direction score close to 1.0, perpendicular vectors score 0, opposite vectors score -1.0.
Making Embeddings Pluggable
We don’t want the framework married to a specific embedding model. Maybe you’re fine with the default local model. Maybe you need OpenAI’s embeddings for higher accuracy on your domain. Maybe you’ve trained your own. So embedding generation is behind an interface:
public interface EmbeddingProvider {
float[] embed(String text);
int getDimension();
String getModelName();
}
embed() takes text, returns a vector. getDimension() is there so callers can verify compatibility with the vector store’s configured dimension — if you swap models and forget to update the config, you want a clear error, not a silent data corruption. getModelName() is just for logging.
The Default Model
Out of the box, the framework uses LangChain4j’s all-MiniLM-L6-v2. It’s a sentence transformer that runs locally via ONNX Runtime — no API key, no external service, no per-call cost. It produces 384-dimension vectors and captures genuine semantic similarity.
public class LangChain4jEmbeddingProvider implements EmbeddingProvider {
private final AllMiniLmL6V2EmbeddingModel model;
public LangChain4jEmbeddingProvider() {
this.model = new AllMiniLmL6V2EmbeddingModel();
}
@Override
public float[] embed(final String text) {
return model.embed(text).content().vector();
}
@Override
public int getDimension() {
return 384;
}
@Override
public String getModelName() {
return "all-MiniLM-L6-v2";
}
}
The ONNX runtime takes about a second to load on first call, then it’s thread-safe and fast. The auto-configuration creates it as a @ConditionalOnMissingBean — define your own EmbeddingProvider bean and the default steps aside.
I want to be clear: this is a real model, not a demo placeholder. “Gaming Laptop” and “Portable Computer for Games” genuinely show up as similar even though they share almost no words.
HNSW: Searching Vectors Without Scanning Everything
Once you have vectors, you need to search them. The brute-force approach compares your query vector against every stored vector — O(n) per query. Fine for hundreds of products. Not fine for a million.
HNSW (Hierarchical Navigable Small World) is the standard answer. It builds a multi-layer graph over your vector space — think of it like a skip list but for geometric proximity. The top layers have sparse, long-range connections for coarse navigation. The bottom layers have dense, short-range connections for precision. You search by starting at the top, greedily navigating toward the query vector, then dropping down to finer layers. The result is O(log n) search with high recall.
There are three tuning knobs:
Parameter
Controls
Default
maxDegree (M)
Max edges per graph node
16
efConstruction
Beam width during index build — higher means better recall, slower build
200
metric
Distance function: COSINE, DOT, or EUCLIDEAN
COSINE
For a product catalog, the defaults are fine. You’d tune these if you were indexing millions of items and needed to trade off recall against memory or build time.
Storing and Searching Vectors
The vector store is exposed through VectorStoreService:
public interface VectorStoreService {
void storeEmbedding(String id, float[] embedding, Map<String, Object> metadata);
List<SimilarityResult> findSimilar(float[] queryVector, int limit);
List<SimilarityResult> findSimilarById(String id, int limit);
boolean isAvailable();
String getImplementationType();
}
Notice it takes float[]. The vector store doesn’t know or care how the embeddings were generated — the EmbeddingProvider produces vectors, the VectorStoreService stores and searches them. Two concerns, cleanly separated.
SimilarityResult is just a record: (String id, float score, Map<String, Object> metadata).
The Enterprise Path: VectorCollection
Hazelcast Enterprise has a native VectorCollection data structure with built-in HNSW indexing. Our HazelcastVectorStoreService wraps it:
And searching is a single async call to the HNSW index:
@Override
public List<SimilarityResult> findSimilar(float[] queryVector, int limit) {
SearchResults<String, String> searchResults = collection.searchAsync(
VectorValues.of(indexName, queryVector),
SearchOptions.builder()
.limit(limit)
.includeValue()
.build()
).toCompletableFuture().join();
List<SimilarityResult> results = new ArrayList<>();
for (SearchResult<String, String> hit : searchResults) {
Map<String, Object> metadata = jsonToMetadata(hit.getValue());
results.add(new SimilarityResult(hit.getKey(), hit.getScore(), metadata));
}
return results;
}
For comparison — brute-force IMap scans are O(n) per query; HNSW is O(log n). For 1,000 products the difference is negligible. For 1,000,000, it’s the difference between usable and not. That same O(n) IMap scan is exactly what the Community fallback uses, which we’ll look at next.
The Community Path: Brute Force, but It Works
Not everyone has Hazelcast Enterprise. The Community fallback is a SimpleVectorStoreService that stores embeddings in an ordinary Hazelcast IMap and answers queries with a brute-force O(n) cosine scan over every stored vector. The EmbeddingProvider still runs — it’s local ONNX, no license needed — and now the vectors it produces actually get stored and searched. It reports isAvailable() = true, so the similarity endpoint returns real results. It’s less scalable than the Enterprise HNSW path — that O(n) scan grows linearly with the catalog — but for development, testing, and small-to-moderate catalogs it’s fully operational.
How It All Wires Together
The module layout keeps the Enterprise dependency isolated:
The Enterprise auto-configuration uses @AutoConfigureBefore so it registers first. If it creates a HazelcastVectorStoreService bean, the core @ConditionalOnMissingBean sees it and skips the Community fallback. If the Enterprise module isn’t on the classpath — which is the default — the SimpleVectorStoreService takes over.
@Configuration
@EnableConfigurationProperties(VectorStoreProperties.class)
@ConditionalOnBean(HazelcastInstance.class)
public class VectorStoreAutoConfiguration {
@Bean
@ConditionalOnMissingBean(EmbeddingProvider.class)
public EmbeddingProvider embeddingProvider() {
return new LangChain4jEmbeddingProvider();
}
@Bean
@ConditionalOnMissingBean(VectorStoreService.class)
public VectorStoreService vectorStoreService(HazelcastInstance hazelcast,
VectorStoreProperties properties) {
return new SimpleVectorStoreService(hazelcast, properties);
}
}
Build with mvn clean install for Community (the default). Build with mvn clean install -Penterprise to include the Enterprise module. The runtime figures out the rest.
This is the same edition-aware pattern we use for other Enterprise-only features — CP Subsystem, HD Memory, TLS. Define the interface in framework-core, put the Enterprise implementation in framework-enterprise, let Spring’s auto-configuration ordering handle the selection. Community Edition always works. Enterprise features activate when you add the module and license.
From Product Creation to Searchable Embedding
When a product is created, the InventoryService builds a text representation from its name, description, and category, then asks the EmbeddingProvider to turn that into a vector:
private void storeProductEmbedding(final Product product) {
StringBuilder text = new StringBuilder();
text.append(product.getName());
if (product.getDescription() != null) {
text.append(" ").append(product.getDescription());
}
if (product.getCategory() != null) {
text.append(" ").append(product.getCategory());
}
float[] embedding = embeddingProvider.embed(text.toString());
Map<String, Object> metadata = new HashMap<>();
metadata.put("name", product.getName());
metadata.put("category", product.getCategory());
vectorStoreService.storeEmbedding(product.getProductId(), embedding, metadata);
}
The whole thing is wrapped in a try/catch as a best-effort operation. On either edition the embedding gets stored — in the Enterprise VectorCollection or the Community IMap — and if the vector store somehow isn’t available, the failure is swallowed so product creation still succeeds. The similarity feature is additive, not load-bearing.
The REST Endpoint
The Inventory Service exposes a similarity search endpoint:
@GetMapping("/{productId}/similar")
public ResponseEntity<SimilarProductsResponse> findSimilarProducts(
@PathVariable String productId,
@RequestParam(defaultValue = "5") int limit) {
if (!productService.productExists(productId)) {
return ResponseEntity.notFound().build();
}
if (!vectorStoreService.isAvailable()) {
return ResponseEntity.ok(new SimilarProductsResponse(
productId, false,
vectorStoreService.getImplementationType(),
"Vector similarity search is not available.",
List.of()
));
}
List<SimilarityResult> results = vectorStoreService.findSimilarById(productId, limit);
List<ProductDTO> similarProducts = new ArrayList<>();
for (SimilarityResult result : results) {
productService.getProduct(result.id())
.ifPresent(product -> similarProducts.add(product.toDTO()));
}
return ResponseEntity.ok(new SimilarProductsResponse(
productId, true,
vectorStoreService.getImplementationType(),
"Found " + similarProducts.size() + " similar products",
similarProducts
));
}
The response shape is the same regardless of edition — the client gets vectorStoreAvailable: true/false and the getImplementationType() string so it knows which backend answered. Both editions return real similar products with similarity scores; the difference is underneath — Enterprise serves them from an HNSW index in O(log n), while Community runs the brute-force O(n) IMap scan. Same results, different scaling characteristics.
Configuration
All the vector store parameters live under framework.vectorstore in your application YAML:
framework:
vectorstore:
collection-name: product-vectors
dimension: 384 # Must match EmbeddingProvider.getDimension()
max-connections: 16 # HNSW maxDegree (M)
ef-construction: 200 # HNSW build beam width
metric: COSINE # COSINE, DOT, or EUCLIDEAN
index-name: default # HNSW index name
The dimension has to match whatever your EmbeddingProvider produces. The default (384) matches all-MiniLM-L6-v2. If you swap in OpenAI’s text-embedding-3-small (1536 dimensions), update this property or you’ll get index errors that are confusing to debug.
Trying It Out
With the Docker stack running:
# Load sample products (creates embeddings automatically)
./scripts/load-sample-data.sh
# Find products similar to a known product
curl http://localhost:8082/api/products/<product-id>/similar?limit=5
# Or run the demo scenario
./scripts/demo-scenarios.sh 4
Demo scenario 4 detects the edition, looks up a product, calls the similarity endpoint, and displays results with appropriate messaging for either edition.
What’s Next
Bring your own model. The EmbeddingProvider interface makes swapping models easy. Define a bean, return vectors, done:
@Bean
public EmbeddingProvider embeddingProvider() {
return new EmbeddingProvider() {
private final OpenAiEmbeddingModel model = /* your config */;
@Override
public float[] embed(String text) {
return model.embed(text).content().vector();
}
@Override
public int getDimension() { return 1536; }
@Override
public String getModelName() { return "text-embedding-3-small"; }
};
}
OpenAI, Cohere, any Sentence-Transformer via LangChain4j’s ONNX integration — just update framework.vectorstore.dimension to match.
Hybrid search is the obvious next step: “find products similar to this laptop, but only in Electronics and under $1000.” That combines a VectorCollection similarity search with IMap predicate filtering. It’s also exactly the kind of natural-language query that works well with AI-driven orchestration — an LLM can decompose that request into a similarity lookup plus attribute filters, call the right APIs, and merge the results. We’ll get into that in an upcoming post on the Model Context Protocol (MCP), which gives AI models structured access to our microservices.
Multi-modal search is possible too. Hazelcast’s VectorCollection supports multiple named indexes on a single collection — one for text embeddings, another for image embeddings. Same data structure, different similarity dimensions.
Next up: AI-Powered Microservices with the Model Context Protocol
A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series
“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” — Mark Twain
Measurement is better than guessing. Who knew?
When the saga implementation was first finished and we ran through the test scenarios for the first time, there was a high incidence of saga timeouts if we ran for more than a few minutes. (5 minutes was great; 30 minutes was ugly.)
I didn’t ask Claude to investigate, or do any analysis of my own, because I had a pretty good suspicion what was going on. Everything was running on a single laptop — 16GB of memory split across a 3-node Hazelcast cluster, 4 services each running an embedded Hazelcast node, Docker itself, and I really hadn’t bothered to shut off my normal desktop workload. Web browser, email, whatever else. I figured I’d maxed the poor thing out and probably wouldn’t get a clean timeout-free run until I deployed to a multi-node cluster in the cloud.
That didn’t happen for some time. I’m cheap, and I wasn’t going to pay for cloud resources until I had a full-blown demo ready to go. When the day came — much later in the story if I was telling it chronologically — I saw the same pattern of timeouts. Turns out, it was never thread starvation or lack of resources. It was a combination of things, and none of them were what I’d assumed.
When faced with this reality, I asked Claude to troubleshoot the issue, and this is one of the times I was most impressed with how Claude approached a problem compared to how I would have.
In most debugging scenarios, I look only until I find the first reasonable suspect. Why keep looking if you’ve already found what you’re looking for? Fix, rebuild, retest, and on a good day, that’s the end of it. On a bad day, you’re still looking at the same issue, so you start hunting for suspect #2. Lather, rinse, repeat.
Claude came back with four identified problems. The main one was subtle: we generated product data up front and gave each item a reasonable starting stock. As the demo ran, orders depleted the stock, and eventually the inventory service started throwing InsufficientStockException — correct behavior, you can’t sell what you don’t have. But the circuit breaker we’d added for resilience was treating that business error the same as an infrastructure failure. Enough “failures” in the sliding window and the circuit breaker tripped open, rejecting all orders — including ones for products that still had stock. Sagas piled up with nowhere to go, the timeout detector found hundreds of them every cycle, and the system drowned in compensation events. At the peak: 64,000 timeouts from 53,000 sagas started.
The other three fixes addressed related gaps. Business failures like out-of-stock now trigger immediate saga compensation instead of waiting for the timeout detector to notice. A NonRetryableException marker interface tells the circuit breaker not to count deterministic business errors against the failure rate. And an automatic stock replenishment monitor keeps the demo in a steady state where orders can actually succeed for hours instead of wedging after the first few minutes.
I should have investigated the saga timeouts when they first appeared, rather than assuming the problem would magically go away with more hardware. And when I did get around to investigating, Claude’s approach of identifying all the contributing problems at once was considerably more effective than my usual one-suspect-at-a-time strategy.
Part 4 in the “Building Event-Driven Microservices with Hazelcast” series
Introduction
In the first three articles, we built an event sourcing framework with a Jet pipeline and materialized views. Each service is self-contained — it owns its events, its views, and its data. And one of the things that makes event-driven architecture powerful is that event publishing is fire-and-forget. The publisher doesn’t know or care who’s receiving the messages, or whether anyone acted on them properly.
But someone cares. Probably your boss.
That decoupling is a feature, not a bug — it’s exactly what gives you the independence to evolve services separately. If you wanted to add a loyalty program based on orders placed, you wouldn’t touch a single line of existing code. You’d build the new functionality and subscribe it to OrderCreated events. Done. No coordination, no release trains, no “can you add a field to your response for us.”
But that same independence becomes a problem when a business operation requires coordination. Placing an order in our eCommerce system spans four services:
Order Service creates the order
Inventory Service reserves stock
Payment Service charges the customer
Order Service confirms the order
If payment fails after stock is reserved, we need to release the stock. If the inventory service is down, we need to cancel the order. The fire-and-forget publisher doesn’t know any of this happened — and nobody else is keeping track unless we build something that does.
Now — a real eCommerce system wouldn’t cancel an order just because the inventory service hiccupped. You’d create a backorder, or queue the reservation for retry, or do whatever it takes to keep the customer’s money. Nobody in the business is going to say “yeah, let’s give up on that sale because a container restarted.” But we’re building a framework to demonstrate microservice patterns, not to compete with Shopify. The cancel-and-compensate flow shows sagas at their most illustrative: forward steps, failure detection, compensation, and state tracking across services. That’s the machinery we want to examine. So we went with the version that best shows how the pattern works, not the version that best sells laptops.
This is the distributed transaction problem, and sagas are the solution.
Why Sagas? Because the Alternative Is Worse.
We didn’t adopt event sourcing because it was fashionable. We adopted it because the alternative — mutable state scattered across services with no audit trail — was worse. Sagas are the same kind of choice.
When a business operation touches one database, you wrap it in a transaction. ACID gives you atomicity: either all the changes commit or none of them do. Every developer learns this in their first database course, and it works.
But our order fulfillment touches four services, each with its own data. A single database transaction can’t span them. So you reach for two-phase commit — 2PC — the textbook answer for distributed transactions. And that’s where the trouble starts.
2PC works by appointing a coordinator that asks each participant “can you commit?” and then, once everyone says yes, tells them all to go ahead. The problem is what happens between those two phases. Every participant is holding locks — on inventory rows, on payment records, on order state — while waiting for the coordinator’s decision. In a monolith, that wait is microseconds. Across a network, it’s milliseconds at best, and if the coordinator is slow or the network partitions, those locks can be held for seconds. Or longer.
Think about what that means for throughput. While one order is sitting in that limbo state between “prepare” and “commit,” no other transaction can touch those same inventory rows. At 50 orders per second, you’ve got 50 sets of distributed locks competing for the same resources, each one waiting on network round-trips to a coordinator that is itself a single point of failure. If the coordinator crashes mid-protocol, those locks stay held until someone intervenes manually. The participants are stuck — they’ve promised to commit but haven’t been told to, and they can’t unilaterally release the locks without risking inconsistency.
It’s not that 2PC is theoretically wrong. It works fine in environments where all participants are on the same local network, latency is sub-millisecond, and the coordinator never fails. That environment is called “a single database server.” Once you’ve distributed beyond that, you’re fighting physics.
Sagas take a fundamentally different approach. Instead of one big atomic transaction with distributed locks, you execute a sequence of local transactions — each one fast, each one touching only one service’s data, each one committing immediately. If step 3 fails, you don’t roll back steps 1 and 2 by releasing locks. You compensate — you issue new transactions that undo the business effect of the earlier steps. Reserve stock, then charge payment, then… payment fails? Issue a stock release. The saga doesn’t pretend the work never happened. It acknowledges what happened and fixes it forward.
The trade-off is that you give up the clean atomicity of “all or nothing” and accept eventual consistency — a window of time where stock is reserved but payment hasn’t been attempted yet, or payment has been charged but the order isn’t confirmed. In our system, that window is typically under a second. For the user, the order is “processing.” For the system, the saga is in flight. And if something fails, compensation events fire and the system converges to a consistent state — just not instantaneously.
Choreography and Orchestration
There are two ways to coordinate a saga. They represent genuinely different architectural philosophies, and neither is universally right.
Choreography: Services React to Events
In a choreographed saga, there’s no coordinator. Each service listens for events that concern it and reacts:
Order Service publishes OrderCreated -->
Inventory Service hears it, reserves stock, publishes StockReserved -->
Payment Service hears it, charges customer, publishes PaymentProcessed -->
Order Service hears it, confirms order
The flow emerges from the interactions between services. No single component knows the full sequence. Each service knows only “when I see event X, I do Y and publish Z.”
Orchestration: A Central Controller Directs the Flow
In an orchestrated saga, one component — the orchestrator — knows the whole sequence and directs each step:
The orchestrator is a state machine. It tracks where the saga is, what comes next, and what to undo if something fails.
Choosing Between Them
Choreography is the simpler choice when your saga is a linear chain — A triggers B triggers C triggers D — and the services are already publishing events for other reasons. If your system is event-driven (ours is), choreographed sagas are almost free. The services are already emitting events. The saga is just another consumer. No new infrastructure, no single point of failure, and each service stays fully independent.
But choreography gets painful as the flow gets complex. Say step 2 needs to branch — charge a credit card or apply store credit, depending on the payment method. Now you’re encoding conditional logic across multiple listeners that can’t see each other. If someone asks “what does this saga actually do, end to end?” the answer is “go read three listener classes in three different services and piece it together.” For a four-step linear chain, fine. For a ten-step flow with branches and conditional skips, that’s a mess.
That’s where orchestration earns its keep. The entire saga — forward steps, compensation steps, timeouts, retry logic — lives in a single definition file. You read it top to bottom. You can add per-step timeouts, per-step retries. The caller can wait for the result synchronously. The cost is coupling: the orchestrator needs to know about every service’s endpoint, and it becomes a component you have to keep running.
A rough rule of thumb:
Start with choreography when the saga is a linear chain with fewer than about five steps, the services already publish events, and you don’t need a synchronous response. Move to orchestration when you need branching or conditional logic, per-step timeout and retry control, a synchronous success/failure response for the caller, or when the saga has gotten complex enough that nobody can trace it across listeners without a whiteboard.
We started with choreography for order fulfillment because it’s a clean four-step chain and our entire architecture is already built around events. Later, when we needed synchronous responses for certain API paths and wanted per-step timeout control, we added orchestration as a second option. Both patterns coexist in the framework, running against the same saga state store. We’ll cover the orchestration implementation and how we run both side by side in a later article.
The Order Fulfillment Saga
Here’s the full flow for the choreographed version:
Event Context Propagation
Every saga event carries metadata that links the chain together:
// These fields are present on every saga event
String sagaId; // Links all events in one saga instance
String correlationId; // Links to the original API request
String sagaType; // "OrderFulfillment"
int stepNumber; // Which step (0, 1, 2, 3)
boolean isCompensating; // True for compensation events
The sagaId is generated when the order is created and propagated through every subsequent event. That’s how the Inventory Service knows which saga a StockReserved event belongs to, and how the Payment Service gets the payment details — amount, currency, method — from the StockReserved event’s context fields.
Implementing Saga Listeners
Each service has a saga listener that subscribes to relevant events via Hazelcast ITopic. Here’s the Payment Service’s:
The listener uses @Qualifier(“hazelcastClient”) — it connects to the shared cluster, not the embedded instance. That’s the dual-instance architecture from Part 2. Each listener is a plain @Component that Spring creates on startup; the ITopic subscription stays active for the life of the service. And the listener itself doesn’t contain business logic — it unpacks the event, delegates to the service layer, and gets out of the way.
Compensation: Undoing Work
When a step fails, we need to undo the work completed by previous steps. This is compensation — often described as the saga equivalent of a rollback, though it isn’t really a rollback at all. More on that in a minute.
Compensation Flow: Payment Fails
Compensation Flow: Stock Unavailable
StockReservationFailed event
|
'---> Order Service
'-- Cancels order (status: CANCELLED)
'-- No stock release needed (nothing was reserved)
Saga finalized as COMPENSATED
The Compensation Registry
A CompensationRegistry maps each forward event to its compensating event and responsible service:
@Configuration
public class ECommerceCompensationConfig {
@Bean
public CompensationRegistrar ecommerceCompensations(CompensationRegistry registry) {
// Step 0: OrderCreated --> compensate with OrderCancelled
registry.register("OrderCreated", "OrderCancelled", "order-service");
// Step 1: StockReserved --> compensate with StockReleased
registry.register("StockReserved", "StockReleased", "inventory-service");
// Step 2: PaymentProcessed --> compensate with PaymentRefunded
registry.register("PaymentProcessed", "PaymentRefunded", "payment-service");
// Step 3: OrderConfirmed --> no compensation (terminal success)
return () -> {};
}
}
This is the one place that documents the entire saga structure. Even though the execution is distributed across listeners, the registry makes the step-to-compensation mapping readable in a single file. That’s something people miss about choreography — it doesn’t mean the structure is undocumented, it means the structure is declared separately from the execution.
Saga State Tracking
The SagaState class is an immutable state machine that tracks each saga instance:
public class SagaState implements Serializable {
private final String sagaId;
private final String sagaType; // "OrderFulfillment"
private final String correlationId;
private final SagaStatus status; // STARTED --> IN_PROGRESS --> COMPLETED
private final List<SagaStepRecord> steps;
private final Instant startedAt;
private final Instant deadline; // Absolute timeout
}
Status transitions:
STARTED --> IN_PROGRESS --> COMPLETED (happy path)
STARTED --> IN_PROGRESS --> COMPENSATING --> COMPENSATED (failure + recovery)
STARTED --> IN_PROGRESS --> TIMED_OUT --> COMPENSATING --> COMPENSATED (timeout)
STARTED --> IN_PROGRESS --> FAILED (unrecoverable)
The state lives in a HazelcastSagaStateStore backed by an IMap on the shared cluster. Every service can read and update saga state because they all connect to the same shared cluster via the client instance.
Each step gets recorded:
public class SagaStepRecord implements Serializable {
private final int stepNumber;
private final String eventType;
private final StepStatus status; // PENDING, COMPLETED, FAILED, COMPENSATED
private final Instant completedAt;
private final String failureReason;
}
Timeout Handling
Sagas can get stuck. A service might be down, a message might be lost, or a listener might throw an unhandled exception. Without timeout detection, a stuck saga hangs forever — stock reserved but never charged, or charged but never confirmed.
The SagaTimeoutDetector is a scheduled service that runs in each service instance:
@Component
public class SagaTimeoutDetector {
@Scheduled(fixedDelayString = "${saga.timeout.check-interval:5000}")
public void detectTimeouts() {
List<SagaState> timedOut = sagaStateStore.findTimedOutSagas(
maxBatchSize
);
for (SagaState saga : timedOut) {
sagaStateStore.markTimedOut(saga.getSagaId());
if (autoCompensate) {
compensator.compensate(saga);
}
applicationEventPublisher.publishEvent(
new SagaTimedOutEvent(saga)
);
sagaMetrics.recordSagaTimedOut(saga.getSagaType());
}
}
}
Every five seconds, it looks for sagas that have blown past their deadline. When it finds one, it marks it timed out, triggers compensation for whatever steps already completed, publishes a Spring event for logging and alerting, and records the metric.
Timeout behavior is configurable per service:
saga:
timeout:
enabled: true
check-interval: 5000 # Check every 5 seconds
default-deadline: 30000 # 30-second default timeout
auto-compensate: true # Auto-trigger compensation
max-batch-size: 100 # Process up to 100 timeouts per check
saga-types:
OrderFulfillment: 60000 # 60 seconds for order fulfillment
The Order Fulfillment saga gets 60 seconds — longer than the 30-second default — because it spans four services and includes payment processing, which can be slow. Choosing the right timeout value is harder than it sounds, and we have quite a bit more to say about that in the next article.
Monitoring Sagas
The SagaMetrics class exposes counters and timers to Prometheus:
Metric
What It Tells You
saga.started
How many sagas are being created
saga.completed
How many succeed end-to-end
saga.compensated
How many required rollback
saga.timedout
How many exceeded their deadline
saga.duration (p95)
How long sagas are taking
saga.compensation.duration (p95)
How long compensation takes
The pre-provisioned Saga Dashboard in Grafana visualizes active saga counts, throughput rates by saga type, duration percentiles, success rates, timeout detection rates, and compensation breakdowns. Pre-configured alerts fire on high failure rates, timeouts, compensation failures, and success rate drops below 90%.
The dashboard is useful. But the metric that actually saved us during a sustained load incident was dead simple: saga.timedout plotted alongside saga.completed. When timeouts exceeded completions, we knew we had a systemic problem, not just a few slow sagas. More on that story next time.
Design Decisions and Trade-offs
Eventual Consistency
Sagas provide eventual consistency, not immediate consistency. Between the time stock is reserved and payment is processed, the system is in a “pending” state. That’s intentional — the trade-off buys us service independence and availability. For our use case, the consistency window is sub-second. For domains where that’s not acceptable (financial settlement, medical records), you’d need stronger guarantees.
Idempotency
Saga listeners must be idempotent. ITopic delivery is at-most-once in Hazelcast, but the timeout detector might trigger compensation for a saga that was actually processing — just slowly. If the original flow completes after compensation starts, the system has to handle the overlap gracefully.
SagaState handles this with updateOrAddStep, which replaces by step number. A step can’t be recorded twice.
Compensation Is Not Rollback
This is worth being explicit about, because the terminology invites confusion.
Compensation undoes the business effect, not the technical state. When we “refund a payment,” we don’t delete the payment record. We create a new PaymentRefundedEvent that changes the payment status to REFUNDED. The event history preserves the full story: payment was processed, then refunded. If an auditor comes asking, the trail is complete.
This fits naturally with event sourcing. The compensation is just another event. Nothing is erased, nothing is pretended away. The system records what actually happened, including the part where something went wrong and was corrected.
Summary
The saga pattern solves distributed transaction coordination without distributed locks:
Choreographed sagas fit naturally with event sourcing — each service reacts to events independently, no coordinator required. Orchestrated sagas earn their place when flows grow complex, need synchronous responses, or require per-step control. Compensation provides rollback-like semantics through new events, preserving the full history. Timeout detection catches stuck sagas. Saga state tracking gives you a complete audit trail.
The result is a system where four independent services coordinate a complex business transaction without any service knowing about the others — they only know about events. The cost is that you’re managing a distributed state machine instead of a database transaction. But the alternative — distributed locks held across network calls while a coordinator decides everyone’s fate — is worse.
Next up: Saga Timeouts — When Distributed Things Go Wrong
A short interstitial in the “Building Event-Driven Microservices with Hazelcast” series
Originally, the fourth post in this series was about observability — metrics, tracing, dashboards. Because the posts were being written alongside the code, it only covered what existed at the time: the event sourcing and saga metrics. As the framework grew, metrics got added in a lot of places — persistence, resilience features, a business-oriented dashboard.
A post on observability that covered a fraction of the actual observability felt incomplete, so it was expanded and moved later in the sequence. It’ll be around post 12.
But the blog post moved. The implementation didn’t. Observability isn’t something you bolt on after everything else is working — it’s how you know whether everything else is working. If you’re building a multi-phase project like this, instrument early. You’ll be glad you did when something breaks at 2 AM on phase 3 that would have been obvious from a dashboard you built in phase 1.
Part 3 in the “Building Event-Driven Microservices with Hazelcast” series
Introduction
We’ve established that events are the source of truth (Part 1) and built a Jet pipeline to process them (Part 2). But we’ve been dancing around a question that anyone who’s thought about event sourcing for more than five minutes will ask:
If everything is stored as a sequence of events, how do you actually query anything?
The Problem with Raw Events
The naive answer is “replay them.” Want the current state of customer 123? Walk through every event that ever happened to customer 123 and apply them in order:
This works. It’s also terrible. A customer with 10 events takes 10x longer to load than a brand-new customer with 1. A customer who’s been active for three years and has 1,000 events? Good luck serving that from an API endpoint with a latency SLA.
You need O(1) lookups, not O(n) replays on every GET request.
CQRS: The Pattern You Don’t Get to Opt Out Of
This is where event sourcing stops being a standalone pattern and starts demanding an architectural partner.
In a CRUD system, one table does everything. You INSERT into it, UPDATE it, and SELECT from it. The read model and the write model are the same thing — a row in a database. Simple.
Event sourcing breaks that. Your write model is an append-only event log. Great for durability, great for auditing, terrible for queries. You can’t SELECT a customer’s current address from a log of every change that’s ever happened to them. Not efficiently, anyway.
So you need a separate read model. A structure that’s optimized for the queries your application actually needs. This separation — commands go to the write model, queries go to the read model — is CQRS: Command Query Responsibility Segregation.
We didn’t choose CQRS because it appeared on a list of architecture buzzwords. Event sourcing forced us into it. Once your source of truth is an event log, you need a separate read path. There’s no way around it.
Now, CQRS is a general pattern. The read model could be a relational database you project events into for complex SQL queries. It could be Elasticsearch for full-text search, or a data warehouse for analytics. In our framework, we project into Hazelcast IMaps — materialized views that update in real-time as events flow through the Jet pipeline. The IMap gives us sub-millisecond lookups and keeps the read model co-located with the processing engine. No network hop, no separate database to manage.
Materialized views are our implementation of the read side of CQRS.
What is a Materialized View?
A materialized view is a pre-computed projection. Instead of computing state on every query, you compute it once — when the event is processed in Stage 4 of the pipeline — and store the result. Queries just look up the stored result. In event sourcing literature, the component responsible for maintaining these projections is sometimes called a projection engine. In our framework, that’s the Jet pipeline from Part 2.
The write path pays the cost once. Every subsequent read is free. The more reads you have per write — and most systems are read-heavy — the bigger the win.
The ViewStore
The HazelcastViewStore wraps an IMap:
public class HazelcastViewStore<K> {
private final IMap<K, GenericRecord> viewMap;
private final String viewName;
public HazelcastViewStore(HazelcastInstance hazelcast, String viewName) {
this.viewName = viewName + "_VIEW";
this.viewMap = hazelcast.getMap(this.viewName);
}
public Optional<GenericRecord> get(K key) {
return Optional.ofNullable(viewMap.get(key));
}
public void put(K key, GenericRecord value) {
viewMap.set(key, value);
}
public void remove(K key) {
viewMap.delete(key);
}
public GenericRecord executeOnKey(K key,
EntryProcessor<K, GenericRecord, GenericRecord> processor) {
return viewMap.executeOnKey(key, processor);
}
}
We use GenericRecord for the values — Hazelcast’s schema-flexible format that doesn’t require Java classes on the cluster. The executeOnKey method gives us atomic updates through Hazelcast’s EntryProcessor. As we discussed in Part 2, the EntryProcessor runs on the partition thread for that key — the same single-threaded-per-partition model that makes our pipeline ordering work. It reads current state, applies the change, and writes the result, all on one thread with no possibility of a concurrent modification. Atomic by architecture, not by locking.
The ViewUpdater
The ViewUpdater is the abstract class that each domain implements. It defines two things: how to extract the key from an event, and how to apply the event to produce new state.
public abstract class ViewUpdater<K> implements Serializable {
protected final transient HazelcastViewStore<K> viewStore;
protected abstract K extractKey(GenericRecord eventRecord);
protected abstract GenericRecord applyEvent(
GenericRecord eventRecord,
GenericRecord currentState);
public GenericRecord updateDirect(GenericRecord eventRecord) {
K key = extractKey(eventRecord);
if (key == null) {
logger.warn("Could not extract key from event");
return null;
}
GenericRecord currentState = viewStore.get(key).orElse(null);
GenericRecord updatedState = applyEvent(eventRecord, currentState);
if (updatedState != null) {
viewStore.put(key, updatedState);
} else if (currentState != null) {
viewStore.remove(key);
}
return updatedState;
}
}
Returning null from applyEvent is the deletion convention. The updater removes the entry from the view. Everything else either creates a new entry or updates an existing one.
The coalesce pattern in updateCustomer handles partial updates — if the event only includes a new email but not a new name, we keep the existing name. The updatedAt timestamp always advances, which is useful for staleness checks later.
Cross-Service Views: Denormalization Without the Guilt
This is where materialized views get genuinely powerful.
Consider displaying an order. The raw order data has a customer ID and product IDs, but no names, no emails, no SKUs. To show a complete order in the UI, you need data that lives in two other services.
The traditional approach:
Order order = orderRepository.findById(orderId);
Customer customer = accountService.getCustomer(order.getCustomerId()); // HTTP call
for (LineItem item : order.getItems()) {
Product product = inventoryService.getProduct(item.getProductId()); // HTTP call
item.setProductName(product.getName());
}
Three services involved at query time. If Account is slow, your order page is slow. If Inventory is down, your order page is broken. And you’ve just coupled three services together at runtime, which is exactly what microservices were supposed to prevent.
With event sourcing, you build an enriched view that bakes in the data from other services at write time:
public class EnrichedOrderViewUpdater extends ViewUpdater<String> {
private final IMap<String, GenericRecord> customerView;
private final IMap<String, GenericRecord> productView;
@Override
protected GenericRecord applyEvent(GenericRecord event, GenericRecord current) {
String eventType = getEventType(event);
if ("OrderCreated".equals(eventType)) {
return createEnrichedOrder(event);
}
return current;
}
private GenericRecord createEnrichedOrder(GenericRecord event) {
String customerId = event.getString("customerId");
// Look up customer from local view — no HTTP call
GenericRecord customer = customerView.get(customerId);
String customerName = customer != null ?
customer.getString("name") : "Unknown";
String customerEmail = customer != null ?
customer.getString("email") : "";
List<GenericRecord> enrichedItems = new ArrayList<>();
// ... iterate through items, look up products from productView
return GenericRecordBuilder.compact("EnrichedOrder")
.setString("orderId", event.getString("key"))
.setString("customerId", customerId)
.setString("customerName", customerName)
.setString("customerEmail", customerEmail)
.setArrayOfGenericRecord("lineItems",
enrichedItems.toArray(new GenericRecord[0]))
.setString("status", "PENDING")
.build();
}
}
The result is a single IMap entry with everything you need:
One lookup. Zero service calls. Works if Account and Inventory are both down for maintenance.
If you’re from a relational database background, denormalization feels wrong — it violates third normal form, it duplicates data, your DBA would glare at you. But in an event-sourced system, the events are the normalized source of truth. Views are disposable projections. Denormalize all you want. If Alice changes her name, the CustomerUpdated event flows through, and you can update the enriched order view to reflect it. Or not, depending on whether you care — the order was placed when she was “Alice Smith,” and that’s what the event says.
View Rebuilding
One of the benefits we mentioned in Part 1 — and it’s worth seeing in practice. Found a bug in your view update logic? Fix the code, clear the view, replay the events:
public <D extends DomainObject<K>, E extends DomainEvent<D, K>> long rebuild(
EventStore<D, K, E> eventStore) {
logger.info("Starting view rebuild for {} from {}",
viewStore.getViewName(), eventStore.getStoreName());
viewStore.clear();
AtomicLong count = new AtomicLong(0);
eventStore.replayAll(eventRecord -> {
updateDirect(eventRecord);
count.incrementAndGet();
});
logger.info("Rebuild complete. Processed {} events", count.get());
return count.get();
}
Same mechanism handles schema migrations, new view types (write a new ViewUpdater, replay existing events — instant backfill), and disaster recovery. With CRUD, corrupted data is permanently corrupted. With event sourcing, the correct data is always recoverable from the event stream.
View Patterns
Not every view is a 1:1 entity projection. Here are the patterns we use:
Entity View — one entry per domain object. Customer events produce one Customer view entry, keyed by customer ID. This is the default.
Lookup View — index by an alternate key. A Customer-by-Email view maps email → customerId, so you can find a customer by email without scanning. When a customer changes their email, the old entry gets deleted and a new one created.
Summary View — aggregate across entities. A Customer Order Summary view tracks customerId → { totalOrders, totalSpent }, incrementing on OrderCreated and decrementing on OrderCancelled.
Enriched View — denormalization across services, as we just covered. Order events plus customer and product data produce a self-contained order view.
Time-Series View — track changes over time. Daily inventory snapshots keyed by date + productId, updated by stock reservation and release events.
You can maintain as many views as you need from the same event stream. They’re independent — a bug in one doesn’t affect the others, and each can be rebuilt separately.
Handling View Staleness
Views are updated asynchronously by the pipeline. There’s a brief window — usually measured in milliseconds — where the view might not reflect the very latest event. Three ways to deal with it:
Accept it. For most reads, a few milliseconds of staleness is fine. Just read the view.
Customer customer = customerView.get(customerId);
Wait for it. When you need to read your own writes — like returning a newly created customer right after creating them — wait for the pipeline to complete:
CompletableFuture<EventCompletion> future = controller.handleEvent(event, correlationId);
future.join();
// Now the view is guaranteed to reflect this event
Customer customer = customerView.get(customerId);
Check it. Include a version or timestamp and verify the view is current enough:
long expectedVersion = event.getTimestamp().toEpochMilli();
Customer customer = customerView.get(customerId);
if (customer.getUpdatedAt() < expectedVersion) {
// View not yet updated — wait or return the event data directly
}
Our framework’s handleEvent returns a CompletableFuture specifically to make the “wait for it” path easy. Most API endpoints use it.
Performance
Materialized view reads are IMap lookups. On a single laptop (M4 MacBook Pro, 16GB, Docker Compose):
Operation
Latency
View read (local partition)
< 0.1ms
View read (remote partition)
< 0.5ms
View read (near cache)
< 0.01ms
View update (pipeline Stage 4)
< 1ms
For views that get read far more often than they’re written — which is most of them — near cache is worth enabling:
This caches view entries locally on the client side. Reads drop to microseconds. The TTL ensures stale entries get refreshed, and LRU eviction keeps memory bounded.
Predefined Queries vs. Ad-Hoc Access
Everything we’ve built so far serves predefined query patterns — get customer by ID, get enriched order, look up product stock. The services expose exactly the queries their API needs, and those queries are fast because the views are designed for them.
Sooner or later, though, someone from the business side is going to want more. “Show me all orders over $500 in the last 30 days.” “Which customers changed their address this quarter?” Hazelcast supports SQL queries against IMaps — through Management Center, the command-line client, or third-party tools like DBeaver. The capability is there.
But I wouldn’t be eager to have my production throughput impacted by someone’s exploratory query with a broad predicate scanning thousands of entries. An analyst’s ad-hoc join returning ten thousand rows is a very different workload than a targeted key lookup, and it can absolutely affect the latency your customers see.
If you want to give analysts this kind of access — and there are good reasons to — consider WAN replication (a Hazelcast Enterprise feature) to duplicate your IMap data into a secondary cluster dedicated to analytics. Analysts query to their heart’s content; production stays untouched. This is really just CQRS taken one step further. We already separated the write model (events) from the operational read model (views). WAN replication separates the operational read model from the analytical read model. Same principle, another level of isolation.
Practical Advice
Keep your views focused. A CustomerView for profile queries, a CustomerByEmailView for authentication, a CustomerOrderSummary for order history. Don’t build a CustomerEverythingView that tries to serve every possible query pattern — it’ll be expensive to update and awkward to query.
Document what events feed each view. When someone asks “why does the order view show the wrong customer name?” you want to be able to trace it: “The order view is updated by OrderCreated events, and it reads customer data from the Customer view at enrichment time. If the customer name changed after the order was created, the order view still shows the name at order time.”
Handle missing data gracefully. When the enriched order view looks up a customer and gets null — maybe the customer was deleted, maybe the customer event hasn’t propagated yet — don’t crash. Use a sensible default:
And think about rebuild cost before your event history gets large. Replaying a million events to rebuild a view takes time. Strategies include periodic snapshots (save the view state and replay only from the snapshot forward), incremental rebuilds (track the last processed event sequence), and parallel rebuilds for independent views.
That covers the read side of CQRS — how we turn an append-only event stream into fast, queryable state. The views are disposable, rebuildable, and independent. Denormalization is a feature, not a compromise. And the whole thing runs at IMap speed because the read model lives in the same platform as the processing engine.
Next we’ll move from a single service to many. When a business operation has to update state in two or three different services and any one of them can fail, two-phase commit is off the table and you need a different answer. That’s the saga pattern.
Next up: The Saga Pattern for Distributed Transactions
Part 2 in the “Building Event-Driven Microservices with Hazelcast” series
Introduction
In Part 1 we talked about event sourcing — what it is, why it matters, and how events flow through our framework. What we didn’t talk much about is the thing that actually makes the events flow: the Hazelcast Jet pipeline sitting in the middle of everything.
Jet is Hazelcast’s stream processing engine, and it’s doing all the heavy lifting here. It reads events as they arrive, pushes them through six processing stages — persist, update view, publish, and so on — and it does this with backpressure handling and sub-millisecond latency because everything stays in-memory. No external message broker. No separate stream processor. Jet is embedded right in Hazelcast, which means your event processing pipeline has direct, local access to the same IMaps that store your events and views.
Let’s walk through how it works.
The 6-Stage Pipeline
Six stages, each with one task. An event enters on the left and comes out the other side persisted, applied to a view, published to subscribers, and marked complete. If any stage fails, the context carries that information forward so downstream stages can react (usually by skipping).
Stage 1: Source — Reading from Event Journal
The pipeline reads events from the pending events IMap, but not by polling it. That would be terrible — you’d burn CPU checking for new entries, you’d introduce latency, and you’d probably miss things under load. Instead, we use the Event Journal.
The Event Journal is a ring buffer that Hazelcast maintains on each IMap. Every write to the map is recorded in the journal, and Jet can stream directly from it. You configure it on the map:
Config config = new Config();
config.getMapConfig("Customer_PENDING")
.getEventJournalConfig()
.setEnabled(true)
.setCapacity(10000);
START_FROM_OLDEST means after a restart the pipeline picks up from the beginning of the journal — no events lost. withIngestionTimestamps() stamps each event as it enters the pipeline, which we use later for latency tracking.
PartitionedSequenceKey: Why the Map Key Isn’t a Simple String
You probably noticed the key type is PartitionedSequenceKey<K> rather than, say, a customer ID. There’s a reason for that, and it took us a bit to figure out we needed it.
Jet guarantees that events within the same Hazelcast partition are processed in order. Events across different partitions? No ordering guarantee. They can arrive in any order.
That’s fine as long as all events for a given entity share the same key. But consider this: you create an account keyed by customer ID, then immediately deposit funds into it keyed by transaction ID. Those two keys might hash to different partitions. Jet could process the deposit before the account exists. The deposit fails. The customer calls support. Nobody’s having a good day.
PartitionedSequenceKey separates two concerns that are easy to conflate — the event’s unique identity and the partition it should land on:
public class PartitionedSequenceKey<K>
implements PartitionAware<K>, Comparable<PartitionedSequenceKey<K>>,
Serializable {
private final long sequence; // Globally unique FlakeId for ordering
private final K partitionKey; // Domain object ID for co-location
@Override
public K getPartitionKey() {
return partitionKey; // Hazelcast routes by this, not the full key
}
}
The sequence comes from Hazelcast’s FlakeIdGenerator — globally unique, monotonically increasing. The partitionKey is the domain object’s ID (customer ID, not transaction ID). Because the key implements PartitionAware, Hazelcast uses getPartitionKey() for partition assignment instead of hashing the whole composite key. So the account creation and the deposit both route to the customer’s partition, and Jet processes them in sequence order.
Why This Is Lock-Free
There’s something worth understanding about why partition-aware routing gives us ordered, atomic processing — and it’s not locking.
Hazelcast distributes data across 271 partitions (by default), and each partition is owned by exactly one thread — the partition thread. All operations on keys in that partition are executed by that single thread, sequentially. There’s no concurrent access to fight over, so there’s no need for locks, CAS operations, or synchronized blocks. Two events for customer-123 arrive a millisecond apart? They both route to the same partition, and the partition thread processes them one after the other. Ordering and atomicity come from the threading architecture itself.
This carries through to the materialized view layer too. When we call executeOnKey with an EntryProcessor in Part 3, that processor runs on the partition thread for that key. It reads the current state, applies the event, writes the new state — all in one shot, on one thread, with no possibility of a conflicting update sneaking in between. It’s atomic by construction, not by locking.
The practical payoff is concurrency without contention. Thousands of events per second can flow through the pipeline, and as long as they’re for different entities, they’re processed in parallel across different partition threads. Events for the same entity are serialized — but by the architecture, not by a bottleneck. You get high throughput and strict per-entity ordering at the same time, and you didn’t write a single line of synchronization code to get it.
Stage 2: Enrich — Adding Metadata
Before we do anything with the event, we stamp it with pipeline entry time and extract identification fields. This gives us latency tracking from the moment the event enters the pipeline, not just from when it was created.
(Note the static method — that’s not accidental. We’ll come back to why in the architecture section.)
The EventContext is the envelope that carries the event through all remaining stages:
private static class EventContext<K> implements Serializable {
final PartitionedSequenceKey<K> key;
final GenericRecord eventRecord;
final String eventType;
final String eventId;
final Instant pipelineEntryTime;
// Stage completion flags
boolean persisted;
boolean viewUpdated;
boolean published;
// Stage timing
Instant persistTime;
Instant viewUpdateTime;
Instant publishTime;
}
Each stage sets its own flag and timestamp. If persist fails, persisted stays false, and the view update stage sees that and skips. No event gets partially applied without the context knowing about it.
Stage 3: Persist — Writing to the Event Store
This is the durability step. Once an event is in the event store, it’s the permanent record of what happened. Everything else — views, notifications, completions — can be rebuilt from events. This one can’t.
Why the ServiceFactory indirection instead of just passing in an EventStore reference? Because Jet pipelines are distributed. The pipeline definition gets serialized and shipped to whatever nodes Jet decides to run it on. If your lambda captures a reference to a Spring-managed EventStore bean — which holds a HazelcastInstance, database connections, and who knows what else — that serialization is going to fail spectacularly at runtime.
ServiceFactory solves this: you give Jet a recipe for creating the service, and it creates an instance on each node where the pipeline actually runs.
public class EventStoreServiceCreator<D extends DomainObject<K>, K,
E extends DomainEvent<D, K>>
implements FunctionEx<ProcessorSupplier.Context, HazelcastEventStore<D, K, E>> {
private final String domainName;
@Override
public HazelcastEventStore<D, K, E> applyEx(ProcessorSupplier.Context ctx) {
HazelcastInstance hz = ctx.hazelcastInstance();
return new HazelcastEventStore<>(hz, domainName);
}
}
The creator itself is just a domain name string — easily serializable. The heavy objects get created on the target node using that node’s local HazelcastInstance.
Stage 4: Update View — Applying to Materialized View
With the event persisted, we update the materialized view. This is where the event’s apply logic runs — transforming the event into current state.
Returning null from applyEvent is the deletion convention — the updater removes the entry from the view. Everything else either creates or modifies the view record.
Stage 5: Publish — Notifying Other Services via ITopic
Once the view is updated, we publish the event so other services can react. This is where cross-service communication happens. Saga listeners subscribe to topics like “OrderCreated” or “StockReserved” and kick off their own workflows when those events arrive.
The TopicPublisher is thin — it resolves the ITopic by event type name and publishes:
public class EventTopicPublisherServiceCreator
implements FunctionEx<ProcessorSupplier.Context, TopicPublisher> {
@Override
public TopicPublisher applyEx(ProcessorSupplier.Context ctx) {
return new TopicPublisher(ctx.hazelcastInstance());
}
public static class TopicPublisher {
private final HazelcastInstance hazelcast;
public void publish(String topicName, GenericRecord record) {
ITopic<GenericRecord> topic = hazelcast.getTopic(topicName);
topic.publish(record);
}
}
}
Each event type gets its own ITopic. The Order service subscribes to StockReserved and PaymentProcessed; the Inventory service subscribes to OrderPlaced. Nobody hears events they can’t act on.
Stage 6: Complete — Signaling Completion (and Smuggling Out Metrics)
The last stage writes a completion record so the calling code knows the event made it through. But there’s a twist.
Jet pipelines run inside Hazelcast’s execution engine. They don’t have access to Spring’s MeterRegistry, or any external metrics system. The pipeline can time every stage internally — it does, that’s what all those Instant.now() calls are for — but it has no way to report those timings to Prometheus or Grafana directly.
So we cheat. The completion record carries the timestamps out:
The controller picks up the completion, extracts the timestamps, and reports them to Micrometer. The pipeline instruments itself; the controller publishes the metrics. It also gets the stage success flags — persisted, viewUpdated, published — so it knows whether the event made it through clean or had partial failures along the way.
The controller side looks like this:
public CompletableFuture<EventCompletion> handleEvent(DomainEvent event, UUID correlationId) {
pendingEventsMap.set(key, eventRecord);
return CompletableFuture.supplyAsync(() -> {
int maxWait = 5000;
int elapsed = 0;
while (elapsed < maxWait) {
GenericRecord completion = completionsMap.get(key);
if (completion != null) {
return EventCompletion.success(eventId);
}
Thread.sleep(10);
elapsed += 10;
}
throw new TimeoutException("Event processing timeout");
});
}
Write the event to the pending map, then poll the completions map until the pipeline writes the result. Five seconds is plenty — the pipeline usually finishes in under a millisecond.
Jet Pipeline Traps (and How We Avoid Them)
If you’re used to writing normal Java code, Jet pipelines will surprise you in a few unpleasant ways. These are the ones that cost us the most time.
The Serialization Trap
Every lambda in the pipeline must be serializable, because Jet ships the pipeline definition to whichever nodes it decides to run on. If your lambda captures this, and this has a HazelcastInstance field, or a Spring ApplicationContext, or anything else that doesn’t serialize — you get a NotSerializableException at runtime. Not at compile time. At runtime. In production, ideally on a Friday.
Three rules keep you out of trouble:
Use static methods as pipeline stages (they don’t capture this). Use ServiceFactory for anything that needs a Hazelcast instance or other heavy resource. And if you must capture a value, make sure it’s a final local variable that’s actually serializable.
// This will ruin your weekend
.map(event -> this.eventStore.append(event))
// This will not
.mapUsingService(eventStoreFactory, (store, event) -> store.append(event))
The Try-Catch That Catches Nothing
This one’s sneaky. Your instinct when writing the pipeline is to wrap the whole thing in a try-catch:
try {
Pipeline pipeline = Pipeline.create();
// ... define all six stages ...
hazelcast.getJet().newJob(pipeline, jobConfig);
} catch (Exception e) {
logger.error("Pipeline failed!", e); // This catches almost nothing useful
}
That try-catch protects you during pipeline assembly — if you misspell a map name or pass an invalid config, you’ll catch it. But assembly isn’t where things go wrong. Things go wrong during execution, and by that point your code is somewhere else entirely.
When you call newJob(), Jet takes your pipeline definition, converts each stage into a processor, and distributes those processors across the cluster. The actual event processing happens inside those distributed processors, on whichever nodes Jet assigned them to. You’re not just off the end of that try-catch block — you might not even be on the same machine.
That’s why every stage in our pipeline has its own try-catch inside the lambda:
The error handling has to live where the code actually runs — inside each processor, on whatever node Jet put it on. Downstream stages check the context flags before doing work. If persist failed, the view update skips. If the view update failed, publish skips. The completion record captures exactly what happened, so you can diagnose partial failures from the metrics.
Cooperative, Non-Cooperative, Shared, and Non-Shared
Jet services have two dimensions you need to get right, and they’re independent of each other.
Cooperative vs. non-cooperative is about blocking. Jet’s default threading model is cooperative — stages share a thread pool and are expected to yield quickly. If a stage does I/O (writes to the event store, makes a network call, publishes to ITopic), it blocks the thread. Mark it non-cooperative so Jet gives it a dedicated thread:
Forget this and your pipeline stalls under load. The cooperative thread pool gets monopolized by I/O-bound stages that never yield, and throughput craters.
Shared vs. non-shared is about thread safety. A shared service creates one instance per node and hands it to all processors on that node — multiple threads will call it concurrently. A non-shared service creates a separate instance for each processor, so no concurrent access.
Our EventStore and TopicPublisher are backed by Hazelcast data structures (IMap, ITopic), which are thread-safe, so they’re shared. But if you had a service that accumulated state in a local buffer, or maintained a non-thread-safe connection like a raw JDBC Connection, you’d mark it non-shared:
// Thread-safe service — one instance shared across processors on this node
ServiceFactories.sharedService(creator).toNonCooperative();
// NOT thread-safe — each processor gets its own instance
ServiceFactories.nonSharedService(creator).toNonCooperative();
Get the combination wrong and you get either unnecessary memory overhead (non-shared when shared would do) or, worse, data corruption from concurrent access to a non-thread-safe service (shared when it shouldn’t be). That second one is the kind of bug that shows up intermittently under load and makes you question your career choices.
Lifecycle Management
Starting and stopping the pipeline:
public class EventSourcingPipeline<D, K, E> {
private volatile Job pipelineJob;
public Job start() {
if (pipelineJob != null && !pipelineJob.isUserCancelled()) {
logger.warn("Pipeline already running");
return pipelineJob;
}
Pipeline pipeline = buildPipeline();
JobConfig jobConfig = new JobConfig()
.setName(domainName + "-EventSourcingPipeline");
pipelineJob = hazelcast.getJet().newJob(pipeline, jobConfig);
logger.info("Started pipeline for {}, jobId: {}",
domainName, pipelineJob.getId());
return pipelineJob;
}
public void stop() {
if (pipelineJob != null) {
pipelineJob.cancel();
logger.info("Stopped pipeline for {}", domainName);
}
}
}
Each domain (Customer, Order, Inventory) gets its own pipeline instance running as a Jet job. They’re independent — you can restart the Order pipeline without affecting Customers.
Performance
All of this runs on a single Apple MacBook Pro (M4, 16GB) with three Hazelcast instances in Docker Compose:
Metric
Value
Throughput
100,000+ events/sec
P50 Latency
< 0.3ms
P99 Latency
< 1ms
P99.9 Latency
< 5ms
That’s the materialized view layer — events flowing through the pipeline and out the other side. It’s fast because everything is in-memory, Jet parallelizes across partitions, the Event Journal provides non-blocking streaming, and views and events live on the same cluster (no network hop for the view update).
We verify it with a load test that submits 100K events and waits for all completions:
// Load test
@Test
void loadTest_100KEvents() {
int eventCount = 100_000;
List<CompletableFuture<?>> futures = new ArrayList<>();
long start = System.nanoTime();
for (int i = 0; i < eventCount; i++) {
CustomerCreatedEvent event = createTestEvent(i);
futures.add(controller.handleEvent(event, UUID.randomUUID()));
}
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
long duration = System.nanoTime() - start;
double tps = eventCount / (duration / 1_000_000_000.0);
System.out.printf("Processed %d events in %dms (%.0f TPS)%n",
eventCount, duration / 1_000_000, tps);
}
Monitoring
The pipeline metrics — events processed, processing time histograms, pending event counts — feed into Micrometer and from there to Prometheus and Grafana. We’ll dig into the observability setup in a later post, but here’s the shape of it:
public class PipelineMetrics {
private final Counter eventsProcessed;
private final Timer eventProcessingTime;
public PipelineMetrics(MeterRegistry registry, String domainName) {
this.eventsProcessed = Counter.builder("events.processed")
.tag("domain", domainName)
.register(registry);
this.eventProcessingTime = Timer.builder("events.processing.time")
.tag("domain", domainName)
.register(registry);
}
}
Remember how Stage 6 smuggles timing data out in the completion record? This is where it ends up — the controller reads the timestamps, calculates durations, and records them here.
That’s the pipeline. Six stages, each with one task, connected by an EventContext that carries the event and its processing history all the way through. The design decisions that matter most: PartitionedSequenceKey for ordering, ServiceFactory for distributed service access, static methods to dodge serialization traps, and a completion record that doubles as a metrics transport.
Next up we’ll look at what happens after Stage 4 in more detail — how materialized views are designed, queried, and rebuilt.
Part 1 in the “Building Event-Driven Microservices with Hazelcast” series
Introduction
Here’s something that should bother you more than it probably does: every time you run an UPDATE statement against a database, you’re destroying information.
UPDATE customers SET address = '456 Oak Ave' WHERE id = 'cust-123';
That customer used to live at 123 Main St. Now they don’t. And you have no idea that 123 Main St ever existed, because you just overwrote it. The old address is gone. The audit trail is gone. If someone asks “where did this customer live six months ago?” you shrug and check if maybe somebody logged it somewhere.
We’ve been building systems like this for decades, and honestly, it mostly works. Until it doesn’t. Until you need to debug a production issue and you can’t figure out what sequence of changes got the system into this state. Until compliance asks for a history of every change to customer PII. Until Service A goes down and takes Services B, C, and D with it because they all need to call A synchronously to get data they should already have.
Event sourcing flips this model. Instead of storing current state, you store the sequence of events that produced it. The current state becomes something you derive — a view that you can rebuild from events at any time.
In this post, we’ll look at how to implement event sourcing with Hazelcast, which turns out to be a remarkably good fit for this pattern. Fast writes for the event stream, real-time processing with Jet pipelines, and sub-millisecond reads from materialized views — it’s basically the whole event sourcing infrastructure in one platform.
What is Event Sourcing?
The core idea is simple enough to state in three lines:
Every state change is captured as an immutable event
Current state is computed by replaying those events
The complete history is preserved forever
That third point is the one that makes people nervous. “Forever? Really? Every event?” Yes. That’s the deal. And it turns out to be the source of most of the pattern’s power.
A Quick Example
Take that customer we just destroyed with an UPDATE. In event sourcing, there is no UPDATE. There are only events:
The current address is still 456 Oak Ave. But the old address is still there in Event 1. Nothing was overwritten. Nothing was lost. You can reconstruct the customer’s state at any point in time by replaying events up to that moment.
Why Event Sourcing?
I get it — the first time you encounter event sourcing, the reaction is usually something like “you want me to store every event forever instead of just updating a row? That sounds like a lot more work.” And honestly, it does feel unfamiliar, and unfamiliar feels like ick. You need a compelling reason to push past that.
There are several.
Microservices Need Decoupling, and Events Actually Deliver It
The whole pitch for microservices is services that can be developed, deployed, and scaled independently. Beautiful theory. In practice, the independence evaporates the moment one service makes a synchronous REST call to another to get data it needs.
Think about it. The Account service goes down — does the Order service go down with it? If Order is making HTTP calls to Account to look up customer names, then yes. Probably yes. You’ve built yourself a distributed monolith: all the operational complexity of microservices with none of the resilience benefits. Congratulations.
Event sourcing solves this at an architectural level, not with duct tape. When a customer is created, the Account service publishes a CustomerCreatedEvent. The Order service subscribes to those events and builds its own local materialized view of customer data. No HTTP call. No dependency on Account being up. No shared database.
If Account goes down for maintenance on a Tuesday afternoon, Order keeps running. It already has the customer data it needs. The services are actually decoupled — not just deployed to separate containers while secretly depending on each other for every request.
Events Capture What Actually Happened
Here’s a distinction that seems pedantic until it saves you during a production incident. A database transaction log records what changed:
UPDATE customers SET address = '456 Oak Ave' WHERE id = 'cust-123';
A domain event records what happened in business terms:
CustomerMovedEvent { customerId: 'cust-123', previousAddress: '123 Main St',
newAddress: '456 Oak Ave', reason: 'RELOCATION' }
The transaction log tells you a column changed. The domain event tells you a customer moved, where they moved from, and why. That business context gets captured once, at the moment it happens, and it’s preserved forever.
When someone asks “why did this order ship to the wrong address?” you can trace the exact sequence: here’s when the customer moved, here’s the order that was placed two hours before the address change propagated, here’s why the old address was used. With CRUD, you’d be spelunking through application logs and hoping somebody thought to log the right thing.
Event Streams Are AI-Ready Data
This one has been sneaking up on us.
An event-sourced system doesn’t just store what the world looks like right now. It stores how it got there — a sequence of business actions, ordered in time, with context attached. That turns out to be exactly the kind of data that AI systems are hungry for.
Look at what an AI agent sees when it queries a traditional database: customer has address X, has ordered Y items, has a balance of Z. A flat snapshot. Static. Not much to work with beyond simple lookups.
Now look at what it sees in an event stream:
CustomerCreatedEvent → new customer in Seattle
ProductViewedEvent (×12) → browsed camping gear heavily
CartUpdatedEvent (×4) → added/removed items, compared prices
OrderPlacedEvent → bought mid-range tent, not the premium one
CustomerMovedEvent → relocated to Denver
ProductViewedEvent (×8) → browsing ski equipment
That’s a behavioral narrative. Purchase hesitation patterns. Geographic lifestyle shifts. Seasonal interest changes. None of that exists in a snapshot of current state.
Now, I should be honest here — databases have transaction logs, and you can do analytics on those. But transaction logs record row-level mutations: SET column = value. Domain events record business actions with semantic meaning. The difference is the difference between “field changed” and “customer moved to Denver and started shopping for ski gear.” One is plumbing. The other is insight.
Why Hazelcast?
So you’re sold on event sourcing (or at least willing to keep reading). Why Hazelcast as the platform?
Because event sourcing needs three things to work well, and Hazelcast handles all of them:
Fast writes for the event stream. Events go into an IMap — in-memory, sub-millisecond. You’re not waiting for a disk flush on the critical path.
Real-time processing. Hazelcast’s Event Journal streams events to Jet pipelines as they arrive. No polling, no batch windows. Events flow through processing stages — persist to event store, update materialized view, publish to subscribers — as a continuous pipeline.
Fast reads from materialized views. Once the pipeline updates a view, queries against it are sub-millisecond IMap lookups. No joins, no aggregation at query time.
Add ITopic for pub/sub event distribution across services and native horizontal scaling across cluster nodes, and you’ve got a complete event sourcing infrastructure without bolting together five different technologies.
Core Concepts
Domain Events
A domain event represents something meaningful that happened in your business. In our framework, events extend DomainEvent:
public abstract class DomainEvent<D extends DomainObject<K>, K>
implements UnaryOperator<GenericRecord>, Serializable {
// Event identification
protected String eventId; // Unique ID (UUID)
protected String eventType; // e.g., "CustomerCreated"
protected String eventVersion; // For schema evolution
// Event metadata
protected String source; // Service that created it
protected Instant timestamp; // When it happened
// Domain object reference
protected K key; // Key of affected entity
// Traceability
protected String correlationId; // Links related events
// The key method: how does this event change state?
public abstract GenericRecord apply(GenericRecord currentState);
}
The apply() method is the interesting part — it defines how this event transforms current state into new state. Each event type implements its own version of this, which means the logic for “what does this event do?” lives with the event itself, not in some giant switch statement somewhere.
If you’ve spent time with Domain-Driven Design, you’ll notice that our DomainObject<K> is what DDD calls an aggregate root. Every event is scoped to exactly one aggregate — CustomerCreatedEvent belongs to the Customer aggregate, OrderPlacedEvent belongs to the Order aggregate. Consistency is enforced within the aggregate boundary: all events for a given customer are ordered and processed sequentially (we’ll see how Hazelcast’s partition threading makes this work in Part 2). Across aggregates — between a customer and an order, say — we accept eventual consistency and coordinate through sagas. We don’t go deep into DDD vocabulary in this series, but if you know the terminology, you’ll see the boundaries everywhere.
A Concrete Example
Here’s CustomerCreatedEvent from our eCommerce implementation:
public class CustomerCreatedEvent extends DomainEvent<Customer, String> {
public static final String EVENT_TYPE = "CustomerCreated";
private String email;
private String name;
private String address;
private String phone;
public CustomerCreatedEvent(String customerId, String email,
String name, String address) {
super(customerId); // Sets the key
this.eventType = EVENT_TYPE;
this.email = email;
this.name = name;
this.address = address;
}
@Override
public GenericRecord apply(GenericRecord currentState) {
// For creation events, ignore current state and create new
return GenericRecordBuilder.compact("Customer")
.setString("customerId", key)
.setString("email", email)
.setString("name", name)
.setString("address", address)
.setString("status", "ACTIVE")
.setInt64("createdAt", Instant.now().toEpochMilli())
.build();
}
}
Notice that apply() for a creation event ignores currentState entirely — there is no current state, we’re creating it. Update events would read from currentState and modify specific fields.
The Event Store
The event store is append-only. Events go in and they don’t come out (well, they come out for reads, but they’re never modified or deleted). It’s the permanent, immutable record of everything that happened.
public interface EventStore<D extends DomainObject<K>, K,
E extends DomainEvent<D, K>> {
void append(K key, GenericRecord eventRecord);
List<GenericRecord> getEventsForKey(K key);
void replayByKey(K key, Consumer<GenericRecord> eventConsumer);
long getEventCount();
}
Materialized Views
If events are the source of truth, how do you actually query current state without replaying the entire event history every time someone calls GET /customers/123? You don’t. You maintain materialized views.
A materialized view is a pre-computed projection — it gets updated incrementally as each event flows through the pipeline. Think of it as a cache that’s always consistent with the event stream, because it’s derived from it.
The view always reflects the latest state. Reads are instant — it’s just an IMap lookup.
The Event Flow
Here’s the full lifecycle of an event in our framework:
A REST request comes in. The service creates an event and drops it into the pending events IMap. The Jet pipeline picks it up via the Event Journal, then processes it through four stages: persist to the event store, update the materialized view, publish to subscribers via ITopic, and write a completion record. The API returns the customer data from the now-updated view.
That’s eight steps for what CRUD does in one database write. And yeah, that’s more moving parts. But each of those parts is doing something valuable — you get an audit trail, a materialized view, cross-service notification, and async completion tracking, all from a single event submission.
The Service Layer
Here’s what it looks like from the service code’s perspective:
@Service
public class AccountService {
private final EventSourcingController<Customer, String,
DomainEvent<Customer, String>> controller;
public CompletableFuture<Customer> createCustomer(CustomerDTO dto) {
CustomerCreatedEvent event = new CustomerCreatedEvent(
UUID.randomUUID().toString(),
dto.getEmail(),
dto.getName(),
dto.getAddress()
);
UUID correlationId = UUID.randomUUID();
return controller.handleEvent(event, correlationId)
.thenApply(completion -> {
return getCustomer(event.getKey()).orElseThrow();
});
}
public Optional<Customer> getCustomer(String customerId) {
GenericRecord gr = controller.getViewMap().get(customerId);
if (gr == null) return Optional.empty();
return Optional.of(Customer.fromGenericRecord(gr));
}
}
The service creates an event, not a database record. handleEvent() returns a CompletableFuture that completes when the pipeline has processed the event all the way through. And reads come from the materialized view — no event replay, just a fast IMap lookup.
Benefits in Practice
The architectural arguments are nice, but let’s see what this actually looks like when you’re debugging at 2am or fielding a compliance request.
Audit Trail
Every change is recorded. Every single one.
List<GenericRecord> history = eventStore.getEventsForKey("cust-123");
for (GenericRecord event : history) {
System.out.println(event.getString("eventType") + " at " +
event.getInt64("timestamp"));
}
CustomerCreated at 1706374800000
CustomerUpdated at 1706375100000
CustomerAddressChanged at 1706461200000
No special audit logging framework. No triggers on database tables. The event store is the audit trail, because it’s the source of truth.
Time Travel
Need to see what the data looked like last Thursday?
GenericRecord pastState = null;
for (GenericRecord event : eventStore.getEventsForKey("cust-123")) {
if (event.getInt64("timestamp") > targetTime) break;
pastState = applyEvent(event, pastState);
}
Replay events up to the timestamp you care about and stop. You now have the exact state of that entity at that moment. Try doing that with a database that only stores current state.
View Rebuilding
Found a bug in your view update logic? Fix the code and rebuild:
viewUpdater.rebuild(eventStore);
Clear the view, replay all events through the corrected logic, done. With CRUD, if your update logic had a bug that corrupted data, that data is just… corrupted. The correct values are gone. You’re restoring from backups and hoping for the best.
Service Independence
The Order service doesn’t call Account to get customer names. It has its own materialized view built from customer events:
Each service maintains the views it needs for its own work. No cross-service calls at query time.
Performance
All the numbers here are from a single Apple MacBook Pro (M4 chip, 16GB RAM) running all services via Docker Compose. This is a baseline — the framework is designed to scale horizontally across Hazelcast cluster nodes and Kubernetes replicas, so treat these as a floor, not a ceiling.
The materialized view layer — the raw speed of IMap operations inside the JVM, bypassing HTTP entirely:
Metric
Value
View update throughput
100,000+ ops/second
P99 latency
< 1ms
View read latency
< 0.5ms
End-to-end through the full pipeline (REST → event store → Jet → view update → ITopic publish):
Metric
Value
Throughput
~300+ TPS
That end-to-end number is using curl in bash subshells as the load driver, which is about the least efficient way to generate HTTP load (every request forks a process and opens a new connection). A proper load testing tool with connection pooling — wrk, k6, Gatling — would push that number higher for the same pipeline.
What’s Next?
This was the foundation — what event sourcing is, why you’d use it, and how it works with Hazelcast. The series continues with deeper dives into the individual components:
Building the event pipeline with Hazelcast Jet
Materialized views for fast queries
Observability in event-sourced systems
The saga pattern for distributed transactions
Vector similarity search with Hazelcast
AI-powered microservices with MCP
Circuit breakers and resilience patterns
Transactional outbox and exactly-once delivery
Dead letter queues and idempotency
Performance engineering at scale
…and more as the framework evolves.
Getting Started
The complete framework is on GitHub — full source code, Docker Compose setup, a sample eCommerce application, load testing tools, and over 2,000 tests.
git clone https://github.com/myawnhc/hazelcast-microservices-framework
cd hazelcast-microservices-framework
./scripts/docker/start.sh
./scripts/load-sample-data.sh
./scripts/demo-scenarios.sh
Event sourcing asks you to shift your thinking from “store what the world looks like” to “record what happened.” It’s a different mental model, and it takes some getting used to. But the payoff — real service decoupling, a complete audit trail, time travel debugging, views you can rebuild from scratch, and an event stream that’s ready for AI to mine — makes it worth the adjustment for a lot of systems.
That’s where we’ll leave things for today.
Next up: Building the Event Pipeline with Hazelcast Jet
I used Claude’s desktop interface for iterative design, then handed off to Claude Code for implementation.
After deciding to revive my Hazelcast Microservices Framework (MSF) project, and to do so using Claude AI to do much of the heavy lifting, it came down to figuring out how to actually do this. I had no playbook for it. Nobody does, really — we’re all making this up as we go.
I wanted to be transparent about my use of Claude, and at the same time I think the development process is interesting enough to be worthy of discussion. (Heck, maybe it’s more interesting than the framework blog posts I set out to write.) So I expect to end up with a dual series of blog posts: the framework posts — started by Claude, co-edited together, and given a final polish by me — interleaved with my observations on how the collaboration effort worked.
This first “behind the scenes” post covers the design phase: going from a vague idea to a set of design documents and an implementation plan, all before writing a single line of code.
Starting the Conversation
Here was my original prompt to Claude:
I want to use Claude Code to help me finish a demonstration project I started some time ago to show how to implement microservices using Hazelcast. (The main value of Hazelcast is to create materialized views of domain objects to maintain in-memory current state.) If it’s more effective, we can restart with a blank sheet rather than modify the existing project.
I’d really like to iterate over the design several times before any coding starts — is that best done in Claude Code, or using this desktop interface? Ideally, creating various specifications or design documents before any coding starts would be perfect, if Claude can use these various documents as a guide to the coding process.
How do we start?
Claude immediately suggested splitting the work across two interfaces: use the desktop/web interface for design discussions and document creation, then move to Claude Code for implementation. Made sense to me — the conversational interface is better for back-and-forth design iteration, while Claude Code excels at multi-file code generation with direct access to the project directory.
This turned out to be excellent advice. The design phase involved a lot of “what about this?” and “actually, let’s reorganize that” — the kind of exploratory conversation that works much better in a chat interface than in a code-focused tool. I tried doing some design work in Claude Code early on and it was noticeably worse — like trying to brainstorm on a whiteboard that keeps trying to compile your diagrams.
The Design Phase: A Roadmap in Nine Documents
What followed was an extended design conversation that produced nine documents over the course of a single session. I’m not going to walk through every one in detail — you can follow the links if you’re curious — but a few of them are worth talking about because of what they reveal about the collaboration process.
Getting Started: Template and Domain
Claude’s first move was to produce a comprehensive design document template covering everything from executive summary to demonstration scenarios. We never actually completed it — the conversation quickly moved in a more specific direction — but it served its purpose as a structural starting point. The architectural equivalent of a napkin sketch: useful for getting the conversation going, not meant to survive contact with reality.
Before we could fill in any template, though, we needed to pick a domain for the demonstration. Claude laid out a comparison between eCommerce and Financial Services, and we settled on a hybrid approach: start with eCommerce (universally understood, clear event flows, and I had existing code to reference) but design the framework to be domain-agnostic so other domains could be plugged in later. We also simplified from four services down to three: Account, Inventory, and Order. (A fourth service, Payment, showed up later when we built out the saga patterns. Scope creep, but the useful kind.)
That decision led to the eCommerce design document — a detailed Phase 1 design covering all three services, their APIs, events, and materialized views. Three view patterns came out of it: denormalized views (joining customer, product, and order data), aggregation views (pre-computing order statistics), and real-time status views (current inventory levels). If you’ve read the previous posts in this series, you’ll recognize these as exactly the kind of thing that makes Event Sourcing + CQRS worth the effort.
Where I Pushed Back
The conversation then turned to longer-term goals. I had ideas for observability dashboards, microbenchmarking, pluggable implementations, saga patterns, and more — far beyond what could fit in a Phase 1. Claude organized all of this into a phased requirements document spanning five phases.
We iterated over this several times, adding and reorganizing. The most significant change I made was moving Event Sourcing from Phase 2 to Phase 1. Claude had initially positioned it as an advanced feature, but I saw it as the fundamental organizing principle of the entire framework — events are the source of truth, not database rows. Once I explained my existing Hazelcast Jet pipeline architecture (where handleEvent() writes to a PendingEvents map, which triggers a Jet pipeline that persists to the EventStore, updates materialized views, and publishes to the event bus), Claude immediately agreed and restructured the phases accordingly.
This was one of the more interesting moments in the collaboration. Claude had made a reasonable default assumption about complexity ordering, but I had domain-specific knowledge about how the architecture should actually work. The back-and-forth was natural — I explained my reasoning, Claude incorporated it, and the result was better for it. If I’d just accepted the initial phasing without pushing back, the entire project would have been organized around a less coherent architecture. And honestly, I almost did just accept it. It looked reasonable. Sometimes the most important contribution you make is going “wait, actually…” when the first answer seems fine.
Other additions during this iteration:
Vector Store integration (Phase 3, optional) for product similarity search
An MCP Server (Phase 3) to let AI assistants query and operate the system
Open source mandate — everything in Phases 1-2 must run on Hazelcast Community Edition
Blog post series structure — features developed in blog-post-sized chunks
Architecture, Code Review, and the Rewrite Decision
The next few documents came quickly. The Event Sourcing discussion led to a dedicated architecture document detailing the Jet pipeline design — based heavily on my existing implementation, but now formally documented with all six pipeline stages, the EventStore design, and how event replay would work.
Then I uploaded several key source files from the original project for Claude to review: the EventSourcingController, DomainObject, SourcedEvent (later renamed to DomainEvent), EventStore, and EventSourcingPipeline. Claude produced a thorough code review comparing the existing code against the design documents. The verdict was encouraging — the core implementation was solid and matched the Phase 1 design almost perfectly. Claude recommended incremental enhancement: add correlation IDs, framework abstractions, observability, and tests on top of what was already there.
I went the other way. After thinking about the package naming, dependency versions, and scope of changes needed, I decided on a clean reimplementation using the existing code as a blueprint. This let us start with the right project structure, package names (com.theyawns.framework.*), and dependency versions (Spring Boot 3.2.x, Hazelcast 5.6.0) from the beginning rather than refactoring them in later. Sometimes — as I’d noted in the previous post — the right move is to stop patching the old cabinets and start fresh.
I won’t pretend this was a purely rational decision. Part of it was just wanting that clean-slate feeling — new project, new structure, no legacy cruft staring at me from the imports. Developers love a greenfield. We can’t help it.
The Implementation Plan
Once the architecture was validated and we’d agreed on the approach, Claude created a detailed Phase 1 implementation plan — a three-week, day-by-day schedule with code templates, success criteria, and task checklists:
We made a few tweaks (updating Hazelcast from 5.4.0 to 5.6.0, for instance), and then it was time to move to code.
The Handoff to Claude Code
Claude provided specific instructions for transitioning to Claude Code, including a context block to paste when starting the session:
I'm building a Hazelcast-based event sourcing microservices framework.
Project location: hazelcast-microservices-framework/
Current state: Design documents complete, ready for implementation
Key decisions:
- Clean reimplementation (no existing code to port)
- Spring Boot 3.2.x + Hazelcast 5.6.0 Community Edition
- Package: com.theyawns.framework.*
- Three services: Account, Inventory, Order (eCommerce domain)
- Event sourcing with Hazelcast Jet pipeline
- REST APIs only
Implementation plan: docs/implementation/phase1-implementation-plan.md
Starting with Day 1: Maven project setup + core abstractions
Please read the implementation plan and let's begin.
The whole point of the “design first” approach: you’re not asking the AI to guess at your architecture. You’re handing it a blueprint. The more detailed the blueprint, the less time you spend arguing about load-bearing walls later.
Documents 7-9: Claude Code Configuration
Before making the jump, I asked Claude about setup suggestions for Claude Code. This produced three more documents:
CLAUDE.md (originally called .clinerules — I’m still not sure where that name came from) is the main configuration file that Claude Code reads automatically. It defines code standards, patterns, pitfalls to avoid, and documentation requirements. This file evolved a lot over the course of the project; looking at the commit history gives a good sense of how the “rules” grew and adapted as we ran into new situations. (More on that in a future post — it turned out to be one of the more interesting aspects of the whole process.)
claude-code-agents.md defined eight specialized agent personas — Framework Developer, Service Developer, Test Writer, Documentation Writer, Pipeline Specialist, and others — each with specific rules, code patterns, and checklists. The idea was to switch between personas depending on the task at hand (e.g., “Switch to Test Writer agent. Write comprehensive tests for EventSourcingController.”). Whether this actually helped or was just a placebo is something I’m still not sure about, honestly.
A docs organization guide rounded out the set, providing a recommended directory structure for keeping all the documentation organized as the project grew.
What Came Next
The resulting project grew well beyond the original three-week Phase 1 plan. At 150 commits, it now includes four microservices (Payment was added for saga demonstrations), an API Gateway, an MCP server for AI integration, choreographed and orchestrated saga patterns, PostgreSQL persistence, Grafana dashboards, and more. The three-week plan took considerably longer than three weeks. So it goes.
But all of that implementation work — and the interesting stories about how human-AI collaboration played out during coding — is material for future posts.
What I’d Do Differently (And What I’d Do Again)
If you’re thinking about using AI for a non-trivial coding project, here’s what I took away from the design phase.
Use the right tool for each phase. The conversational interface is great for the messy, exploratory work of figuring out what you’re actually building. Claude Code is great for building it.
Iterate on design before you write code. We went through multiple rounds of revision on the requirements and architecture documents. Each round caught issues or surfaced priorities (like Event Sourcing belonging in Phase 1) that would have been much more expensive to discover during implementation. Measure twice, cut once. The carpenter’s rule exists for a reason.
Bring your domain knowledge — and don’t be shy about pushing back. Claude made strong default recommendations, but the most valuable moments came when I disagreed based on my understanding of Hazelcast and the architecture I wanted. The AI is a powerful collaborator, but it doesn’t know what you know. If something feels wrong, say so. That’s where the real value of the collaboration happens.
And document everything. I mean it. The design documents weren’t just planning artifacts — they became living reference material that Claude Code used throughout implementation. The CLAUDE.md file in particular became a continuously evolving guide that shaped code quality across the entire project. Every hour spent on documentation saved multiples in “no, that’s not what I meant” corrections later. I’ve never been great about documentation discipline, so having an AI that actually reads and follows the docs was a surprisingly effective motivator to keep them current.
How a side project connecting Event Sourcing to Hazelcast sat
unfinished for years — and why I decided to bring it back with an AI
collaborator.
In my previous post, I shared some of my
thinking about Event-Driven Microservices — the coupling problems, the
mental shift toward thinking in events, and the patterns (Event
Sourcing, CQRS, materialized views) that make it all work. That post
was conceptual. This one is personal.
I’ve been playing around with design concepts in this area for some
time. While I was an employee of Hazelcast, I frequently worked with
customers and prospects to show how Hazelcast Jet — an event stream
processing engine built into the Hazelcast platform — could be used to
build event processing solutions that would scale while continuing to
provide low latency. These conversations were always framed around
stream processing, though. Even when the intended use case was around
microservices, we didn’t explicitly get into the Event Sourcing
pattern. As someone coming from a background that was
database-centric, the concept of events as the source of truth was a
bit much for me.
The Light Bulb Moment
It was a light bulb moment when I realized that Hazelcast Jet could
fit naturally into an Event Sourcing architecture — and that Hazelcast
IMDG (the in-memory data grid, or caching layer) could concurrently
maintain materialized views representing the current state of domain
objects.
Think about it: Event Sourcing needs an event log and a processing
pipeline. Hazelcast Jet is a processing pipeline. CQRS needs a fast
read-side store that’s kept in sync with the event stream. Hazelcast
IMDG is a fast read-side store. Event Sourcing + CQRS maps
beautifully onto Jet + IMDG (even though that acronym is officially
retired — it’s all just “Hazelcast” now).
And from there, I really wanted to demonstrate this. The original
Microservices Framework project began.
Version 1: The Proof of Concept
The first version was focused on proving the core idea worked. Could I
wire up a Hazelcast Jet pipeline to process domain events, persist
them to an event store, and update materialized views — all in a way
that was generic enough to work across different services?
The answer was yes. The central pattern that emerged was
straightforward: a service’s handleEvent() method writes incoming
events to a PendingEvents map, which triggers a Jet pipeline that
persists events to the EventStore, updates materialized views, and
publishes to an event bus for other services to consume. It worked,
and it was fast.
Now, the central components of the architecture — the domain object,
event class, controller, and pipeline — have survived relatively
intact through multiple iterations of the implementation. The bones
were good. But a lot of the specific implementation choices I made
around those bones haven’t aged all that well.
You know how it goes with side projects. Technical debt accumulates
quietly, one “I’ll fix this later” at a time, until you’re looking at
a codebase where you know you’d make different choices if you were
starting over — but the sunk cost of time already invested keeps you
from actually doing it. It’s the software equivalent of a kitchen
renovation where you keep patching the old cabinets because ripping
them out feels like too big a project for a weekend.
That version of the framework is still hanging around on GitHub,
although I decided not to link to it here as I may take it down at
any time. (Upcoming posts will link to the improved version, so
embedding links to the original will inevitably lead to someone
grabbing the wrong one.)
I got it to a working state, but there was a long list of things I
wanted to add. Saga patterns for coordinating multi-service
transactions. Observability dashboards. Comprehensive tests.
Documentation that went beyond “read the code.” Each of these was a
meaningful chunk of work, and progress slowed to a crawl.
The Stall
Let’s be honest about what happened: the project stalled. Not
dramatically — it wasn’t ever really abandoned. It just… stopped
moving. Every few months I’d open the codebase, when I had some extra
time, and make a few minor, inconsequential changes while thinking of
the more ambitious refactorings or added features that I’d get to when
time permitted.
If you’ve ever maintained a passion project alongside a day job, you
know this feeling. The ideas don’t go away — they sit in the back of
your mind, periodically surfacing with a pang of “I should really get
back to that.” But the activation energy to restart is high, especially
when the next step isn’t a fun new feature but the grind of
scaffolding, configuration, and test coverage. So you close the laptop
and tell yourself next month will be different. (It won’t be.)
Enter AI-Assisted Development
In early 2025, I started using Claude for various coding tasks and was
genuinely surprised by the results. This wasn’t autocomplete on
steroids — I could describe an architectural pattern and get back code
that understood the why, not just the what. I could say “this
needs to work like an event journal with replay capability” and get
something that actually accounted for ordering guarantees and
idempotency.
That’s when the thought crystallized: what if I could use this to
break through the stall?
Here’s the thing — the stuff that had been blocking me wasn’t the hard
design work. I knew what the architecture should look like. The
bottleneck was the sheer volume of implementation grind: scaffolding
new services, writing comprehensive tests, wiring up Docker
configurations, producing documentation. Exactly the kind of work
where you need focused hours, and a side project never has enough of
those.
Now, I want to be clear about what I mean here, because “AI wrote my
code” carries a lot of baggage. This wasn’t about handing off the
project and checking back in when it was done. It was about having a
collaborator who could take high-level design direction and turn it
into working code at a pace that made the project viable again. I’d
provide the domain expertise, the architectural decisions, and the
quality bar. The AI would provide the throughput.
Making the Decision
I decided to move forward with a clean reimplementation rather than
trying to evolve the existing codebase. The core patterns from the
original work — the Jet pipeline architecture, the event store design,
the materialized view update strategy — were proven and would carry
forward. But the project structure, package naming, dependency
versions, and framework abstractions would start fresh. Sometimes the
best way to fix a kitchen is to actually rip out the cabinets.
The plan was to use Claude’s desktop interface for iterative design
discussions (requirements, architecture, implementation planning) and
then hand off to Claude Code for the actual coding. Design first,
then build — with comprehensive documentation at every step so the AI
would have rich context to work from.
What happened next — the design phase, the handoff to Claude Code,
and the surprises along the way — is the subject of the
next post.