Introduction
Over the past several years working in e-commerce, I've architected and scaled microservices handling millions of customer requests daily, including platforms processing over $100M in transactions. In this article, I'll share practical patterns for building Spring Boot microservices that can handle production scale.
This isn't theory—these are battle-tested approaches proven under real-world load.
1. Service Boundaries and Domain-Driven Design
The biggest mistake teams make is creating microservices that are too granular or tightly coupled.
Key principle: Each service should own a complete business capability.
// ❌ BAD: Services that are too granular
CustomerNameService
CustomerAddressService
// ✅ GOOD: Services that own complete domains
CustomerProfileService // owns all customer identity/profile data
SubscriptionService // owns subscriptions, benefits, rewards
When building a subscription platform, a SubscriptionService should own the complete domain: enrollment, benefits tracking, payment integration, and conversions. This prevents the "distributed monolith" where you call 10 services for one operation.
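2. Handling Distributed Transactions with the Saga Pattern
At scale, distributed transactions become your enemy: a two-phase commit that spans several services holds locks while it waits on the slowest participant. A saga replaces it with a sequence of local steps, each paired with a compensating action. Here's the orchestration approach: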
@Service
public class SubscriptionEnrollmentOrchestrator {
    // customerClient, paymentClient, subscriptionRepository and eventPublisher
    // are injected fields, omitted for brevity.

    // Deliberately not @Transactional: a database transaction should not stay
    // open across remote calls. save() runs in its own local transaction, so a
    // failed write surfaces inside the try block below, and the event is only
    // published once the subscription is committed.
    public EnrollmentResult enrollMember(EnrollmentRequest request) {
        // Step 1: Validate customer (synchronous)
        CustomerProfile customer = customerClient.getCustomer(request.getCustomerId())
            .orElseThrow(() -> new CustomerNotFoundException());

        // Step 2: Process payment (synchronous with compensation)
        PaymentResult payment;
        try {
            payment = paymentClient.processPayment(request.getPaymentMethod());
        } catch (PaymentException e) {
            throw new EnrollmentFailedException("Payment failed", e);
        }

        // Step 3: Create subscription (local transaction)
        Subscription subscription;
        try {
            subscription = createSubscription(customer, payment);
            subscriptionRepository.save(subscription);
        } catch (Exception e) {
            // Compensate payment
            paymentClient.refundPayment(payment.getTransactionId());
            throw new EnrollmentFailedException("Subscription creation failed", e);
        }

        // Step 4: Publish event for async operations
        eventPublisher.publish(new SubscriptionEnrolledEvent(subscription));
        return EnrollmentResult.success(subscription);
    }
}
Key lessons:
- Make operations idempotent - Include transaction IDs so retries don't cause duplicates
- Compensate explicitly - If step N fails, undo steps 1 through N-1
- Separate critical path from eventual consistency - Payment is synchronous, emails are async
3. Strategic Caching for High-Traffic Systems
We reduced database load by 85% through strategic caching:
@Service
public class SubscriptionService {
    private final SubscriptionRepository repository;
    private final LoadingCache<String, Optional<Subscription>> cache;

    public SubscriptionService(SubscriptionRepository repository) {
        this.repository = repository;
        this.cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats() // hit/miss rates are available via cache.stats()
            // Note: an empty Optional (no subscription yet) is cached for the TTL too
            .build(CacheLoader.from(repository::findByCustomerId));
    }

    @Transactional
    public Subscription updateSubscription(String id, SubscriptionUpdate update) {
        Subscription updated = repository.findById(id)
            .map(s -> applyUpdate(s, update))
            .orElseThrow(() -> new SubscriptionNotFoundException(id));
        repository.save(updated);
        // Invalidate on write. A read that lands before the commit can still
        // reload the old row; the 5-minute TTL bounds how long it stays stale.
        cache.invalidate(updated.getCustomerId());
        return updated;
    }
}
Caching layers:
- Application-level (Guava/Caffeine) - 5-minute TTL for hot data
- Redis distributed cache - 15-minute TTL for cross-instance consistency
- Database with proper indexing
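How the three layers cooperate on a read can be sketched in plain Java. This is an illustration under stated assumptions, not the production code: `TieredLookup` is a hypothetical name, the two maps stand in for the local cache and Redis, and TTL handling is left out to keep the lookup order visible.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical read-through lookup: local cache, then shared cache, then database. */
class TieredLookup<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>(); // stands in for Guava/Caffeine
    private final Map<K, V> shared;                             // stands in for Redis
    private final Function<K, Optional<V>> database;            // source of truth

    TieredLookup(Map<K, V> shared, Function<K, Optional<V>> database) {
        this.shared = shared;
        this.database = database;
    }

    Optional<V> get(K key) {
        V value = local.get(key);
        if (value != null) {
            return Optional.of(value); // fastest path: no network hop
        }
        value = shared.get(key);
        if (value == null) {
            Optional<V> loaded = database.apply(key);
            if (loaded.isEmpty()) {
                return Optional.empty(); // misses are not cached in this sketch
            }
            value = loaded.get();
            shared.put(key, value); // populate the shared tier for other instances
        }
        local.put(key, value); // populate the local tier for the next read
        return Optional.of(value);
    }

    /** Call after a write so the next read reloads from the database. */
    void invalidate(K key) {
        shared.remove(key);
        local.remove(key);
    }
}
```

Note that `invalidate` can only clear the local tier of the instance that performed the write. Other instances keep serving their local copy until it expires, which is one reason to give the application-level cache the shorter TTL.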
4. Circuit Breakers and Resilience
When calling other services, assume they will fail:
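@Slf4j
@Service
public class CustomerProfileClient {
    // restTemplate and customerCache are injected fields, omitted for brevity.
    // The relative path below assumes restTemplate was built with a root URI.

    @CircuitBreaker(name = "customer-profile", fallbackMethod = "getCustomerFallback")
    @TimeLimiter(name = "customer-profile")
    public CompletableFuture<CustomerProfile> getCustomer(String customerId) {
        // URI template variable: the ID is encoded rather than concatenated into the path
        return CompletableFuture.supplyAsync(() ->
            restTemplate.getForObject("/api/customers/{id}", CustomerProfile.class, customerId)
        );
    }

    // Same parameters as getCustomer plus the exception that triggered the fallback
    private CompletableFuture<CustomerProfile> getCustomerFallback(String customerId, Exception ex) {
        log.warn("Customer service unavailable for customer {}, using cached data", customerId, ex);
        return CompletableFuture.completedFuture(
            customerCache.get(customerId)
                .orElseThrow(() -> new ServiceUnavailableException("No cached data", ex))
        );
    }
}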
Configuration:
CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // Open if 50% fail
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100) // Consider last 100 calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))
    .build();
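To make the meaning of those settings concrete, here is a stripped-down, count-based breaker in plain Java. It is a teaching sketch and not how Resilience4j is implemented: `SimpleCircuitBreaker` is a hypothetical class, it uses a single trial call in the half-open state, and it ignores slow calls entirely.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical count-based circuit breaker, for illustration only. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;               // like slidingWindowSize
    private final double failureRateThreshold;  // percent, like failureRateThreshold
    private final long waitInOpenNanos;         // like waitDurationInOpenState
    private final LongSupplier nanoClock;       // injectable so tests can control time
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private int failures;
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         Duration waitInOpenState, LongSupplier nanoClock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenNanos = waitInOpenState.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Callers check this before each remote call. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && nanoClock.getAsLong() - openedAt >= waitInOpenNanos) {
            state = State.HALF_OPEN; // wait elapsed: let a trial call through
        }
        return state != State.OPEN;
    }

    /** Callers report the outcome of each call that was allowed. */
    synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return; // late result from a call started before the breaker opened
        }
        if (state == State.HALF_OPEN) {
            transitionTo(failed ? State.OPEN : State.CLOSED); // the trial call decides
            return;
        }
        window.addLast(failed);
        if (failed) failures++;
        // Evict the oldest outcome once the window holds more than windowSize calls
        if (window.size() > windowSize && window.removeFirst()) failures--;
        boolean windowFull = window.size() == windowSize;
        if (windowFull && 100.0 * failures / windowSize >= failureRateThreshold) {
            transitionTo(State.OPEN);
        }
    }

    synchronized State state() {
        return state;
    }

    private void transitionTo(State next) {
        state = next;
        window.clear();
        failures = 0;
        if (next == State.OPEN) openedAt = nanoClock.getAsLong();
    }
}
```

With the configuration above, this corresponds to `new SimpleCircuitBreaker(100, 50, Duration.ofSeconds(30), System::nanoTime)`: once 50 of the last 100 calls have failed, requests are rejected for 30 seconds before a trial call is allowed.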
5. Structured Logging and Distributed Tracing
You can't fix what you can't see:
@Slf4j
@RestController
public class SubscriptionController {
    @PostMapping("/subscriptions")
    public ResponseEntity<SubscriptionResponse> enrollMember(
            @RequestBody EnrollmentRequest request,
            @RequestHeader("X-Request-ID") String requestId) {
        MDC.put("requestId", requestId);
        MDC.put("customerId", request.getCustomerId());
        try {
            log.info("Starting enrollment",
                kv("tier", request.getTier()),
                kv("paymentMethod", request.getPaymentMethod().getType()));
            Instant start = Instant.now();
            EnrollmentResult result = enrollmentService.enroll(request);
            log.info("Enrollment completed",
                kv("subscriptionId", result.getSubscriptionId()),
                kv("durationMs", Duration.between(start, Instant.now()).toMillis()));
            return ResponseEntity.ok(toResponse(result));
        } finally {
            MDC.clear();
        }
    }
}
Logging standards:
- Use structured key-value pairs, not string concatenation
- Track request IDs across service boundaries
- Include customer ID and operation type in every log
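Tracking a request ID across service boundaries means two things: reusing the incoming ID when there is one, and attaching it to every outbound call. A minimal sketch using only the JDK's `java.net.http` API follows. `RequestIdPropagation` is a hypothetical helper; in a Spring service the same logic would more likely live in a servlet filter and a client interceptor.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

/** Hypothetical helper for carrying a request ID from inbound to outbound calls. */
class RequestIdPropagation {
    static final String HEADER = "X-Request-ID";

    /** Reuses the caller's ID when present; otherwise starts a new one at the edge. */
    static String resolve(String incomingRequestId) {
        if (incomingRequestId == null || incomingRequestId.isBlank()) {
            return UUID.randomUUID().toString();
        }
        return incomingRequestId;
    }

    /** Builds an outbound GET that carries the same ID to the downstream service. */
    static HttpRequest outbound(URI uri, String requestId) {
        return HttpRequest.newBuilder(uri)
            .header(HEADER, requestId)
            .GET()
            .build();
    }
}
```

Because every service logs the same `requestId`, a single search in the log aggregator returns the whole path of one enrollment through the system.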
6. Database Query Optimization
Slow queries kill scalability:
@Entity
@Table(name = "subscriptions", indexes = {
    @Index(name = "idx_customer_id", columnList = "customer_id"),
    @Index(name = "idx_tier_status", columnList = "tier, status"),
    @Index(name = "idx_expires_at", columnList = "expires_at")
})
public class Subscription {
    // @BatchSize loads the lazy collections of up to 50 subscriptions per query,
    // shrinking N+1 queries to roughly N/50 + 1
    @OneToMany(mappedBy = "subscription", fetch = FetchType.LAZY)
    @BatchSize(size = 50)
    private List<Benefit> benefits;
}

@Repository
public interface SubscriptionRepository extends JpaRepository<Subscription, String> {
    // Fetch join loads a subscription and its benefits in a single query.
    // DISTINCT collapses the duplicate root rows the join yields
    // (needed before Hibernate 6, harmless after).
    @Query("SELECT DISTINCT s FROM Subscription s LEFT JOIN FETCH s.benefits WHERE s.customerId = :customerId")
    Optional<Subscription> findByCustomerIdWithBenefits(@Param("customerId") String customerId);
}
Performance lessons:
- Index the columns you filter, join, and sort on; every index also slows writes, so drop the ones that go unused
- Use batch fetching or fetch joins to avoid N+1 queries
- Paginate large results
- Monitor and alert on queries >100ms
7. Connection Pool Tuning
spring:
  datasource:
    hikari:
      # Formula: (core_count * 2) + effective_spindle_count
      # For 4-core with SSD: (4 * 2) + 1 = 9, rounded up to 10
      maximum-pool-size: 10
      minimum-idle: 5
      connection-timeout: 3000 # ms
      validation-timeout: 1000 # ms
      leak-detection-threshold: 60000 # ms
      max-lifetime: 1800000 # 30 minutes
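The sizing formula in the comment is simple enough to check in code, and the same arithmetic is worth running across the whole fleet. The sketch below uses hypothetical helper names; the formula is commonly read as referring to the database server's cores, so treat the result as a starting point to validate under load.

```java
/** Hypothetical helpers for the pool sizing arithmetic. */
class PoolSizing {
    /** pool size = (core_count * 2) + effective_spindle_count */
    static int recommendedPoolSize(int coreCount, int effectiveSpindleCount) {
        return coreCount * 2 + effectiveSpindleCount;
    }

    /** Every application instance opens its own pool against the same database. */
    static int totalConnections(int instances, int poolSizePerInstance) {
        return instances * poolSizePerInstance;
    }
}
```

`recommendedPoolSize(4, 1)` gives 9 for the 4-core example. With six instances at `maximum-pool-size: 10`, `totalConnections(6, 10)` is 60, a number that has to fit under the database's own connection limit with room to spare for rolling updates.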
8. Zero-Downtime Deployments
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: subscription-service
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 3 # Add 50% extra during update
      maxUnavailable: 1 # Keep 5/6 running
  template:
    spec:
      containers:
        - name: subscription-service
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
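The two rollout settings bound how many pods exist and how many are available at any moment of the update. A small sketch of that arithmetic, with hypothetical helper names and absolute values only (Kubernetes also accepts percentages, which this ignores):

```java
/** Hypothetical helpers for rolling update capacity arithmetic. */
class RollingUpdateBounds {
    /** Upper bound on pods (old + new) that may exist during the rollout. */
    static int maxPodsDuringUpdate(int replicas, int maxSurge) {
        return replicas + maxSurge;
    }

    /** Lower bound on pods that must stay available during the rollout. */
    static int minAvailableDuringUpdate(int replicas, int maxUnavailable) {
        return replicas - maxUnavailable;
    }
}
```

For the manifest above that is at most 9 pods and at least 5 available. The cluster therefore needs headroom for three extra pods, and the database needs headroom for their connection pools.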
9. Metrics That Matter
@Service
public class MetricsService {
    private final Counter enrollmentsCounter;
    private final Timer enrollmentDuration;

    public MetricsService(MeterRegistry registry) {
        this.enrollmentsCounter = Counter.builder("subscription.enrollments")
            .tag("service", "subscription")
            .register(registry);
        this.enrollmentDuration = Timer.builder("subscription.enrollment.duration")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }
}
Monitor:
- Business: Enrollments/hour, conversion rate, revenue
- Technical: P50/P95/P99 latency, error rate, throughput
- Infrastructure: CPU, memory, connection pool utilization
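Percentiles are worth a moment of arithmetic, because they are the reason to look past averages. The sketch below computes a nearest-rank percentile over raw samples. It is a plain-Java illustration with a hypothetical class name; the Timer above publishes its own percentile values, so this only shows what the numbers mean.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over raw latency samples. */
class LatencyPercentiles {
    /**
     * Returns the smallest sample such that at least `percent` percent of
     * all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMs, int percent) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Integer form of ceil(percent * n / 100); rank is 1-based
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }
}
```

Take 100 requests where 98 finish in 20 ms and 2 take 900 ms. The mean is 37.6 ms and both P50 and P95 are 20 ms, so only P99 (900 ms) reveals that some customers are waiting almost a second.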
10. Testing Strategy
// Unit tests - fast, isolated
@Test
void shouldCalculateCorrectExpirationDate() {
    EnrollmentService service = new EnrollmentService(
        mock(CustomerClient.class),
        mock(PaymentClient.class),
        mock(SubscriptionRepository.class)
    );
    Instant enrolledAt = Instant.parse("2024-01-01T00:00:00Z");
    Instant expected = Instant.parse("2025-01-01T00:00:00Z");
    assertEquals(expected, service.calculateExpiration(enrolledAt, SubscriptionTier.ANNUAL));
}
// Integration tests - with real database
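@SpringBootTest
@Testcontainers
class SubscriptionServiceIntegrationTest {
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15");

    // Point Spring's datasource at the container; without this the test
    // would connect to whatever database the default configuration names
    @DynamicPropertySource
    static void datasourceProperties(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
    }

    // `service` is an injected bean and `request` a test fixture (setup omitted)
    @Test
    void shouldPersistAndRetrieveSubscription() {
        EnrollmentResult result = service.enroll(request);
        Optional<Subscription> retrieved = service.getSubscription(result.getSubscriptionId());
        assertTrue(retrieved.isPresent());
        assertEquals(SubscriptionTier.BASIC, retrieved.get().getTier());
    }
}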
Key Takeaways
After building and scaling microservices handling millions of requests:
- Design for failure - Circuit breakers, timeouts, and fallbacks are not optional
- Embrace eventual consistency - Not everything needs to be synchronous
- Observability is critical - You can't improve what you can't measure
- Cache aggressively - But invalidate intelligently
- Test at every level - Unit, integration, contract, and load tests
- Automate deployments - Including rollbacks and scaling
- Performance is a feature - Optimize from day one
What's Next?
In future posts, I'll dive deeper into:
- Advanced observability with distributed tracing
- Chaos engineering and resilience testing
- Cost optimization for cloud deployments
Questions or want to discuss these patterns? Leave a comment below or connect with me on LinkedIn.