Building Production-Ready Spring Boot Microservices: Lessons from Scaling to Millions

Feb 11 • 5 min read

Introduction

Over the past several years working in e-commerce, I've architected and scaled microservices handling millions of customer requests daily, including platforms processing over $100M in transactions. In this article, I'll share practical patterns for building Spring Boot microservices that can handle production scale.

This isn't theory: these approaches have held up under real production load.

1. Service Boundaries and Domain-Driven Design

The biggest mistake teams make is creating microservices that are too granular or tightly coupled.

Key principle: Each service should own a complete business capability.

// ❌ BAD: Services that are too granular
CustomerNameService
CustomerAddressService

// ✅ GOOD: Services that own complete domains
CustomerProfileService // owns all customer identity/profile data
SubscriptionService    // owns subscriptions, benefits, rewards

When building a subscription platform, a SubscriptionService should own the complete domain: enrollment, benefits tracking, payment integration, and conversions. This prevents the "distributed monolith" where you call 10 services for one operation.

2. Handling Distributed Transactions with Saga Pattern

At scale, distributed transactions become your enemy. Here's the orchestration approach:

@Service
@RequiredArgsConstructor // Lombok: generates the constructor for the final fields
public class SubscriptionEnrollmentOrchestrator {

    private final CustomerClient customerClient;
    private final PaymentClient paymentClient;
    private final SubscriptionRepository subscriptionRepository;
    private final EventPublisher eventPublisher; // application-defined publisher

    // Deliberately not @Transactional: a database transaction should not stay open
    // across remote calls, and it cannot roll back a payment anyway.
    public EnrollmentResult enrollMember(EnrollmentRequest request) {
        // Step 1: Validate customer (synchronous)
        CustomerProfile customer = customerClient.getCustomer(request.getCustomerId())
            .orElseThrow(() -> new CustomerNotFoundException());

        // Step 2: Process payment (synchronous with compensation)
        PaymentResult payment;
        try {
            payment = paymentClient.processPayment(request.getPaymentMethod());
        } catch (PaymentException e) {
            throw new EnrollmentFailedException("Payment failed", e);
        }

        // Step 3: Create subscription (local transaction inside save)
        Subscription subscription;
        try {
            subscription = createSubscription(customer, payment);
            subscriptionRepository.save(subscription);
        } catch (Exception e) {
            // Compensate payment (the refund must be idempotent so it can be retried)
            paymentClient.refundPayment(payment.getTransactionId());
            throw new EnrollmentFailedException("Subscription creation failed", e);
        }

        // Step 4: Publish event for async operations, only after the save has committed
        eventPublisher.publish(new SubscriptionEnrolledEvent(subscription));

        return EnrollmentResult.success(subscription);
    }
}
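The try/catch compensation above generalizes to any number of steps. The following is a minimal sketch, not the orchestrator used in production: `SagaStep` and `SagaRunner` are hypothetical names introduced here, and the sketch assumes every compensation is safe to run after its action has completed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One saga step: a forward action plus the compensation that undoes it. */
final class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

final class SagaRunner {
    /**
     * Runs the steps in order. If a step throws, the compensations of all
     * previously completed steps run in reverse order, then the failure is rethrown.
     */
    static void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step); // most recently completed step is on top
            } catch (RuntimeException failure) {
                while (!completed.isEmpty()) {
                    SagaStep done = completed.pop();
                    try {
                        done.compensation.run();
                    } catch (RuntimeException compensationFailure) {
                        // Keep compensating the remaining steps; record the error for follow-up.
                        failure.addSuppressed(compensationFailure);
                    }
                }
                throw failure;
            }
        }
    }
}
```

The failing step's own compensation is not run, because its action never completed. A compensation that itself fails is attached to the original exception rather than hiding it, so an operator can see which undo steps still need manual attention.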

Key lessons:

Make operations idempotent - Include transaction IDs so retries don't cause duplicates
Compensate explicitly - If step N fails, undo steps 1 through N-1
Separate critical path from eventual consistency - Payment is synchronous, emails are async
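To illustrate the idempotency lesson, here is a minimal in-memory sketch. `IdempotentExecutor` is a hypothetical helper, and the in-memory map is an assumption made to keep the example self-contained; a real service would more likely rely on a unique constraint in the database or an idempotency key supported by the payment provider.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Runs an operation at most once per idempotency key and replays the stored result on retries. */
final class IdempotentExecutor<R> {
    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    R execute(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent is atomic per key: concurrent retries with the same key
        // wait for the first call instead of running the operation again.
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

If the operation throws, nothing is stored, so a later retry with the same key runs it again. That is the behavior you want for a failed payment attempt, but it assumes the operation had no side effects before it failed.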

3. Strategic Caching for High-Traffic Systems

We reduced database load by 85% through strategic caching:

@Service
public class SubscriptionService {
    
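    private final SubscriptionRepository repository;
    private final LoadingCache<String, Optional<Subscription>> cache;
    
    public SubscriptionService(SubscriptionRepository repository) {
        this.repository = repository; // also used by updateSubscription below
        // Keyed by customer ID; a customer without a subscription is cached as Optional.empty()
        this.cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats()
            .build(CacheLoader.from(repository::findByCustomerId));
    }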
    
    @Transactional
    public Subscription updateSubscription(String id, SubscriptionUpdate update) {
        Subscription updated = repository.findById(id)
            .map(s -> applyUpdate(s, update))
            .orElseThrow(() -> new SubscriptionNotFoundException(id));
            
        repository.save(updated);
        cache.invalidate(updated.getCustomerId()); // Invalidate on write (this instance's cache only)
        
        return updated;
    }
}

Caching layers:

Application-level (Guava/Caffeine) - 5-minute TTL for hot data
Redis distributed cache - 15-minute TTL, shared so every instance reads the same cached data
Database with proper indexing
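The three layers above can be read as a lookup order: local cache first, shared cache second, database last. The sketch below shows that order using only the standard library. `TtlCache` and `LayeredLookup` are hypothetical names, the second `TtlCache` merely stands in for Redis, and the injected clock exists only so the expiry behavior can be exercised without waiting.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** A tiny expire-after-write cache; values must be non-null. */
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) { this.value = value; this.expiresAtMillis = expiresAtMillis; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // current time in milliseconds

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    Optional<V> get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) return Optional.empty();
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(key, entry); // drop the expired entry
            return Optional.empty();
        }
        return Optional.of(entry.value);
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(Objects.requireNonNull(value), clock.getAsLong() + ttlMillis));
    }

    void invalidate(K key) { entries.remove(key); }
}

/** Reads through local cache, then shared cache, then the database loader. */
final class LayeredLookup<K, V> {
    private final TtlCache<K, V> local;    // stands in for Guava/Caffeine
    private final TtlCache<K, V> shared;   // stands in for Redis
    private final Function<K, V> database; // the slow, authoritative source

    LayeredLookup(TtlCache<K, V> local, TtlCache<K, V> shared, Function<K, V> database) {
        this.local = local;
        this.shared = shared;
        this.database = database;
    }

    V get(K key) {
        Optional<V> localHit = local.get(key);
        if (localHit.isPresent()) return localHit.get();
        V value = shared.get(key).orElseGet(() -> {
            V loaded = database.apply(key);
            shared.put(key, loaded);
            return loaded;
        });
        local.put(key, value); // refill the faster layer on the way back
        return value;
    }

    /** On write, clear the shared layer first so a concurrent read cannot refill local from it. */
    void invalidate(K key) {
        shared.invalidate(key);
        local.invalidate(key);
    }
}
```

Note that `invalidate` only clears the local layer of the instance that handled the write. Other instances keep serving their local copy until its TTL expires, which is one reason to keep the local TTL shorter than the shared one.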

4. Circuit Breakers and Resilience

When calling other services, assume they will fail:

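@Slf4j
@Service
@RequiredArgsConstructor // Lombok: generates the constructor for the final fields
public class CustomerProfileClient {

    private final RestTemplate restTemplate;   // assumed to be built with a root URI
    private final CustomerCache customerCache; // application-defined cache of last known profiles
    
    @CircuitBreaker(name = "customer-profile", fallbackMethod = "getCustomerFallback")
    @TimeLimiter(name = "customer-profile")
    public CompletableFuture<CustomerProfile> getCustomer(String customerId) {
        return CompletableFuture.supplyAsync(() -> 
            // URI template variable instead of string concatenation, so the ID is encoded correctly
            restTemplate.getForObject("/api/customers/{id}", CustomerProfile.class, customerId)
        );
    }
    
    // Fallback: same parameters as the guarded method, plus the exception that triggered it
    private CompletableFuture<CustomerProfile> getCustomerFallback(String customerId, Exception ex) {
        log.warn("Customer service unavailable, using cached data", ex);
        return CompletableFuture.completedFuture(
            customerCache.get(customerId)
                .orElseThrow(() -> new ServiceUnavailableException("No cached data", ex))
        );
    }
}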

Configuration:

CircuitBreakerConfig.custom()
    .failureRateThreshold(50)          // Open if 50% fail
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(100)            // Consider last 100 calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))
    .build();
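To make the thresholds above concrete, here is a deliberately simplified circuit breaker written with the standard library only. It is not how Resilience4j is implemented: `SimpleCircuitBreaker` is a hypothetical class, it ignores slow calls, and in the half-open state it lets the first recorded outcome decide instead of sampling several trial calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // e.g. 100: consider the last 100 calls
    private final double failureRateThreshold; // percent, e.g. 50
    private final long waitInOpenMillis;       // e.g. 30_000
    private final LongSupplier clock;          // current time in milliseconds
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private int failures;
    private State state = State.CLOSED;
    private long openedAtMillis;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long waitInOpenMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
        this.clock = clock;
    }

    /** Returns true if a call may proceed right now. */
    synchronized boolean allowCall() {
        if (state == State.OPEN && clock.getAsLong() - openedAtMillis >= waitInOpenMillis) {
            state = State.HALF_OPEN; // wait is over: allow trial calls
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            close(); // the trial call succeeded
        } else {
            record(false);
        }
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) {
            open(); // the trial call failed: wait again
        } else {
            record(true);
        }
    }

    synchronized State state() { return state; }

    private void record(boolean failure) {
        outcomes.addLast(failure);
        if (failure) failures++;
        if (outcomes.size() > windowSize) {
            boolean evictedWasFailure = outcomes.removeFirst();
            if (evictedWasFailure) failures--;
        }
        // Only judge the failure rate once the window holds windowSize outcomes.
        boolean windowFull = outcomes.size() == windowSize;
        if (windowFull && failures * 100.0 / windowSize >= failureRateThreshold) open();
    }

    private void open() {
        state = State.OPEN;
        openedAtMillis = clock.getAsLong();
    }

    private void close() {
        state = State.CLOSED;
        outcomes.clear();
        failures = 0;
    }
}
```

A caller would check `allowCall()` before each remote call, skip straight to the fallback when it returns false, and otherwise report the outcome with `recordSuccess()` or `recordFailure()`.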

5. Structured Logging and Distributed Tracing

You can't fix what you can't see:

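// kv(...) is assumed to be StructuredArguments.kv from logstash-logback-encoder:
// import static net.logstash.logback.argument.StructuredArguments.kv;
@Slf4j
@RestController
public class SubscriptionController {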
    
    @PostMapping("/subscriptions")
    public ResponseEntity<SubscriptionResponse> enrollMember(
            @RequestBody EnrollmentRequest request,
            @RequestHeader("X-Request-ID") String requestId) {
        
        MDC.put("requestId", requestId);
        MDC.put("customerId", request.getCustomerId());
        
        try {
            log.info("Starting enrollment", 
                kv("tier", request.getTier()),
                kv("paymentMethod", request.getPaymentMethod().getType()));
            
            Instant start = Instant.now();
            EnrollmentResult result = enrollmentService.enroll(request);
            
            log.info("Enrollment completed",
                kv("subscriptionId", result.getSubscriptionId()),
                kv("durationMs", Duration.between(start, Instant.now()).toMillis()));
            
            return ResponseEntity.ok(toResponse(result));
        } finally {
            MDC.clear();
        }
    }
}

Logging standards:

Use structured key-value pairs, not string concatenation
Track request IDs across service boundaries
Include customer ID and operation type in every log
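Tracking a request ID across service boundaries comes down to two rules: reuse the caller's ID when one arrives, and attach the same ID to every outgoing call. A minimal sketch follows, assuming headers are available as a plain map; `RequestIds` is a hypothetical helper, and the UUID fallback is an assumption about what to do when the header is missing.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

final class RequestIds {
    static final String HEADER = "X-Request-ID";

    /** Reuses the caller's ID when present and non-blank, otherwise generates a new one. */
    static String resolve(Map<String, String> incomingHeaders) {
        // Note: this lookup is case-sensitive, while real HTTP header names are not.
        return Optional.ofNullable(incomingHeaders.get(HEADER))
            .map(String::trim)
            .filter(id -> !id.isEmpty())
            .orElseGet(() -> UUID.randomUUID().toString());
    }

    /** Headers to attach to every outgoing call made while handling this request. */
    static Map<String, String> outgoingHeaders(String requestId) {
        return Map.of(HEADER, requestId);
    }
}
```

Generating an ID at the edge instead of rejecting the request means a client that forgets the header still produces traceable logs.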

6. Database Query Optimization

Slow queries kill scalability:

@Entity
@Table(name = "subscriptions", indexes = {
    @Index(name = "idx_customer_id", columnList = "customer_id"),
    @Index(name = "idx_tier_status", columnList = "tier, status"),
    @Index(name = "idx_expires_at", columnList = "expires_at")
})
public class Subscription {
    // Use @BatchSize to prevent N+1 queries
    @OneToMany(mappedBy = "subscription", fetch = FetchType.LAZY)
    @BatchSize(size = 50)
    private List<Benefit> benefits;
}

@Repository
public interface SubscriptionRepository extends JpaRepository<Subscription, String> {
    
    // Custom query with fetch join to avoid N+1
    @Query("SELECT s FROM Subscription s LEFT JOIN FETCH s.benefits WHERE s.customerId = :customerId")
    Optional<Subscription> findByCustomerIdWithBenefits(@Param("customerId") String customerId);
}

Performance lessons:

Index the columns you filter, join, and sort by (every extra index also slows writes)
Use batch fetching to prevent N+1 queries
Paginate large results
Monitor and alert on queries >100ms
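The effect of batch fetching on N+1 is easiest to see as arithmetic. The sketch below is a rough estimate, assuming every parent's collection is accessed and every batch is filled; actual counts depend on the Hibernate version and its batch-fetch style. `QueryCountEstimate` is a hypothetical helper.

```java
final class QueryCountEstimate {
    /** One query for the parents, then one per parent to load its lazy collection. */
    static long lazyWithoutBatching(long parents) {
        return 1 + parents;
    }

    /** One query for the parents, then one per full or partial batch of collections. */
    static long withBatchSize(long parents, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        return 1 + (parents + batchSize - 1) / batchSize; // ceiling division
    }

    /** A single query when the collection is loaded with JOIN FETCH. */
    static long withFetchJoin() {
        return 1;
    }
}
```

For 1,000 subscriptions, that is 1,001 queries with plain lazy loading, about 21 with a batch size of 50, and 1 with a fetch join.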

7. Connection Pool Tuning

spring:
  datasource:
    hikari:
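      # Formula: (core_count * 2) + effective_spindle_count, counting the database server's cores
      # For 4-core with SSD: (4 * 2) + 1 = 9, rounded up to 10 here
      maximum-pool-size: 10
      minimum-idle: 5
      connection-timeout: 3000          # ms to wait for a connection from the pool
      validation-timeout: 1000          # ms, must be lower than connection-timeout
      leak-detection-threshold: 60000   # ms a connection may be held before a leak is logged
      max-lifetime: 1800000             # 30 minutes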
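The sizing formula in the comment can be written out as a small calculation. `PoolSizing` is a hypothetical helper; the second method is an added sanity check, on the assumption that every service instance opens its own pool against the same database.

```java
final class PoolSizing {
    /** (core_count * 2) + effective_spindle_count, as quoted in the configuration comment. */
    static int recommendedPoolSize(int coreCount, int effectiveSpindleCount) {
        if (coreCount < 1 || effectiveSpindleCount < 0) {
            throw new IllegalArgumentException("coreCount must be >= 1 and spindle count >= 0");
        }
        return coreCount * 2 + effectiveSpindleCount;
    }

    /** Connections the database must accept when every instance fills its pool. */
    static int totalConnections(int instances, int maximumPoolSize) {
        return instances * maximumPoolSize;
    }
}
```

For example, `recommendedPoolSize(4, 1)` gives 9, and six instances with a pool of 10 can open up to 60 connections, which has to stay below the database's connection limit.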

8. Zero-Downtime Deployments

# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: subscription-service
spec:
  # Fragment: selector, pod labels, and the container image are omitted for brevity
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 3        # Add 50% extra during update
      maxUnavailable: 1  # Keep 5/6 running
  template:
    spec:
      containers:
      - name: subscription-service
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
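The `maxSurge` and `maxUnavailable` values translate into hard bounds on the pod count during a rollout. The sketch below works them out; `RollingUpdateBounds` is a hypothetical helper. The percentage variants follow the Kubernetes rule that a `maxSurge` percentage is rounded up and a `maxUnavailable` percentage is rounded down.

```java
final class RollingUpdateBounds {
    /** Absolute maxSurge from a percentage of replicas: rounded up. */
    static int surgeFromPercent(int replicas, int percent) {
        return (replicas * percent + 99) / 100;
    }

    /** Absolute maxUnavailable from a percentage of replicas: rounded down. */
    static int unavailableFromPercent(int replicas, int percent) {
        return replicas * percent / 100;
    }

    /** Upper bound on pods that exist at once during the rollout. */
    static int maxPodsDuringUpdate(int replicas, int maxSurge) {
        return replicas + maxSurge;
    }

    /** Lower bound on pods that stay available during the rollout. */
    static int minAvailableDuringUpdate(int replicas, int maxUnavailable) {
        return replicas - maxUnavailable;
    }
}
```

With the manifest above (6 replicas, `maxSurge: 3`, `maxUnavailable: 1`), at most 9 pods run at once and at least 5 stay available, so the cluster and the database both need headroom for 9 instances.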

9. Metrics That Matter

@Service
public class MetricsService {
    
    private final Counter enrollmentsCounter;
    private final Timer enrollmentDuration;
    
    public MetricsService(MeterRegistry registry) {
        this.enrollmentsCounter = Counter.builder("subscription.enrollments")
            .tag("service", "subscription")
            .register(registry);
            
        this.enrollmentDuration = Timer.builder("subscription.enrollment.duration")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }
}

Monitor:

Business: Enrollments/hour, conversion rate, revenue
Technical: P50/P95/P99 latency, error rate, throughput
Infrastructure: CPU, memory, connection pool utilization
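To show why P95 and P99 matter more than an average, here is a nearest-rank percentile over raw latency samples. This is an illustration only: `LatencyPercentiles` is a hypothetical helper, and Micrometer computes published percentiles from histograms, not by sorting every sample.

```java
import java.util.Arrays;

final class LatencyPercentiles {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) throw new IllegalArgumentException("no samples");
        if (p <= 0 || p > 100) throw new IllegalArgumentException("p must be in (0, 100]");
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

For the samples 10, 20, 30, 40, and 1000 ms, the P50 is 30 ms while the P99 is 1000 ms. The mean of 220 ms describes neither the typical request nor the slow one.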

10. Testing Strategy

// Unit tests - fast, isolated
@Test
void shouldCalculateCorrectExpirationDate() {
    EnrollmentService service = new EnrollmentService(
        mock(CustomerClient.class),
        mock(PaymentClient.class),
        mock(SubscriptionRepository.class)
    );
    
    Instant enrolledAt = Instant.parse("2024-01-01T00:00:00Z");
    Instant expected = Instant.parse("2025-01-01T00:00:00Z");
    
    assertEquals(expected, service.calculateExpiration(enrolledAt, SubscriptionTier.ANNUAL));
}

// Integration tests - with real database
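@SpringBootTest
@Testcontainers
class SubscriptionServiceIntegrationTest {
    
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15");

    // Point Spring's datasource at the container started above
    @DynamicPropertySource
    static void datasourceProperties(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
    }
    
    @Test
    void shouldPersistAndRetrieveSubscription() {
        // service and request are fixtures not shown here: the autowired service
        // under test and a BASIC-tier EnrollmentRequest
        EnrollmentResult result = service.enroll(request);
        Optional<Subscription> retrieved = service.getSubscription(result.getSubscriptionId());
        
        assertTrue(retrieved.isPresent());
        assertEquals(SubscriptionTier.BASIC, retrieved.get().getTier());
    }
}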
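The unit test above expects an annual subscription enrolled on 2024-01-01 to expire on 2025-01-01, which is 366 days later because 2024 is a leap year. A sketch of a calculation that satisfies it is shown below. The real `calculateExpiration` is not shown in this post, so `ExpirationCalculator`, the `MONTHLY` tier, and the choice of UTC are all assumptions.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Hypothetical tiers for this sketch; only ANNUAL appears in the unit test above. */
enum SubscriptionTier { MONTHLY, ANNUAL }

final class ExpirationCalculator {
    /** Adds one calendar period in UTC, so leap years and month lengths are handled by java.time. */
    static Instant calculateExpiration(Instant enrolledAt, SubscriptionTier tier) {
        ZonedDateTime start = enrolledAt.atZone(ZoneOffset.UTC);
        switch (tier) {
            case ANNUAL:
                return start.plusYears(1).toInstant();
            case MONTHLY:
                return start.plusMonths(1).toInstant();
            default:
                throw new IllegalArgumentException("Unsupported tier: " + tier);
        }
    }
}
```

Adding a fixed `Duration.ofDays(365)` instead would make the test fail by one day, which is exactly the kind of edge case a fast unit test is good at catching.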

Key Takeaways

After building and scaling microservices handling millions of requests:

Design for failure - Circuit breakers, timeouts, and fallbacks are not optional
Embrace eventual consistency - Not everything needs to be synchronous
Observability is critical - You can't improve what you can't measure
Cache aggressively - But invalidate intelligently
Test at every level - Unit, integration, contract, and load tests
Automate deployments - Including rollbacks and scaling
Performance is a feature - Optimize from day one

What's Next?

In future posts, I'll dive deeper into:

Advanced observability with distributed tracing
Chaos engineering and resilience testing
Cost optimization for cloud deployments

Questions or want to discuss these patterns? Leave a comment below or connect with me on LinkedIn.

Tags: #spring-boot #microservices #java #system-design #aws #architecture #scalability #production-engineering

3 Comments

VAMSI THATIKONDA

Software Engineer with 20+ years of experience building scalable platforms and leading high-performi...

Lena Carter · Answer 1 · 2026-02-12T16:43:00+0000

Nice point on separating critical paths from eventual consistency, Vamsi, makes me wonder how you decide what can safely be async in a fast-moving system?

stjam · Answer 2 · 2026-02-18T17:24:08+0000

Thank you for sharing your experience

buildbasekit · Answer 3 · 2026-06-25T17:35:49+0000

I like that you called out observability as a first-class concern instead of treating it as something to add later.

In my experience, teams often spend weeks optimizing code paths only to realize they don't have enough metrics or tracing to identify the real bottleneck. Having request IDs, structured logs, and latency metrics from the start makes performance tuning much more data-driven.

Nice collection of practical lessons!
