At 2:17 AM, the dashboard wasn’t moving.
Transactions were still coming in, but the numbers on the screen were frozen. The system was running, but the data was already 10 minutes behind.
At that scale, 10 minutes wasn't just a delay; it meant decisions were being made on outdated information.
That was the moment everything had to change.
The Problem: When Data Arrives Too Late
In many systems, data pipelines are built around batch processing or synchronous APIs. They work — until they don’t.
As traffic grows, these systems start to break in predictable ways:
- Increasing latency
- Bottlenecks in database reads
- Inconsistent data across services
- Delayed analytics and reporting
The core issue is simple:
the system reacts too late.
To solve this, the architecture had to shift from request-driven to event-driven.
The Shift: Thinking in Events
Instead of asking for data, services react to events as they happen.
Every transaction becomes an event. Every state change is recorded as a fact.
This changes everything:
- Systems become asynchronous
- Components are loosely coupled
- Scalability comes from partitioning, not vertical scaling
The goal was clear:
Process more than 15,000 events per second with sub-50ms latency — reliably.
Designing the Pipeline
The first version of the system was built as a distributed architecture using open-source tools.
Core components:
- Apache Kafka for event streaming
- Kafka Streams for real-time processing
- Spring Boot for processing services
- PostgreSQL for durable storage
- Redis for low-latency reads
- Prometheus and Grafana for observability
Event Flow
The pipeline follows a simple but powerful structure:
Producer → Kafka → Stream Processing → Storage → Analytics
- Producers publish transaction events
- Events are serialized and validated
- Kafka distributes them across partitions
- Stream processors handle transformations and aggregations
- Results are stored and made available for real-time queries
This allowed continuous processing instead of waiting for batches.
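As a sketch of the first step, here is a minimal Java producer publishing a transaction event. The broker address, topic name, key, and payload are illustrative, not the production values:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by account id routes every event for one account to the
            // same partition, preserving per-account ordering.
            String key = "account-42"; // hypothetical key
            String value = "{\"accountId\":\"account-42\",\"amount\":19.99}"; // hypothetical payload
            producer.send(new ProducerRecord<>("transactions", key, value));
        } // close() flushes any buffered records
    }
}
```

The choice of key matters: partition assignment is derived from it, so it decides both the ordering guarantees you get and how evenly load spreads across partitions.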
The Challenges (Where Systems Usually Break)
Designing the pipeline was only part of the work. Making it reliable at scale required solving real issues:
Event Duplication
Ensuring exactly-once processing required transactional guarantees and careful handling of offsets.
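With Kafka Streams, both concerns can be delegated to the framework via its exactly-once guarantee. A minimal sketch, assuming Kafka 3.x; the application id, broker address, and topic names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class ExactlyOnceApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "txn-pipeline");      // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Exactly-once v2: Streams wraps processing, state updates, and offset
        // commits in one Kafka transaction; read_committed consumers see no duplicates.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        // Trivial pass-through topology; real transformations go here.
        builder.stream("transactions").to("transactions.processed");

        new KafkaStreams(builder.build(), props).start();
    }
}
```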
Latency Spikes
Under heavy load, consumer lag increased. This was mitigated with parallel consumers and optimized batching.
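A sketch of the parallel-consumer side: several consumers in one group, each on its own thread, so Kafka spreads partitions across them. The broker address, group id, thread count, and tuning values are assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelConsumers {
    public static void main(String[] args) {
        int workers = 8; // useful only up to the topic's partition count
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                Properties props = new Properties();
                props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
                props.put(ConsumerConfig.GROUP_ID_CONFIG, "txn-processors"); // same group: partitions split across consumers
                props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
                props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
                props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000"); // larger poll batches under load
                props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "65536"); // let the broker batch fetches
                // One KafkaConsumer per thread: the client is not thread-safe.
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("transactions"));
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                        records.forEach(rec -> { /* transform / aggregate here */ });
                    }
                }
            });
        }
    }
}
```

Consumers beyond the partition count sit idle, so the topic's partitioning sets the parallelism ceiling.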
Throughput Optimization
Producer batching (32 KB) and Snappy compression significantly improved throughput.
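In producer terms, that comes down to a few settings added to the configuration from the producer sketch above. The 32 KB batch size and Snappy come from the measurements here; linger.ms is an assumed value:

```java
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);      // 32 KB batches
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);               // assumed: wait briefly so batches fill
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // compress whole batches on the wire
```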
Resource Bottlenecks
Connection pooling and efficient serialization reduced pressure on downstream systems.
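The pool itself isn't named here, but HikariCP (Spring Boot's default) is a natural fit; a minimal sketch with assumed connection details:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PostgresPool {
    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:postgresql://localhost:5432/events"); // assumed URL
        cfg.setUsername("pipeline");                               // assumed credentials
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(20); // cap concurrent Postgres connections
        cfg.setMinimumIdle(5);      // keep warm connections for bursty load
        return new HikariDataSource(cfg);
    }
}
```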
These are the kinds of problems that don’t show up in diagrams — only in real systems.
Results
After multiple iterations and optimizations, the system achieved:
- Throughput: 15,000+ events per second
- Latency (P99): under 50ms
- Availability: 99.95%
- Data loss: 0% (exactly-once processing)
More importantly, this enabled near real-time anomaly detection.
What used to take minutes could now happen in milliseconds.
Moving to AWS: Same Architecture, Less Infrastructure
Once the system was stable, the next step was moving to the cloud.
The key decision was not to redesign the system, only to swap self-managed infrastructure for managed services.
Cloud Architecture:
- Event ingestion via Amazon EventBridge or Amazon MSK
- Processing with AWS Lambda
- Workflow orchestration using AWS Step Functions
- Storage with DynamoDB or Amazon RDS
- Monitoring through Amazon CloudWatch
Each component maps directly to the original architecture.
The logic stays the same. The operational overhead disappears.
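As an illustration of that mapping, a Lambda handler for an MSK trigger could take the place of the stream processors; the class name and logging are placeholders:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KafkaEvent;

import java.util.Base64;

public class TransactionHandler implements RequestHandler<KafkaEvent, Void> {
    @Override
    public Void handleRequest(KafkaEvent event, Context ctx) {
        // Records arrive batched per topic-partition, values base64-encoded.
        event.getRecords().values().forEach(records ->
            records.forEach(rec -> {
                String payload = new String(Base64.getDecoder().decode(rec.getValue()));
                // The same transformation logic as the Kafka Streams version runs here.
                ctx.getLogger().log("event: " + payload);
            }));
        return null;
    }
}
```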
The Key Insight
The most important lesson wasn’t about tools.
It was about understanding the system.
Event-driven architectures are built on a few core principles:
- Events are immutable
- Systems react asynchronously
- Scalability comes from partitioned streams
- State is derived from event logs
Once these are clear, moving between technologies becomes straightforward.
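To make the last principle concrete: in Kafka Streams, local state is just a materialized view of the log. A topology fragment for the Streams application sketched earlier (topic and store names are illustrative):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class DerivedState {
    static void buildTopology(StreamsBuilder builder) {
        // A per-account transaction count, materialized from the event log.
        // The log is the only source of truth: this store can always be
        // rebuilt by replaying "transactions" from the beginning.
        KTable<String, Long> txnCounts = builder
            .stream("transactions", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            .count(Materialized.as("txn-counts")); // named, queryable state store
    }
}
```

Because the state is derived rather than authoritative, swapping the technology that holds it is a rebuild, not a migration.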
Final Thoughts
Real-time systems are not about speed alone.
They are about enabling better decisions.
When data arrives late, the system might still work — but the business doesn’t.
Designing event-driven pipelines is ultimately about reducing the gap between what happens and what the system knows.
Design, therefore I exist.