From “It Works” to Production: Fixing NDVI Rendering, Surviving WAF Blocks, and Stabilizing a Failing Server
Most backend systems don’t fail because of obvious bugs.
They fail quietly—through misleading data, unstable infrastructure, and assumptions that don’t hold under load.
This week, I pushed an NDVI (Normalized Difference Vegetation Index) pipeline through a full transition:
from “it renders images” to a production-grade, resilient system.
This article breaks down what actually went wrong—and how each fix improved the system holistically.
1. The Illusion of Correctness: When “Green” Lies
The first issue looked harmless:
NDVI raster PNGs were rendering—but everything was green.
Root Cause
The normalization formula:
normalized = (ndvi + 1.0) / 2.0
This assumes NDVI values span [-1, 1].
But real-world farm data was tighter:
NDVI range: 0.42 → 0.63
After normalization:
0.71 → 0.815, landing entirely in the “green zone” of the colormap
So the system was technically correct—but visually misleading.
2. The Fix: Dual Normalization Strategy
Instead of replacing one approach with another, I introduced two modes:
Histogram-Based Normalization (Default)
normalized = (ndvi - ndvi_min) / (ndvi_max - ndvi_min)
- Expands contrast per image
- Reveals actual variation
- Best for analysis
Fixed Normalization (Legacy)
normalized = (ndvi + 1.0) / 2.0
- Keeps consistency across images
- Useful for comparisons
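A minimal sketch of both modes, assuming the raster arrives as a NumPy array (the function name and signature are mine, not the pipeline’s):

import numpy as np

def normalize_ndvi(ndvi: np.ndarray, mode: str = "histogram") -> np.ndarray:
    """Scale NDVI values into [0, 1] for colormap rendering."""
    if mode == "histogram":
        # Per-image stretch: reveals variation within this raster.
        ndvi_min, ndvi_max = float(ndvi.min()), float(ndvi.max())
        if ndvi_max == ndvi_min:
            return np.zeros_like(ndvi)  # flat raster; avoid divide-by-zero
        return (ndvi - ndvi_min) / (ndvi_max - ndvi_min)
    # Fixed (legacy): assumes the full theoretical range of [-1, 1].
    return (ndvi + 1.0) / 2.0

With the farm data above, histogram mode stretches 0.42–0.63 across the full color ramp, while fixed mode compresses it into 0.71–0.815.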
Why This Matters
This isn’t just a rendering fix—it’s a data integrity decision:
Never force a single interpretation when the data has multiple valid contexts.
3. External Systems Don’t Like You (STAC WAF Blocks)
The next issue wasn’t in my system—it was in how it behaved externally.
Symptom
- Requests to Copernicus STAC API started failing
- Response: empty HTML or “Request Rejected”
Classic WAF (Web Application Firewall) behavior.
Root Cause
The system behaved like a bot:
- Rapid consecutive requests
- Predictable timing patterns
The Fix: Controlled Request Behavior
time.sleep(base_interval + random.uniform(jitter_min, jitter_max))
Implemented:
- Minimum delay between requests
- Random jitter (1–5 seconds)
- Graceful handling of invalid responses
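A sketch of the full pattern, assuming the requests library (the constants, URL handling, and function name are illustrative, not the project’s actual code):

import random
import time
from typing import Optional

import requests

BASE_INTERVAL = 2.0        # minimum seconds between requests
JITTER_RANGE = (1.0, 5.0)  # random jitter window, in seconds

def fetch_stac_page(url: str, params: dict) -> Optional[dict]:
    """Fetch one STAC page while pacing requests to avoid WAF triggers."""
    time.sleep(BASE_INTERVAL + random.uniform(*JITTER_RANGE))
    response = requests.get(url, params=params, timeout=30)
    # WAF rejections often arrive as HTML ("Request Rejected") rather than
    # JSON, so validate the content type before parsing.
    if "application/json" not in response.headers.get("Content-Type", ""):
        return None  # soft failure; the caller can back off and retry
    response.raise_for_status()
    return response.json()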
Result
The system stopped being “blocked” and started being tolerated.
This is a key distributed systems lesson:
You don’t just integrate with APIs—you adapt to their defensive behavior.
4. Architecture Maturity: Preparing for Redis Streams
The NDVI pipeline originally used Celery for async jobs.
Instead of rewriting everything, I introduced a routing abstraction:
if settings.NDVI_QUEUE_BACKEND == "celery":
    dispatch_celery_job(...)
elif settings.NDVI_QUEUE_BACKEND == "stream":
    raise NotImplementedError
Why This Is Important
This small change enables:
- Seamless migration to Redis Streams
- Zero disruption to current workflows
- Incremental architecture evolution
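When the stream backend lands, the NotImplementedError branch can give way to something like this sketch, built on redis-py’s XADD (the stream name and payload shape are assumptions, not the project’s actual schema):

import json

import redis

r = redis.Redis()  # connection details are deployment-specific

def dispatch_stream_job(scene_id: str, bbox: list[float]) -> None:
    """Publish an NDVI job to a Redis Stream instead of a Celery queue."""
    r.xadd(
        "ndvi:jobs",  # assumed stream name
        {"payload": json.dumps({"scene_id": scene_id, "bbox": bbox})},
    )

Consumers can then read via consumer groups (XREADGROUP), which preserves the at-least-once delivery semantics Celery currently provides.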
Principle
The best migrations don’t start with new systems—they start with abstractions.
5. The Real Problem: Server Crashes Every 10 Hours
This was the most critical issue.
Symptoms
- System crashes after ~10 hours
- No single obvious failure point
Root Cause
Unbounded resource usage across multiple services:
- MySQL buffer pool growing unchecked
- PHP-FPM workers accumulating memory
- Redis using unlimited memory
- Celery workers never restarting
- Monitoring stack filling disk
6. The Fix: System-Wide Resource Governance
Instead of patching one component, I enforced limits everywhere:
Database
innodb_buffer_pool_size = 512M
max_connections = 50
PHP-FPM
pm.max_children = 15
pm.max_requests = 500
Redis
mem_limit: 512mb               # container-level cap (docker-compose)
maxmemory-policy allkeys-lru   # evict least-recently-used keys at the limit (redis.conf)
Celery
worker_max_tasks_per_child = 100
worker_max_memory_per_child = 512000 # KB
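These are standard Celery settings; a minimal sketch of wiring them into the app (the app name and broker URL are placeholders):

from celery import Celery

app = Celery("ndvi_pipeline", broker="redis://localhost:6379/0")

app.conf.update(
    worker_max_tasks_per_child=100,      # recycle each worker after 100 tasks
    worker_max_memory_per_child=512000,  # recycle above ~512 MB (value in KB)
)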
Observability Stack
- Prometheus: 7-day retention
- Loki: compaction + retention limits
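For reference, Prometheus retention is usually set via a startup flag (Loki’s equivalent lives in its compactor and limits config):

--storage.tsdb.retention.time=7d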
Result
The system moved from eventual failure → controlled lifecycle stability
7. Observability and Documentation (The Often Ignored Layer)
Alongside code changes:
- Added alerts for system health
- Documented all NDVI configuration settings
- Created deployment + rollback guides
- Expanded test coverage
Why This Matters
Without this layer:
- Fixes don’t scale
- Bugs repeat
- Systems become tribal knowledge
8. Security Mistake (Worth Mentioning)
During the process, a sudo password was exposed in plain text.
No system is “production-ready” if secrets are handled casually.
- Never expose credentials in logs or chats
- Rotate immediately if leaked
Move toward:
- environment-based configs
- secret management systems
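A minimal sketch of the environment-based approach (the variable name is illustrative):

import os

# Fail fast if the secret is missing instead of silently falling back
# to a hardcoded value.
db_password = os.environ["NDVI_DB_PASSWORD"]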
Final Thoughts
This wasn’t about fixing a bug.
It was about addressing three deeper problems:
- Data correctness (NDVI visualization)
- System behavior under external constraints (WAF)
- Infrastructure stability (resource limits)
Before:
“The system works.”
After:
“The system is predictable, resilient, and ready for scale.”
Key Lessons
- Correct output is not always truthful output
- External APIs require behavioral adaptation
- Systems fail through unbounded growth, not sudden errors
- Architecture evolves through abstraction, not rewrites
- Stability is a system-wide property, not a single fix
If you’re building data pipelines or distributed systems, this pattern will repeat.
The earlier you design for it, the less painful it becomes.