NGINX Proxy Buffers: Why Your Kubeflow Deployment Keeps Crashing (And How to Fix It)
How we debugged a crashlooping NGINX Ingress Controller and learned the hard way about buffer configuration for MLOps and regular API workloads
The Problem
It was supposed to be a routine upgrade.
Instead, while we were deploying Kubeflow on MAOS (Multi-tenant Application Orchestration System), a single NGINX setting brought down our entire multi-tenant Kubernetes platform.
We had just upgraded to ingress-nginx Helm chart 4.14.0. Everything looked normal… until the controller started crashlooping.
Error: exit status 1
2025/11/20 20:11:38 [emerg] 36#36: "proxy_busy_buffers_size" must be less than
the size of all "proxy_buffers" minus one buffer in /tmp/nginx/nginx-cfg2318361464:401
nginx: [emerg] "proxy_busy_buffers_size" must be less than the size of all
"proxy_buffers" minus one buffer in /tmp/nginx/nginx-cfg2318361464:401
nginx: configuration file /tmp/nginx/nginx-cfg2318361464 test failed
The controller would start, attempt to reload its configuration, fail validation, and crash. Rinse and repeat. Our entire ingress layer was down — taking Kubeflow, ArgoCD, our monitoring stack and client-facing APIs with it.
The frustrating part? Our configuration looked perfectly reasonable:
proxy-buffer-size: "32k"
proxy-buffers: "8 32k"
proxy-busy-buffers-size: "192k"
Let me walk you through what we learned about NGINX proxy buffers, why this matters for MLOps and regular API workloads, and how to configure them properly.
Understanding NGINX Proxy Buffers
When NGINX proxies a request to an upstream service, it doesn't stream the response directly to the client. It first stores it in memory using buffers. This allows NGINX to:
- Free up the backend connection faster
- Handle slow clients better
- Apply caching, rate limits, and transformations
The main players
- proxy_buffer_size → Buffer for response headers
- proxy_buffers → Buffers for the response body
- proxy_busy_buffers_size → Max memory in use while sending to the client
- client_body_buffer_size → Buffer for the inbound request body
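In ingress-nginx, these directives are driven by ConfigMap keys (or the matching per-Ingress annotations we'll use later). A minimal sketch with illustrative values, assuming the usual ingress-nginx behavior where proxy-buffers-number and proxy-buffer-size together render the proxy_buffers directive:
proxy-buffer-size: "8k"          # → proxy_buffer_size 8k (response headers; also the per-buffer size)
proxy-buffers-number: "4"        # → proxy_buffers 4 8k (response body)
proxy-busy-buffers-size: "16k"   # → proxy_busy_buffers_size 16k (busy sending to the client)
client-body-buffer-size: "8k"    # → client_body_buffer_size 8k (inbound request body)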
The critical NGINX rule
proxy_busy_buffers_size < (total proxy_buffers - one buffer)
NGINX needs at least one free buffer while other buffers are sending data to the client.
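For contrast, here is a configuration that clearly violates the rule (hypothetical values, same key-to-directive mapping as the sketch above):
proxy-buffers-number: "2"
proxy-buffer-size: "64k"         # 2 × 64k = 128k total; minus one buffer = 64k
proxy-busy-buffers-size: "96k"   # 96k is not less than 64k → nginx rejects the config at reload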
The math that looked right
proxy-buffers: "8 32k" # 8 × 32k = 256k total
proxy-busy-buffers-size: 192k # 192k < 256k − 32k (one buffer) = 224k
By simple math (192k sits comfortably below the 224k limit), this should have passed validation.
But NGINX doesn't only check the total of proxy_buffers: the validation also involves proxy_buffer_size and how it relates to the per-buffer size in proxy_buffers.
In our case both were set to 32k, and that overlap, combined with what the controller actually rendered, hit a known validation conflict in ingress-nginx where the effective values no longer satisfy the rule.
So although our math looked correct on paper, the configuration NGINX actually tested was invalid and was rejected at load time.
This is documented (and complained about) in this issue: 👉 https://github.com/kubernetes/ingress-nginx/issues/13598
MLOps Workloads vs Regular APIs
This isn't just a Kubeflow problem.
Regular APIs
- Request body: 1KB – 5MB
- Response body: 1KB – 500KB
- Headers: 1-4KB (Auth tokens, tracing, cookies)
Even "normal" microservices can break with the wrong buffers when:
- Returning large JSON responses
- Using big headers (JWT, OAuth, cookies)
- Uploading files
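As we'll get to below, the fix for cases like the big-header one is a single annotation on that API's own Ingress rather than a global change. A hedged sketch (illustrative value; ingress-nginx's documented default proxy-buffer-size is 4k):
annotations:
  # Raise only the header buffer for a JWT/cookie-heavy API
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"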
MLOps / Kubeflow / MLflow
- Model uploads: 100MB – 10GB
- Notebook files: 1–50MB
- Pipelines: 500KB – 5MB
- Artifacts: 1GB+
This is where default settings absolutely break without tuning.
But here's the key mistake: We tried to tune buffers globally.
The Trap: Global Configuration
We originally put this in the NGINX controller:
config:
  proxy-buffer-size: "32k"
  proxy-buffers: "8 32k"
  proxy-busy-buffers-size: "192k"
Why this is dangerous
- Helm chart defaults override parts of your config
- Every connection gets these buffers (even /healthz)
- Different apps need different memory strategies
- One mistake can break the entire ingress
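Part of the problem is that what you write isn't what nginx tests: every key you don't set is filled in by the chart and controller defaults, and the combined result is only validated at reload time. A hypothetical sketch of the values side:
controller:
  config:
    proxy-buffer-size: "16k"   # the one key we meant to change
    # proxy-buffers-number and proxy-busy-buffers-size are left unset here,
    # so the controller falls back to its own defaults, and nginx validates the mix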
What we wanted vs what we got:
# Expected
proxy-buffer-size: "16k"
proxy-buffers: "4 64k"
# Actual (from ConfigMap)
proxy-buffer-size: "32k"
proxy-buffers: "8 32k"
proxy-busy-buffers-size: "192k"
Boom. CrashLoop.
The Real Solution: Per-Ingress Configuration
Instead of global tuning, move buffer settings to each Ingress object.
Kubeflow / ML workloads
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "10G"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
  nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
  nginx.ingress.kubernetes.io/proxy-busy-buffers-size: "128k"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
  nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
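For context, this is roughly where those annotations live on a complete Ingress object. The annotation values are the ones above; the name, namespace, host, and backend service are placeholders for your own environment:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow-gateway             # hypothetical name
  namespace: kubeflow
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
    nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
    nginx.ingress.kubernetes.io/proxy-busy-buffers-size: "128k"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
spec:
  ingressClassName: nginx
  rules:
    - host: kubeflow.example.com     # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: istio-ingressgateway   # adjust to your Kubeflow entry point
                port:
                  number: 80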
ArgoCD
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "100m"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
  nginx.ingress.kubernetes.io/ssl-passthrough: "true"
Regular APIs
annotations:
  kubernetes.io/ingress.class: nginx
  # Defaults are enough
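Side note: the kubernetes.io/ingress.class annotation is deprecated on current Kubernetes versions; the same intent is usually expressed with the ingressClassName field instead:
spec:
  ingressClassName: nginx
  # no buffer annotations needed; controller defaults apply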
Now each service defines what it actually needs — no more over-tuning or global risk.
Why This Works Better
| Benefit | Why it matters |
|---|---|
| Safer | No global crash risk |
| Efficient | Only heavy apps use big memory |
| Isolated | One service can't kill the platform |
| Readable | Ingress = documentation |
| Upgrade safe | Helm defaults can change freely |
Memory Example
| Strategy | Memory usage |
|---|---|
| Global buffers | ~32MB / 1000 connections |
| Per-ingress buffers | ~6–9MB total |
| Default only | ~4MB |
That's a massive improvement.
Final Production Setup
NGINX controller
controller:
  config:
    use-forwarded-headers: "true"         # trust X-Forwarded-* headers from the load balancer
    enable-real-ip: "true"                # restore real client IPs (real_ip module)
    proxy-real-ip-cidr: <VPC CIDR>
    client-body-buffer-size: "128k"       # request bodies up to 128k stay in memory
    large-client-header-buffers: "4 32k"  # headroom for large JWTs and cookies
Kubeflow
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "10G"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
ArgoCD
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "100m"
Normal apps
✅ No change
Lessons Learned
- ✅ Defaults are smarter than you think
- ✅ Global buffer tuning is dangerous
- ✅ MLOps and APIs need different profiles
- ✅ Granularity = stability
Closing
Samuel, Founder & CTO of MAOS (Multi-tenant Application Orchestration System), is building a Kubernetes-as-a-Service platform that provisions dedicated, production-ready EKS clusters with full MLOps stacks, GitOps, monitoring, and scaling built in from day one.
If you're building or operating a platform for multiple teams and tenants, this kind of failure is inevitable — unless you design for isolation, intent, and granularity from the start.
Follow the journey or contribute at maosproject.io
Join the MAOS waitlist
Get early access to dedicated EKS clusters with GitOps, monitoring, and MLOps built in