NGINX Proxy Buffers: Why Your Kubeflow Deployment Keeps Crashing (And How to Fix It)

How we debugged a crashlooping NGINX Ingress Controller and learned the hard way about buffer configuration for MLOps and regular API workloads

Samuel Kung'u·Nov 20, 2024·12 min
#kubernetes #nginx #mlops #platform-engineering

The Problem

It was supposed to be a routine upgrade.

Instead, a single NGINX setting brought down our entire multi-tenant Kubernetes platform while deploying Kubeflow on MAOS (Multi-tenant Application Orchestration System).

We had just upgraded to ingress-nginx 4.14.0. Everything looked normal… until the controller started crashlooping.

Error: exit status 1
2025/11/20 20:11:38 [emerg] 36#36: "proxy_busy_buffers_size" must be less than 
the size of all "proxy_buffers" minus one buffer in /tmp/nginx/nginx-cfg2318361464:401
nginx: [emerg] "proxy_busy_buffers_size" must be less than the size of all 
"proxy_buffers" minus one buffer in /tmp/nginx/nginx-cfg2318361464:401
nginx: configuration file /tmp/nginx/nginx-cfg2318361464 test failed

The controller would start, attempt to reload its configuration, fail validation, and crash. Rinse and repeat. Our entire ingress layer was down — taking Kubeflow, ArgoCD, our monitoring stack and client-facing APIs with it.

The frustrating part? Our configuration looked perfectly reasonable:

proxy-buffer-size: "32k"
proxy-buffers: "8 32k"
proxy-busy-buffers-size: "192k"

Let me walk you through what we learned about NGINX proxy buffers, why this matters for MLOps and regular API workloads, and how to configure them properly.

Understanding NGINX Proxy Buffers

When NGINX proxies a request to an upstream service, it doesn't stream the response directly to the client. It first stores it in memory using buffers. This allows NGINX to:

  1. Free up the backend connection faster
  2. Handle slow clients better
  3. Apply caching, rate limits, and transformations

The main players

  • proxy_buffer_size → Buffer for response headers
  • proxy_buffers → Buffers for response body
  • proxy_busy_buffers_size → Max memory in use while sending to the client
  • client_body_buffer_size → Buffer for inbound request body
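
In ingress-nginx, these directives are usually set through the controller ConfigMap (or the Helm chart's controller.config block). Below is a minimal sketch using the documented ConfigMap keys; the values are purely illustrative, and, as we'll get to shortly, we eventually stopped tuning most of them globally.

controller:
  config:
    proxy-buffer-size: "16k"          # rendered as proxy_buffer_size
    proxy-buffers-number: "4"         # how many proxy_buffers each connection may use
    proxy-busy-buffers-size: "32k"    # rendered as proxy_busy_buffers_size
    client-body-buffer-size: "64k"    # rendered as client_body_buffer_size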

The critical NGINX rule

proxy_busy_buffers_size < (total proxy_buffers - one buffer)

NGINX needs at least one free buffer available while the others are busy sending data to the client. There is a companion rule in the other direction, too: proxy_busy_buffers_size must be at least as large as the bigger of proxy_buffer_size and a single proxy_buffers buffer.

The math that looked right

proxy-buffers: "8 32k"         # 8 × 32k = 256k total
proxy-busy-buffers-size: 192k  # 256k - 32k = 224k available, and 192k < 224k

By that arithmetic, 256k of buffers minus one 32k buffer leaves 224k of headroom, and 192k sits comfortably below it, so this should have passed validation.

But NGINX doesn't only check the total. As noted above, it also validates proxy_busy_buffers_size against proxy_buffer_size, and the values that actually land in the rendered nginx.conf come from a mix of your ConfigMap keys, chart defaults, and the controller's own templating, not just the lines you wrote.

In our case, proxy_buffer_size and the proxy_buffers unit were both 32k, and the combination the controller ultimately rendered from those inputs and its own defaults no longer satisfied the "minus one buffer" rule.

So although our math looked correct, the configuration that reached the validator was invalid and was rejected at load time.

This is documented (and complained about) in this issue: 👉 https://github.com/kubernetes/ingress-nginx/issues/13598
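
To make both constraints concrete, here's a combination that satisfies them, under the assumption (worth verifying against your controller version) that ingress-nginx sizes each proxy_buffers buffer from proxy-buffer-size and takes the count from proxy-buffers-number. The numbers are illustrative only.

proxy-buffer-size: "16k"          # one buffer = 16k
proxy-buffers-number: "8"         # 8 × 16k = 128k total
proxy-busy-buffers-size: "64k"    # must be >= 16k and < 128k - 16k = 112k, so 64k fits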

MLOps Workloads vs Regular APIs

This isn't just a Kubeflow problem.

Regular APIs

  • Request body: 1KB – 5MB
  • Response body: 1KB – 500KB
  • Headers: 1-4KB (Auth tokens, tracing, cookies)

Even "normal" microservices can break with the wrong buffers when:

  • Returning large JSON responses
  • Using big headers (JWT, OAuth, cookies)
  • Uploading files

MLOps / Kubeflow / MLflow

  • Model uploads: 100MB – 10GB
  • Notebook files: 1–50MB
  • Pipelines: 500KB – 5MB
  • Artifacts: 1GB+

This is where default settings absolutely break without tuning.

But here's the key mistake: We tried to tune buffers globally.

The Trap: Global Configuration

We originally put this in the NGINX controller:

config:
  proxy-buffer-size: "32k"
  proxy-buffers: "8 32k"
  proxy-busy-buffers-size: "192k"

Why this is dangerous

  1. Helm chart defaults override parts of your config
  2. Every connection gets these buffers (even /healthz)
  3. Different apps need different memory strategies
  4. One mistake can break the entire ingress

What we wanted vs what we got:

# Expected
proxy-buffer-size: "16k"
proxy-buffers: "4 64k"

# Actual (from ConfigMap)
proxy-buffer-size: "32k"
proxy-buffers: "8 32k"
proxy-busy-buffers-size: "192k"

Boom. CrashLoop.

The Real Solution: Per-Ingress Configuration

Instead of global tuning, move buffer settings to each Ingress object.

Kubeflow / ML workloads

annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "10G"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
  nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
  nginx.ingress.kubernetes.io/proxy-busy-buffers-size: "128k"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
  nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
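
For context, here's what a complete Ingress carrying these annotations might look like. The name, namespace, host, and backend service are hypothetical placeholders; Kubeflow distributions differ in what they expose (many front traffic with an Istio gateway), so point the backend at whatever your setup uses.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow
  namespace: kubeflow                                      # placeholder
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
    nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
    nginx.ingress.kubernetes.io/proxy-busy-buffers-size: "128k"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
spec:
  ingressClassName: nginx
  rules:
    - host: kubeflow.example.com                           # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kubeflow-gateway                     # placeholder backend service
                port:
                  number: 80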

ArgoCD

annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "100m"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
  nginx.ingress.kubernetes.io/ssl-passthrough: "true"
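
One caveat: the ssl-passthrough annotation only takes effect if the controller itself runs with the --enable-ssl-passthrough flag. Here's a fuller sketch with a hypothetical host; argocd-server is the service name a standard Argo CD install creates.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd                                        # typical namespace
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: argocd.example.com                             # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  number: 443                              # TLS is terminated by Argo CD itself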

Regular APIs

annotations:
  kubernetes.io/ingress.class: nginx   # legacy annotation; newer clusters use spec.ingressClassName instead
  # Defaults are enough

Now each service defines what it actually needs — no more over-tuning or global risk.

Why This Works Better

Benefit        Why it matters
Safer          No global crash risk
Efficient      Only heavy apps use big memory
Isolated       One service can't kill the platform
Readable       Ingress = documentation
Upgrade safe   Helm defaults can change freely

Memory Example

Strategy              Memory usage
Global buffers        ~32MB / 1000 connections
Per-ingress buffers   ~6–9MB total
Default only          ~4MB

That's a massive improvement.

Final Production Setup

NGINX controller

controller:
  config:
    use-forwarded-headers: "true"
    enable-real-ip: "true"
    proxy-real-ip-cidr: <VPC CIDR>
    client-body-buffer-size: "128k"
    large-client-header-buffers: "4 32k"

Kubeflow

annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "10G"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"

ArgoCD

nginx.ingress.kubernetes.io/proxy-body-size: "100m"

Normal apps

No change

Lessons Learned

  • ✅ Defaults are smarter than you think
  • ✅ Global buffer tuning is dangerous
  • ✅ MLOps and APIs need different profiles
  • ✅ Granularity = stability

Closing

Samuel, Founder & CTO of MAOS (Multi-tenant Application Orchestration System), is building a Kubernetes-as-a-Service platform that provisions dedicated, production-ready EKS clusters with full MLOps stacks, GitOps, monitoring, and scaling built in from day one.

If you're building or operating a platform for multiple teams and tenants, this kind of failure is inevitable — unless you design for isolation, intent, and granularity from the start.

Follow the journey or contribute at maosproject.io


Building a platform?

MAOS automates infrastructure provisioning with battle-tested configurations for multi-tenant Kubernetes platforms.

Join the MAOS waitlist

Get early access to dedicated EKS clusters with GitOps, monitoring, and MLOps built in