NGINX Proxy Buffers: Why Your Kubeflow Deployment Keeps Crashing (And How to Fix It)

How we debugged a crashlooping NGINX Ingress Controller and learned the hard way about buffer configuration for MLOps and regular API workloads

Samuel Kung'u·Nov 20, 2024·12 min
#kubernetes #nginx #mlops #platform-engineering

The Problem

It was supposed to be a routine upgrade.

Instead, a single NGINX setting brought down our entire multi-tenant Kubernetes platform while deploying Kubeflow on MAOS (Multi-tenant Application Orchestration System).

We had just upgraded to ingress-nginx 4.14.0. Everything looked normal… until the controller started crashlooping.

Error: exit status 1
2025/11/20 20:11:38 [emerg] 36#36: "proxy_busy_buffers_size" must be less than 
the size of all "proxy_buffers" minus one buffer in /tmp/nginx/nginx-cfg2318361464:401
nginx: [emerg] "proxy_busy_buffers_size" must be less than the size of all 
"proxy_buffers" minus one buffer in /tmp/nginx/nginx-cfg2318361464:401
nginx: configuration file /tmp/nginx/nginx-cfg2318361464 test failed

The controller would start, attempt to reload its configuration, fail validation, and crash. Rinse and repeat. Our entire ingress layer was down — taking Kubeflow, ArgoCD, our monitoring stack and client-facing APIs with it.

The frustrating part? Our configuration looked perfectly reasonable:

proxy-buffer-size: "32k"
proxy-buffers: "8 32k"
proxy-busy-buffers-size: "192k"

Let me walk you through what we learned about NGINX proxy buffers, why this matters for MLOps and regular API workloads, and how to configure them properly.

Understanding NGINX Proxy Buffers

When NGINX proxies a request to an upstream service, it doesn't stream the response directly to the client. It first stores it in memory using buffers. This allows NGINX to:

  1. Free up the backend connection faster
  2. Handle slow clients better
  3. Apply caching, rate limits, and transformations

The main players

  • proxy_buffer_size → Buffer for response headers
  • proxy_buffers → Buffers for response body
  • proxy_busy_buffers_size → Max memory in use while sending to the client
  • client_body_buffer_size → Buffer for inbound request body
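
In ingress-nginx, these directives are usually set through the controller ConfigMap (or the Helm chart's controller.config block). Below is a minimal sketch using the documented ConfigMap keys; the values are purely illustrative, and, as we'll get to shortly, we eventually stopped tuning most of them globally.

controller:
  config:
    proxy-buffer-size: "16k"          # rendered as proxy_buffer_size
    proxy-buffers-number: "4"         # how many proxy_buffers each connection may use
    proxy-busy-buffers-size: "32k"    # rendered as proxy_busy_buffers_size
    client-body-buffer-size: "64k"    # rendered as client_body_buffer_size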

The critical NGINX rule

proxy_busy_buffers_size < (total proxy_buffers - one buffer)

NGINX needs at least one free buffer available while the others are busy sending data to the client. There is a companion rule in the other direction, too: proxy_busy_buffers_size must be at least as large as the bigger of proxy_buffer_size and a single proxy_buffers buffer.

The math that looked right

proxy-buffers: "8 32k"         # 8 × 32k = 256k total
proxy-busy-buffers-size: 192k  # 256k - 32k = 224k available, and 192k < 224k

By that arithmetic, 256k of buffers minus one 32k buffer leaves 224k of headroom, and 192k sits comfortably below it, so this should have passed validation.

But NGINX doesn't only check the total. As noted above, it also validates proxy_busy_buffers_size against proxy_buffer_size, and the values that actually land in the rendered nginx.conf come from a mix of your ConfigMap keys, chart defaults, and the controller's own templating, not just the lines you wrote.

In our case, proxy_buffer_size and the proxy_buffers unit were both 32k, and the combination the controller ultimately rendered from those inputs and its own defaults no longer satisfied the "minus one buffer" rule.

So although our math looked correct, the configuration that reached the validator was invalid and was rejected at load time.

This is documented (and complained about) in this issue: 👉 https://github.com/kubernetes/ingress-nginx/issues/13598
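
To make both constraints concrete, here's a combination that satisfies them, under the assumption (worth verifying against your controller version) that ingress-nginx sizes each proxy_buffers buffer from proxy-buffer-size and takes the count from proxy-buffers-number. The numbers are illustrative only.

proxy-buffer-size: "16k"          # one buffer = 16k
proxy-buffers-number: "8"         # 8 × 16k = 128k total
proxy-busy-buffers-size: "64k"    # must be >= 16k and < 128k - 16k = 112k, so 64k fits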

MLOps Workloads vs Regular APIs

This isn't just a Kubeflow problem.

Regular APIs

  • Request body: 1KB – 5MB
  • Response body: 1KB – 500KB
  • Headers: 1-4KB (Auth tokens, tracing, cookies)

Even "normal" microservices can break with the wrong buffers when:

  • Returning large JSON responses
  • Using big headers (JWT, OAuth, cookies)
  • Uploading files

MLOps / Kubeflow / MLflow

  • Model uploads: 100MB – 10GB
  • Notebook files: 1–50MB
  • Pipelines: 500KB – 5MB
  • Artifacts: 1GB+

This is where default settings absolutely break without tuning.

But here's the key mistake: We tried to tune buffers globally.

The Trap: Global Configuration

We originally put this in the NGINX controller:

config:
  proxy-buffer-size: "32k"
  proxy-buffers: "8 32k"
  proxy-busy-buffers-size: "192k"

Why this is dangerous

  1. Helm chart defaults override parts of your config
  2. Every connection gets these buffers (even /healthz)
  3. Different apps need different memory strategies
  4. One mistake can break the entire ingress

What we wanted vs what we got:

# Expected
proxy-buffer-size: "16k"
proxy-buffers: "4 64k"

# Actual (from ConfigMap)
proxy-buffer-size: "32k"
proxy-buffers: "8 32k"
proxy-busy-buffers-size: "192k"

Boom. CrashLoop.

The Real Solution: Per-Ingress Configuration

Instead of global tuning, move buffer settings to each Ingress object.

Kubeflow / ML workloads

annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "10G"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
  nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
  nginx.ingress.kubernetes.io/proxy-busy-buffers-size: "128k"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
  nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
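
For context, here's what a complete Ingress carrying these annotations might look like. The name, namespace, host, and backend service are hypothetical placeholders; Kubeflow distributions differ in what they expose (many front traffic with an Istio gateway), so point the backend at whatever your setup uses.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow
  namespace: kubeflow                                      # placeholder
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10G"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
    nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
    nginx.ingress.kubernetes.io/proxy-busy-buffers-size: "128k"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
spec:
  ingressClassName: nginx
  rules:
    - host: kubeflow.example.com                           # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kubeflow-gateway                     # placeholder backend service
                port:
                  number: 80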

ArgoCD

annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "100m"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
  nginx.ingress.kubernetes.io/ssl-passthrough: "true"
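
One caveat: the ssl-passthrough annotation only takes effect if the controller itself runs with the --enable-ssl-passthrough flag. Here's a fuller sketch with a hypothetical host; argocd-server is the service name a standard Argo CD install creates.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd                                        # typical namespace
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: argocd.example.com                             # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  number: 443                              # TLS is terminated by Argo CD itself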

Regular APIs

annotations:
  kubernetes.io/ingress.class: nginx   # legacy annotation; newer clusters use spec.ingressClassName instead
  # Defaults are enough

Now each service defines what it actually needs — no more over-tuning or global risk.

Why This Works Better

Benefit        Why it matters
Safer          No global crash risk
Efficient      Only heavy apps use big memory
Isolated       One service can't kill the platform
Readable       Ingress = documentation
Upgrade safe   Helm defaults can change freely

Memory Example

Strategy              Memory usage
Global buffers        ~32MB / 1000 connections
Per-ingress buffers   ~6–9MB total
Default only          ~4MB

That's a massive improvement.

Final Production Setup

NGINX controller

controller:
  config:
    use-forwarded-headers: "true"
    enable-real-ip: "true"
    proxy-real-ip-cidr: <VPC CIDR>
    client-body-buffer-size: "128k"
    large-client-header-buffers: "4 32k"

Kubeflow

annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "10G"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"

ArgoCD

nginx.ingress.kubernetes.io/proxy-body-size: "100m"

Normal apps

No change

Lessons Learned

  • ✅ Defaults are smarter than you think
  • ✅ Global buffer tuning is dangerous
  • ✅ MLOps and APIs need different profiles
  • ✅ Granularity = stability

Closing

Samuel, Founder & CTO of MAOS (Multi-tenant Application Orchestration System), is building a Kubernetes-as-a-Service platform that provisions dedicated, production-ready EKS clusters with full MLOps stacks, GitOps, monitoring, and scaling built in from day one.

If you're building or operating a platform for multiple teams and tenants, this kind of failure is inevitable — unless you design for isolation, intent, and granularity from the start.

Follow the journey or contribute at maosproject.io


Building a platform?

MAOS automates infrastructure provisioning with battle-tested configurations for multi-tenant Kubernetes platforms.

Join the MAOS waitlist

Get early access to dedicated EKS clusters with GitOps, monitoring, and MLOps built in