Managing for Bursty Load
- Adjust `min_workers`: This changes the number of managed inactive workers, increasing capacity for high peaks.
- Check `max_workers`: Ensure this parameter is set high enough for the serverless engine to create the necessary number of workers.
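As a minimal sketch, the bursty-load settings above might be expressed as a configuration payload like the following. The parameter names come from this document; the `bursty_config` dict and the commented-out update call are illustrative assumptions, not your platform's actual API:

```python
# Hypothetical scaling configuration for a bursty workload.
# `min_workers` keeps a pool of managed inactive workers ready for peaks;
# `max_workers` caps how many workers the serverless engine may create.
bursty_config = {
    "min_workers": 4,   # warm pool absorbs sudden spikes
    "max_workers": 32,  # headroom for the engine during peak load
}

# Illustrative only -- substitute your platform's real update call, e.g.:
# client.update_endpoint(endpoint_id, autoscaling=bursty_config)
assert bursty_config["max_workers"] >= bursty_config["min_workers"]
```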
Managing for Low Demand or Idle Periods
- Adjust `min_load`: Reducing `min_load` lowers the minimum number of active workers. Set it to 1 to reach the minimum of 1 active worker, or set it to 0 to allow all workers to move into inactive states.
- Adjust `min_workers`: This changes the number of managed inactive workers.
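The effect of `min_load` on active worker count can be sketched as a simple floor, mirroring the behavior described above (this is an illustration of the rule, not the engine's actual algorithm; `demand_based` stands in for whatever count the autoscaler would otherwise choose):

```python
def active_workers(demand_based: int, min_load: int) -> int:
    """Sketch: the autoscaler never drops below min_load active workers.

    min_load = 1 keeps at least one worker active; min_load = 0 permits
    every worker to move to an inactive state during idle periods.
    """
    return max(min_load, demand_based)

print(active_workers(demand_based=0, min_load=1))  # quiet period, floor of 1
print(active_workers(demand_based=0, min_load=0))  # all workers may go inactive
```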
Scaling to Zero
To allow your endpoint to fully scale to zero during idle periods, configure `inactivity_timeout` alongside your other scaling parameters. The `inactivity_timeout` value (in seconds) determines how long the endpoint must be idle before scaling down is permitted.
- To scale to zero active workers (while keeping cold workers available): set `min_load = 0` and configure a positive `inactivity_timeout`. Workers in the `cold_workers` pool will remain available for fast reactivation.
- To scale to zero total workers: set `min_load = 0`, `cold_workers = 0`, and configure a positive `inactivity_timeout`. This minimizes cost during extended idle periods but incurs cold-start latency when traffic resumes.
- To prevent scaling to zero regardless of other settings: set `inactivity_timeout` to a negative value (e.g., `-1`).
Setting `inactivity_timeout` to 0 disables inactivity-based gating entirely; the endpoint will rely solely on normal autoscaling decisions.
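The three `inactivity_timeout` regimes above can be summarized in a small sketch. This models only the gate described in this section (negative never permits, 0 defers to normal autoscaling, positive requires sufficient idle time); it is not the engine's actual implementation:

```python
def may_scale_to_zero(inactivity_timeout: int, idle_seconds: float) -> bool:
    """Sketch of inactivity-based gating for scale-to-zero decisions."""
    if inactivity_timeout < 0:
        return False  # scaling to zero is never permitted
    if inactivity_timeout == 0:
        return True   # gating disabled; normal autoscaling decides alone
    return idle_seconds >= inactivity_timeout  # permitted after enough idle time

print(may_scale_to_zero(inactivity_timeout=-1, idle_seconds=9999))  # False
print(may_scale_to_zero(inactivity_timeout=300, idle_seconds=120))  # False
print(may_scale_to_zero(inactivity_timeout=300, idle_seconds=600))  # True
```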
Managing Queue Time
Use `max_queue_time` and `target_queue_time` to control how the autoscaler responds to request queuing:
- Increase `max_queue_time` to allow more requests to buffer on each worker before the system holds them in the global queue. This is useful for workloads with predictable, longer processing times.
- Decrease `target_queue_time` to trigger more aggressive scale-up when queue times rise, reducing latency at the cost of potentially higher worker counts.
- Increase `target_queue_time` to tolerate higher queue times before scaling up, reducing costs when some latency is acceptable.