Scheduling Resources Requests Limits QoS and Autoscaling

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/kubernetes/08 Scheduling Resources Requests Limits QoS and Autoscaling.md.

Purpose: Explain how Kubernetes places Pods, reserves resources, enforces limits, prioritizes workloads, spreads risk, and scales Pods or nodes under real production constraints.

Scheduling starts after a controller (see 03 Deployments ReplicaSets StatefulSets DaemonSets Jobs and CronJobs) creates a Pod from the primitives described in 02 Containers Pods and Workload Primitives. The scheduler filters out nodes that cannot run the Pod, scores the feasible nodes, and binds the Pod to the best one. The kubelet on that node then pulls images, mounts volumes, starts the containers under cgroup-enforced limits, and reports status.

Requests and limits

Requests are scheduling reservations. Limits are runtime ceilings. A Pod is scheduled on its effective request: for each resource, the larger of the sum of its app container requests and the largest single ordinary init container request, plus Pod overhead when a RuntimeClass defines it.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payments-api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/payments-api:2.8.1
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 512Mi

| Resource | Request behavior | Limit behavior | Production guidance |
| --- | --- | --- | --- |
| CPU | Used by scheduler and autoscalers as reserved CPU | Throttles CPU time when exceeded | Set requests from observed steady usage plus headroom. Avoid CPU limits for latency-sensitive services unless required. |
| Memory | Used by scheduler as reserved memory | Container can be killed with OOMKilled when exceeded | Set memory requests and limits for most app workloads. |
| Ephemeral storage | Used by scheduler if requested | Pod can be evicted when exceeded | Set for log-heavy, cache-heavy, and batch workloads. |
| Extended resources | Advertised by device plugins | Must equal the request when both are set | Use for GPU, FPGA, DPU, and vendor-specific devices. |
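
The scheduling rule for requests can be made concrete with a small calculation. This is a minimal sketch, assuming the commonly documented rule for ordinary init containers (they run one at a time, so only the largest one counts) and ignoring restartable sidecar-style init containers. The function name and the sidecar and init container sizes are illustrative, not part of any Kubernetes API.

```python
def effective_pod_request(app_requests, init_requests=(), overhead=0):
    """Amount of one resource the scheduler must find on a node for a Pod.

    app_requests:  requests of the regular containers, which run concurrently.
    init_requests: requests of ordinary init containers, which run one at a time.
    overhead:      Pod overhead declared by the RuntimeClass, if any.
    All values share one unit, for example CPU millicores.
    """
    steady_state = sum(app_requests)
    startup_peak = max(init_requests, default=0)
    return max(steady_state, startup_peak) + overhead


# The 250m api container plus a hypothetical 100m sidecar and 500m init container.
print(effective_pod_request([250, 100], init_requests=[500]))  # 500
```

Here the init container, not the steady-state sum of 350m, decides how much CPU the node must have free.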

CPU units:

  • 1000m equals one CPU: one physical core or one virtual core (vCPU), depending on the node.
  • 250m equals one quarter of a core.
  • CPU is compressible, so exceeding available CPU usually slows work rather than killing it.

Memory units:

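  • Mi and Gi are binary units (1Mi = 1,048,576 bytes). M and G are decimal units (1M = 1,000,000 bytes), so 512M is smaller than 512Mi.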
  • Memory is not compressible from Kubernetes' point of view. Exceeding a cgroup limit can kill the container.
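
To compare or sum quantities it helps to normalize them first. The helpers below are a sketch that covers only the suffixes used in this note plus the common decimal ones; the function names are hypothetical, and the real Kubernetes quantity grammar accepts more forms than this.

```python
from decimal import Decimal

BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """Convert '250m', '1', or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert '256Mi', '1Gi', '500M', or a plain byte count to bytes."""
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)


print(cpu_to_millicores("250m"), cpu_to_millicores("0.5"))  # 250 500
print(memory_to_bytes("256Mi"))  # 268435456
```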

CPU throttling and OOMKilled

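CPU throttling happens when a container wants more CPU than its limit allows. The limit is enforced as a quota of CPU time per scheduling period (100 ms by default), so a short multi-threaded burst can exhaust the quota early and stall until the next period. This can cause high latency while dashboards still show low average CPU.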

kubectl top pod -n prod
kubectl describe pod -n prod <pod>
kubectl get pod -n prod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

Signals to inspect:

  • Container metrics: CPU usage, throttled periods, throttled seconds.
  • App metrics: latency, queue depth, request timeout, worker backlog.
  • Node metrics: CPU saturation and noisy neighbors.
  • Deployment settings: CPU limit relative to request and peak demand.
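
A toy model shows why throttling hurts latency even when average CPU looks modest. The sketch assumes a quota-per-period scheme with a 100 ms period, perfectly parallel work, and a request that starts at the beginning of a period; it is a simplification for intuition, not a reproduction of the kernel scheduler.

```python
def latency_under_quota(cpu_work_ms, threads, limit_cores, period_ms=100.0):
    """Wall-clock milliseconds needed to finish cpu_work_ms of CPU time."""
    if limit_cores <= 0 or threads <= 0:
        raise ValueError("limit_cores and threads must be positive")
    # CPU time usable per period: capped by the limit and by available threads.
    quota_ms = min(limit_cores, threads) * period_ms
    wall_ms = 0.0
    remaining = float(cpu_work_ms)
    while remaining > 0:
        burst = min(remaining, quota_ms)  # CPU time consumed in this period
        remaining -= burst
        if remaining > 0:
            wall_ms += period_ms          # quota spent: stall until the next period
        else:
            wall_ms += burst / threads    # final slice finishes without a stall
    return wall_ms


# A request needing 200 ms of CPU across 4 threads.
print(latency_under_quota(200, threads=4, limit_cores=4))    # 50.0
print(latency_under_quota(200, threads=4, limit_cores=0.5))  # 312.5
```

With a 500m limit the same request takes about six times longer, yet one such request per second averages only 0.2 cores, well under the limit.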

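OOMKilled means the kernel's OOM killer terminated a container process, either because the container exceeded its memory cgroup limit or because the node itself ran out of memory and the kernel picked that process. Kubelet node-pressure eviction is a separate mechanism: it terminates whole Pods and reports them as Evicted rather than OOMKilled.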

Debug sequence:

kubectl describe pod -n prod <pod>
kubectl logs -n prod <pod> -c <container> --previous
kubectl get events -n prod --field-selector involvedObject.name=<pod> --sort-by=.lastTimestamp
kubectl top pod -n prod <pod> --containers

Fixes:

  • Reduce memory spikes in application code.
  • Increase memory request and limit based on observed peak plus headroom.
  • Split batch work into smaller chunks.
  • Move caches to bounded sizes.
  • Check sidecars and init containers, not just the main container.

QoS classes

Kubernetes assigns Pod QoS from resource specifications. QoS affects eviction order under node pressure.

| QoS | Requirements | Eviction priority | Notes |
| --- | --- | --- | --- |
| Guaranteed | Every container has CPU and memory requests equal to its limits | Last among normal Pods | Strongest node-pressure protection but can reduce bin packing flexibility. |
| Burstable | At least one container has a CPU or memory request or limit, but the Pod does not meet the Guaranteed criteria | Middle | Common production default. |
| BestEffort | No container has CPU or memory requests or limits | First | Avoid for production workloads. |

Example Guaranteed Pod (every container in the Pod must be set this way):

resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "1"
    memory: 1Gi

Example Burstable Pod:

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi
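
The classification rules can be sketched as a small function. It assumes quantities are already normalized to the same string form (so "1" and "1000m" would need converting first) and considers only CPU and memory; the function name and input shape are illustrative.

```python
QOS_RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a Pod from a list of {'requests': {...}, 'limits': {...}} dicts."""
    any_set = False
    all_equal = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in QOS_RESOURCES:
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_equal = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_equal else "Burstable"


guaranteed = {"requests": {"cpu": "1", "memory": "1Gi"},
              "limits": {"cpu": "1", "memory": "1Gi"}}
burstable = {"requests": {"cpu": "250m", "memory": "256Mi"},
             "limits": {"memory": "512Mi"}}
print(qos_class([guaranteed]), qos_class([burstable]), qos_class([{}]))
```

Adding one unconfigured sidecar to an otherwise Guaranteed Pod drops the whole Pod to Burstable, which is why sidecars deserve the same review as the main container.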

LimitRange and ResourceQuota

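LimitRange sets namespace defaults, minimums, maximums, and limit-to-request ratios for individual containers, Pods, or PVCs. ResourceQuota caps aggregate usage in a namespace. When a quota covers a compute resource, Pods that omit the matching request or limit are rejected unless a LimitRange supplies a default.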

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: prod
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        memory: 256Mi
      min:
        cpu: 25m
        memory: 64Mi
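      max:
        # With max.cpu set and no default.cpu, containers that omit a CPU limit
        # are defaulted to a CPU limit equal to this max.
        cpu: "2"
        memory: 4Gi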
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-compute
  namespace: prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.memory: 160Gi
    pods: "200"

Commands:

kubectl describe limitrange -n prod
kubectl describe resourcequota -n prod
kubectl get resourcequota -n prod -o yaml

Tradeoffs:

| Control | Benefit | Risk |
| --- | --- | --- |
| LimitRange defaults | Prevents BestEffort Pods by accident | Bad defaults hide missing app-specific sizing. |
| LimitRange max | Stops runaway specs | Blocks legitimate large jobs unless exceptions exist. |
| ResourceQuota | Protects shared clusters and budgets | Can block deploys if teams do not monitor quota headroom. |
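
Quota headroom is simple arithmetic that is worth checking before raising replica counts. The sketch below uses the prod-compute limits from the manifest above, converted to millicores and MiB; the current usage figures are invented for illustration, and the function name is hypothetical.

```python
def quota_headroom(hard, used, replicas, per_pod):
    """Remaining quota per key after adding `replicas` Pods. Negative means rejected."""
    return {
        key: hard[key] - used.get(key, 0) - replicas * per_pod.get(key, 0)
        for key in hard
    }


hard = {"requests.cpu": 40_000, "requests.memory": 81_920,
        "limits.memory": 163_840, "pods": 200}
used = {"requests.cpu": 34_000, "requests.memory": 60_000,   # assumed usage
        "limits.memory": 120_000, "pods": 150}
per_pod = {"requests.cpu": 250, "requests.memory": 256,
           "limits.memory": 512, "pods": 1}

headroom = quota_headroom(hard, used, replicas=30, per_pod=per_pod)
blocked = [key for key, left in headroom.items() if left < 0]
print(blocked)  # ['requests.cpu']
```

In this made-up namespace, memory and Pod count have room for 30 more replicas, but the CPU request quota would reject part of the scale-up.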

PriorityClass and preemption

PriorityClass tells the scheduler which Pods matter more. If no feasible node exists, a high-priority Pod may preempt lower-priority Pods.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-api
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical user-facing APIs."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      priorityClassName: critical-api

Guidance:

  • Use few priority tiers. Too many values become meaningless.
  • Reserve very high priorities for cluster-critical and business-critical services.
  • Preemption is not instant capacity creation. Victim Pods need termination time, PDBs may constrain disruption, and replacement Pods may cause more pressure.
  • For non-preempting priority, set preemptionPolicy: Never. The Pod still moves ahead of lower-priority Pods in the scheduling queue but never evicts them.

Node selectors and node affinity

nodeSelector is exact-match placement. Node affinity adds expressive rules and soft preferences.

spec:
  nodeSelector:
    workload.example.com/pool: general

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m7i.large"]

Rules:

  • Required rules are hard filters at scheduling time.
  • Preferred rules affect scoring only.
  • IgnoredDuringExecution means running Pods are not evicted if node labels change later.
  • Keep node labels governed. Placement based on ad hoc labels becomes fragile.
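
The difference between required and preferred rules can be sketched as two functions over node labels. This is an illustration of the matching semantics only: it covers the In, NotIn, Exists, and DoesNotExist operators, returns a raw weight sum rather than the scheduler's normalized score, and all function names are hypothetical.

```python
def matches(labels, expr):
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {op}")


def term_matches(labels, term):
    # Expressions inside one term are ANDed.
    return all(matches(labels, e) for e in term["matchExpressions"])


def passes_required(labels, node_selector_terms):
    # Terms are ORed: one matching term makes the node feasible.
    return any(term_matches(labels, t) for t in node_selector_terms)


def preferred_weight(labels, preferred):
    return sum(p["weight"] for p in preferred if term_matches(labels, p["preference"]))


REQUIRED = [{"matchExpressions": [{"key": "topology.kubernetes.io/zone",
                                   "operator": "In",
                                   "values": ["us-east-1a", "us-east-1b"]}]}]
PREFERRED = [{"weight": 50,
              "preference": {"matchExpressions": [
                  {"key": "node.kubernetes.io/instance-type",
                   "operator": "In", "values": ["m7i.large"]}]}}]

node = {"topology.kubernetes.io/zone": "us-east-1b",
        "node.kubernetes.io/instance-type": "m6i.large"}
print(passes_required(node, REQUIRED), preferred_weight(node, PREFERRED))  # True 0
```

The node above is feasible because it satisfies the hard zone rule; it only misses the scoring bonus for the preferred instance type.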

Pod affinity and anti-affinity

Pod affinity places Pods near Pods with selected labels. Pod anti-affinity keeps Pods apart.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: payments-api
          topologyKey: kubernetes.io/hostname

Use cases:

| Rule | Use | Risk |
| --- | --- | --- |
| Required anti-affinity by hostname | Keep replicas off the same node | Can make rollouts unschedulable in small clusters. |
| Preferred anti-affinity by zone | Spread replicas across failure domains | Scheduler may co-locate if capacity is tight. |
| Pod affinity | Co-locate chatty services or cache clients | Can create hotspots and noisy neighbor coupling. |

Prefer topology spread constraints for many replica-spreading requirements because they are more direct and often easier to reason about.
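
The small-cluster risk of required hostname anti-affinity comes down to counting. The sketch assumes the rule allows one matching Pod per node and that old and new Pods carry the same label during a rollout; it reports how many Pods cannot be placed at the rollout peak. Whether the rollout still progresses then depends on maxUnavailable, which this function (a hypothetical helper) does not model.

```python
def pending_at_rollout_peak(replicas, max_surge, eligible_nodes):
    """Pods left Pending when each eligible node may hold only one matching Pod."""
    peak_pods = replicas + max_surge
    return max(0, peak_pods - eligible_nodes)


print(pending_at_rollout_peak(replicas=4, max_surge=1, eligible_nodes=4))  # 1
print(pending_at_rollout_peak(replicas=4, max_surge=1, eligible_nodes=5))  # 0
```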

Taints and tolerations

Taints repel Pods. Tolerations allow Pods to schedule onto tainted nodes, but they do not force placement.

kubectl taint nodes node-1 dedicated=payments:NoSchedule
kubectl label node node-1 workload.example.com/pool=payments

spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: payments
      effect: NoSchedule
  nodeSelector:
    workload.example.com/pool: payments

Effects:

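| Effect | Meaning |
| --- | --- |
| NoSchedule | New Pods without a matching toleration will not schedule there. Pods already running stay. |
| PreferNoSchedule | Scheduler tries to avoid the node but may still use it. |
| NoExecute | New Pods without a matching toleration will not schedule there, and running Pods without one are evicted. A toleration with tolerationSeconds delays eviction by that long. |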

Pattern: use taint plus toleration plus node affinity or selector for dedicated nodes. A toleration alone only permits placement; it does not request it.
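
The permit-but-not-request behavior of tolerations can be checked with a small matcher. This is a simplified sketch of the matching rules (Equal and Exists operators, optional effect) with hypothetical function names; it ignores tolerationSeconds.

```python
def tolerates(toleration, taint):
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with no key tolerates every taint.
        return toleration.get("key") in (None, "", taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(node_taints, tolerations):
    """True when every NoSchedule or NoExecute taint on the node is tolerated."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


TAINT = {"key": "dedicated", "value": "payments", "effect": "NoSchedule"}
TOLERATION = {"key": "dedicated", "operator": "Equal",
              "value": "payments", "effect": "NoSchedule"}

print(schedulable([TAINT], [TOLERATION]))  # True: allowed on the dedicated node
print(schedulable([], [TOLERATION]))       # True: also allowed on untainted nodes
print(schedulable([TAINT], []))            # False: other Pods are kept off
```

The second result is the reason the pattern needs a selector or node affinity: the toleration alone leaves general nodes just as eligible.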

Topology spread constraints

Topology spread constraints distribute Pods across zones, nodes, racks, or custom domains.

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: payments-api
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: payments-api

Guidance:

  • Use zone spread for availability.
  • Use hostname spread to reduce node failure blast radius.
  • Use DoNotSchedule when imbalance is worse than delay.
  • Use ScheduleAnyway when availability of a new replica matters more than perfect spread.
  • Review constraints with autoscaler behavior. Strict spread can require node creation in specific zones.
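
The maxSkew check can be sketched as follows. It assumes every eligible domain appears in the counts (including zones with zero matching Pods) and ignores minDomains and node affinity filtering; the function name is illustrative.

```python
def allowed_domains(counts, max_skew=1):
    """Domains where one more matching Pod keeps skew within max_skew."""
    floor = min(counts.values())
    return [domain for domain, n in counts.items() if n + 1 - floor <= max_skew]


print(allowed_domains({"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}))
# ['us-east-1b', 'us-east-1c']
print(allowed_domains({"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 0}))
# ['us-east-1c']
```

In the second case a DoNotSchedule constraint leaves the Pod Pending unless us-east-1c has room, which is how strict spread ends up dictating where the node autoscaler must add capacity.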

RuntimeClass and device plugins

RuntimeClass selects a container runtime handler. It is used for sandboxed runtimes such as gVisor and Kata Containers, Wasm runtimes, Windows nodes, or other runtime-specific behavior.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-worker
spec:
  runtimeClassName: gvisor
  containers:
    - name: worker
      image: ghcr.io/example/worker:1.0.0

Device plugins advertise extended resources such as GPUs.

resources:
  limits:
    nvidia.com/gpu: 1

Rules:

  • Extended resources cannot be overcommitted. If both request and limit are set they must be equal; if only the limit is set, Kubernetes uses it as the request.
  • Nodes must run the vendor device plugin.
  • Capacity planning must include driver rollout, runtime compatibility, scheduling labels, and quota.

Horizontal Pod Autoscaler

HPA changes replica count based on metrics. It works well for stateless horizontally scalable workloads.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 4
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

Commands:

kubectl get hpa -n prod
kubectl describe hpa -n prod payments-api
kubectl top pod -n prod -l app.kubernetes.io/name=payments-api

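HPA computes CPU utilization as average usage divided by the CPU request, so it depends on accurate requests. If the CPU request is too low, utilization reads high and HPA can scale too aggressively. If it is too high, utilization reads low and HPA may under-scale.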
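
The sensitivity to requests follows from the documented scaling formula, desired = ceil(current × currentUtilization / targetUtilization). The sketch below applies it with the default 10% tolerance and the min and max replicas from the manifest above; it ignores the behavior policies, readiness handling, and missing-metric rules, and the usage figure is invented.

```python
import math


def hpa_desired_replicas(current_replicas, avg_usage_m, request_m,
                         target_utilization, min_replicas, max_replicas,
                         tolerance=0.1):
    utilization = 100.0 * avg_usage_m / request_m  # percent of the CPU request
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas                 # close enough: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Same Pods, same 200m average usage, three different CPU requests.
for request_m in (100, 250, 500):
    print(request_m, hpa_desired_replicas(4, 200, request_m, 65, 4, 30))
# 100 13
# 250 5
# 500 4
```

Identical load yields 13, 5, or 4 replicas depending only on the declared request, so the request has to be calibrated before the 65% target means anything.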

Vertical Pod Autoscaler

VPA recommends or applies resource request changes based on observed usage. It is useful for right-sizing and for services with stable resource profiles.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"

Modes:

| Mode | Behavior | Use |
| --- | --- | --- |
| Off | Recommends only | Safest starting point and GitOps review input. |
| Initial | Sets requests only at Pod creation | Good for jobs or services where restarts are controlled elsewhere. |
| Auto or Recreate | Evicts Pods to apply recommendations | Requires PDBs, disruption tolerance, and review. |

HPA and VPA can conflict when both act on CPU. A common pattern is HPA for replicas and VPA in recommendation-only mode for requests.

Cluster Autoscaler and Karpenter

Pod autoscalers change Pod count. Node autoscalers change cluster capacity.

| Autoscaler | Model | Strength | Watchouts |
| --- | --- | --- | --- |
| Cluster Autoscaler | Adjusts existing node groups based on unschedulable Pods and underutilized nodes | Mature and widely supported | Node group shapes must already exist. Scale-down can be blocked by PDBs, local storage, or system Pods. |
| Karpenter | Provisions right-sized nodes from flexible constraints | Fast, flexible bin packing and instance selection | Requires careful NodePool, disruption, consolidation, and cloud quota design. |

[Diagram: Cluster Autoscaler flow]

[Diagram: Karpenter overview]

Node autoscaling guidance:

  • Requests must represent real need. Autoscalers cannot infer hidden memory or CPU requirements.
  • Strict node affinity, topology spread, taints, PVC zone binding, RuntimeClass, and device requests constrain scale-up options.
  • Reserve capacity for DaemonSets and system Pods.
  • Configure scale-down disruption windows for stateful and latency-sensitive workloads.
  • Monitor cloud quota, subnet IP exhaustion, image pull time, and node bootstrap time.

Scheduling failure troubleshooting

Start from events. The scheduler's FailedScheduling event usually states why each group of nodes was rejected.

kubectl get pod -n prod <pod> -o wide
kubectl describe pod -n prod <pod>
kubectl get events -n prod --sort-by=.lastTimestamp
kubectl get nodes --show-labels
kubectl describe node <node>
kubectl get priorityclass
kubectl get pdb -n prod

Common messages:

| Event text fragment | Likely cause | Next check |
| --- | --- | --- |
| Insufficient cpu | Requests do not fit free allocatable CPU | Node allocatable, DaemonSet overhead, pending Pods, autoscaler scale-up. |
| Insufficient memory | Requests do not fit free allocatable memory | Memory requests, node pools, quota, VPA recommendations. |
| had untolerated taint | Pod lacks toleration for tainted node | Taints, tolerations, dedicated node policy. |
| didn't match Pod's node affinity/selector | Hard node placement rule excludes nodes | Labels, nodeSelector, required node affinity. |
| didn't match pod affinity/anti-affinity | Required Pod placement rule cannot be satisfied | Existing Pod labels, topologyKey, replica count, zones. |
| max node group size reached | Node autoscaler cannot add nodes | Node group limits, cloud quota, Karpenter NodePool limits. |
| preemption is not helpful | Evicting lower-priority Pods would still not fit | Hard constraints, resource shape, volume zone, device availability. |
| pod has unbound immediate PersistentVolumeClaims | Storage binding blocks scheduling | PVC, StorageClass binding mode, zone constraints. |

Capacity planning

Capacity planning links application SLOs, workload requests, replica counts, rollout surge, disruption budgets, and node pool shapes.

Useful formulas:

steady_cpu_request = replicas * per_pod_cpu_request
steady_memory_request = replicas * per_pod_memory_request
rollout_peak_pods = replicas + maxSurge
node_reserved_capacity = kube_reserved + system_reserved + daemonset_requests
usable_node_capacity = allocatable - daemonset_requests - safety_headroom
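
A worked example shows how these formulas combine into a node count. The node allocatable, DaemonSet, and headroom figures below are invented for illustration, units are millicores and MiB, maxSurge is given as an absolute Pod count, and the function names are hypothetical.

```python
import math


def usable_node_capacity(allocatable, daemonset_requests, safety_headroom):
    return allocatable - daemonset_requests - safety_headroom


def nodes_for_rollout(replicas, max_surge, per_pod, usable):
    """Nodes needed to hold every Pod at the rollout peak."""
    rollout_peak_pods = replicas + max_surge
    # The scarcest resource decides how many Pods fit on one node.
    pods_per_node = min(usable[r] // per_pod[r] for r in per_pod)
    if pods_per_node < 1:
        raise ValueError("one Pod does not fit on this node shape")
    return math.ceil(rollout_peak_pods / pods_per_node)


usable = {
    "cpu_m": usable_node_capacity(3920, 300, 400),          # 3220
    "memory_mi": usable_node_capacity(15000, 500, 1500),    # 13000
}
per_pod = {"cpu_m": 250, "memory_mi": 256}

print(nodes_for_rollout(replicas=30, max_surge=8, per_pod=per_pod, usable=usable))  # 4
```

CPU allows 12 Pods per node while memory would allow 50, so this node shape wastes memory for this workload, which is the resource-ratio mismatch the checklist below warns about.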

Checklist:

  • Measure p50, p95, and peak CPU and memory per workload.
  • Separate request sizing from limit sizing.
  • Include sidecars, init containers, and DaemonSets.
  • Include rollout surge, HPA max replicas, CronJob overlap, and batch windows.
  • Keep node pools shaped for workload resource ratios. CPU-heavy and memory-heavy workloads should not always share the same pool.
  • Account for zones. A three-zone service should survive losing one zone without saturating the remaining zones.
  • Account for image pull, cold start, and warmup time in scale-up targets.
  • Track quota headroom for namespace ResourceQuota and cloud provider limits.
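
The zone item in the checklist can be turned into a replica count. The sketch assumes replicas are spread evenly across zones and that the peak-load replica requirement is already known; the function name is illustrative.

```python
import math


def replicas_for_zone_loss(required_at_peak, zones):
    """Total replicas so the surviving zones still hold required_at_peak Pods."""
    if zones < 2:
        raise ValueError("surviving a zone loss needs at least two zones")
    per_zone = math.ceil(required_at_peak / (zones - 1))
    return per_zone * zones


print(replicas_for_zone_loss(required_at_peak=12, zones=3))  # 18
```

With three zones this means running at no more than about two thirds of total capacity at peak, and it only helps if the node pools in the surviving zones can actually hold their share.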

Common mistakes

| Mistake | Symptom | Correction |
| --- | --- | --- |
| No requests | BestEffort Pods, bad bin packing, autoscaler blind spots | Set requests for every production container. |
| CPU limit too low | Latency spikes and throttling under load | Remove CPU limit or raise it after measuring throttling. |
| Memory limit too close to steady usage | Periodic OOMKilled during bursts | Increase limit or reduce peak memory behavior. |
| Hard anti-affinity in small clusters | Rollouts stuck Pending | Use preferred anti-affinity or topology spread with realistic capacity. |
| Toleration without selector | Workload meant for dedicated nodes also lands on general nodes | Pair toleration with node affinity or selector. |
| HPA target without correct requests | Over-scaling or under-scaling | Calibrate requests before trusting utilization. |
| VPA auto mode on fragile services | Evictions cause incidents | Start with recommendation-only mode. |
| Ignoring DaemonSet overhead | New nodes cannot fit expected Pods | Subtract per-node agents from allocatable capacity. |

Review checklist

  • Every production container has CPU and memory requests.
  • Memory limits match observed peaks and failure tolerance.
  • CPU limits are justified for the workload class.
  • QoS class is intentional.
  • LimitRange and ResourceQuota enforce guardrails without hiding bad sizing.
  • PriorityClass and preemption policy are limited to clear service tiers.
  • Node selectors, affinity, taints, tolerations, topology spread, RuntimeClass, and device requests are reviewed together.
  • HPA max replicas and rollout maxSurge fit cluster and quota capacity.
  • VPA mode is compatible with disruption tolerance.
  • Cluster Autoscaler or Karpenter has node shapes that can satisfy real pending Pods.
  • Capacity models include DaemonSets, system reservations, zones, batch windows, and cold-start time.