Production Operations Deployment Containers Scaling and Runbooks

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/nodejs-v8-runtime-engineering/17 Production Operations Deployment Containers Scaling and Runbooks.md.

Purpose: Provide an operations field manual for Node.js V8 Runtime Engineering covering deployments, production Linux hosts, containers, clusters, scaling, readiness, graceful shutdown, incident runbooks, and the security links to 16 Security Permissions Crypto Secrets Sandboxing and Dependency Risk plus ecosystem choices in 18 Node.js Ecosystem Frameworks Tooling and Learning Projects.

Related: Node.js V8 Runtime Engineering, 16 Security Permissions Crypto Secrets Sandboxing and Dependency Risk, 17 Production Operations Deployment Containers Scaling and Runbooks, 18 Node.js Ecosystem Frameworks Tooling and Learning Projects, 10 Filesystem Processes Signals Workers Cluster and Child Processes, 11 Networking HTTP TLS DNS Sockets Undici and Fetch, 14 Observability Diagnostics Inspector Tracing Profiling and Core Dumps, 15 Performance Engineering Benchmarking Flamegraphs GC and Event Loop Latency

Operational stance

Production Node.js is a fleet problem, not a node server.js problem. The runtime must receive signals, drain connections, respect cgroup memory, expose health, emit structured logs, report version and build identity, survive dependency failures, and be debuggable without turning an incident into a data leak.

The four operating environments are different:

| Environment | Primary goal | What is real | What is misleading |
| --- | --- | --- | --- |
| Local learning machine | understand runtime behavior and APIs | debugger, REPL, test runner, small profiles | CPU limits, DNS, TLS roots, cgroups, load balancer behavior |
| Production Linux host | supervised long-running process | systemd, journald, ulimit, coredumps, kernel TCP state | cluster probes, pod eviction, service mesh policy |
| Production container | reproducible runtime artifact | image layers, PID 1 behavior, cgroups, read-only filesystems | full host tools, mutable installs, local shell assumptions |
| Production cluster | rolling change under load | readiness, liveness, service discovery, autoscaling, network policy | simple process-level debugging, fixed local IPs |

Release selection

The official Node.js release page says production applications should only use Active LTS or Maintenance LTS releases. Current releases are useful for validating upcoming behavior and using new APIs, but production fleets should treat LTS as the default unless the service has a documented exception and an upgrade rollback plan.

| Choice | Use when | Runbook requirement |
| --- | --- | --- |
| Active LTS | default production services | normal patch cadence |
| Maintenance LTS | stable services near upgrade window | scheduled migration before end of support |
| Current | early API adoption or compatibility testing | explicit owner, rollback, and security monitoring |
| EOL | never for production | emergency upgrade plan |
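
The table above can be backed by a startup guard. The sketch below is a minimal illustration, assuming that `process.release.lts` (a codename string on LTS release lines, undefined otherwise) is an acceptable signal. The function names and the `allowCurrent` option are hypothetical, and the check cannot detect EOL because an expired LTS line still reports its codename.

```javascript
// Classify a Node.js binary from its release metadata.
// process.release.lts is a codename string on LTS lines and undefined otherwise.
// Limitation: an EOL LTS line still carries its codename, so EOL needs a separate check.
function classifyRelease(release = process.release, version = process.version) {
  const major = Number(version.slice(1).split('.')[0]);
  return {
    version,
    major,
    lts: release.lts ?? null,
    productionDefault: typeof release.lts === 'string',
  };
}

// Hypothetical startup guard: non-LTS builds need an explicit, documented exception.
function assertProductionRelease(info, { allowCurrent = false } = {}) {
  if (!info.productionDefault && !allowCurrent) {
    throw new Error(`Node ${info.version} is not an LTS release; document an exception to run it`);
  }
  return info;
}
```

A service could call `assertProductionRelease(classifyRelease())` before listening and pass `allowCurrent: true` only when the documented exception exists.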

Runtime metadata to expose:

import process from 'node:process';

export function runtimeInfo() {
  return {
    node: process.version,
    v8: process.versions.v8,
    uv: process.versions.uv,
    openssl: process.versions.openssl,
    modules: process.versions.modules,
    platform: process.platform,
    arch: process.arch,
    pid: process.pid,
  };
}

Expose this in an authenticated diagnostics endpoint or startup log, not a public unauthenticated endpoint.

Build pipeline

A production build should be reproducible from source, lockfile, build image, and environment.

source checkout
  verify package manager version
  frozen install
  lint
  typecheck
  unit tests
  integration tests
  build
  generate SBOM
  build image
  scan image
  smoke test image
  sign or attest artifact
  deploy by immutable digest

| Stage | Failure that should stop deploy |
| --- | --- |
| install | lockfile drift, registry auth failure, install script failure |
| lint and typecheck | generated code drift, unsafe API, module mismatch |
| tests | unit, integration, contract, migration, and smoke failures |
| build | missing assets, incorrect target, native addon compile error |
| SBOM | missing lockfile or unsupported package graph |
| image scan | critical reachable runtime vulnerability |
| smoke test | process cannot boot, health endpoint fails, signal drain fails |
| deploy | readiness never turns true, error budget burn, migration lock |

Example CI commands:

npm ci
npm run lint
npm run typecheck
node --test --test-randomize
npm run build
npm sbom --sbom-format=cyclonedx --json > sbom.cdx.json

Container image shape

The official Node Docker image project recommends running Node directly in CMD rather than using npm start so the Node process receives exit signals directly. Use multi-stage builds to keep build tools out of the runtime image.

Example Dockerfile:

FROM node:24-bookworm-slim AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

FROM node:24-bookworm-slim AS build
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
RUN npm prune --omit=dev

FROM node:24-bookworm-slim AS runtime
ENV NODE_ENV=production
WORKDIR /app
USER node
COPY --chown=node:node package.json package-lock.json ./
COPY --chown=node:node --from=build /app/node_modules ./node_modules
COPY --chown=node:node --from=build /app/dist ./dist
CMD ["node", "dist/server.mjs"]

Image checklist:

| Control | Reason |
| --- | --- |
| immutable tag or digest in deployment | avoids surprise rebuilds under same tag |
| NODE_ENV=production | many frameworks enable production behavior |
| direct node command | clean signal delivery |
| non-root user | reduces filesystem and host blast radius |
| no package manager needed at runtime | smaller attack surface |
| no source maps publicly served | avoid leaking source unless intentionally private |
| read-only root filesystem | catches accidental writes |
| /tmp scratch mount | only intentional writable path |
| CA bundle present | outbound TLS works |
| timezone policy explicit | logs and scheduling are predictable |

Configuration

Configuration should be explicit, typed, validated once at startup, and visible in redacted diagnostics.

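import process from 'node:process';

// Read a required string; a missing or empty value is a startup error.
function required(name) {
  const value = process.env[name];
  if (!value) throw new Error(`missing env ${name}`);
  return value;
}

// Read a non-negative integer; only an unset variable falls back to the default.
function integer(name, fallback) {
  const raw = process.env[name];
  if (raw === undefined) return fallback;
  const value = Number(raw);
  // Number('') is 0, so reject empty or whitespace-only strings explicitly.
  if (raw.trim() === '' || !Number.isInteger(value) || value < 0) {
    throw new Error(`invalid env ${name}`);
  }
  return value;
}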

export const config = Object.freeze({
  nodeEnv: required('NODE_ENV'),
  port: integer('PORT', 3000),
  databaseUrl: required('DATABASE_URL'),
  shutdownMs: integer('SHUTDOWN_MS', 30000),
});

Rules:

| Rule | Reason |
| --- | --- |
| validate before listening | fail fast before receiving traffic |
| redact secrets in config dumps | diagnostics are often shared |
| keep config immutable | prevents runtime drift |
| distinguish missing from empty | empty secrets are usually bugs |
| parse numbers and booleans once | string truthiness causes outages |
| include config version | helps correlate deploys |
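
Two of the rules above, strict boolean parsing and redacted config dumps, can be sketched as follows. This is an illustration under stated assumptions: the accepted boolean spellings, the `booleanEnv` and `redactConfig` names, and the key-suffix pattern used to detect secrets are all hypothetical choices, not part of the original note.

```javascript
// Parse a boolean env var strictly so that strings like "false" are never truthy.
// The accepted spellings ('true', '1', 'false', '0') are an assumption.
function booleanEnv(name, fallback, env = process.env) {
  const raw = env[name];
  if (raw === undefined) return fallback;
  if (raw === 'true' || raw === '1') return true;
  if (raw === 'false' || raw === '0') return false;
  throw new Error(`invalid boolean env ${name}`);
}

// Keys ending in these words are treated as secret; the pattern is an assumption.
const SECRET_KEY = /(secret|token|password|url|key)$/i;

// Return a copy of the config that is safe to log or show in diagnostics.
function redactConfig(config) {
  return Object.fromEntries(
    Object.entries(config).map(([key, value]) => [
      key,
      SECRET_KEY.test(key) ? '[redacted]' : value,
    ]),
  );
}
```

Connection URLs are redacted here because they often embed credentials; a real service would more likely keep an explicit list of secret keys than rely on a naming pattern.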

Process lifecycle

Node process shutdown is cooperative. process.on('exit') handlers can only do synchronous work because the event loop is already ending. Graceful shutdown therefore belongs in handlers for signals such as SIGTERM, where asynchronous cleanup can still run.

Minimal HTTP drain:

import http from 'node:http';
import process from 'node:process';

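// Placeholder handler; substitute the application's real request listener.
const app = (req, res) => res.end('ok');
const server = http.createServer(app);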
const sockets = new Set();
let shuttingDown = false;

server.on('connection', (socket) => {
  sockets.add(socket);
  socket.on('close', () => sockets.delete(socket));
});

server.listen(process.env.PORT ?? 3000);

async function shutdown(signal) {
  if (shuttingDown) return;
  shuttingDown = true;
  console.log(JSON.stringify({ level: 'info', event: 'shutdown_start', signal }));

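  // Stop accepting new connections; the callback runs once open connections have ended.
  server.close((err) => {
    if (err) {
      console.error(JSON.stringify({ level: 'error', event: 'server_close_error', message: err.message }));
      process.exitCode = 1;
    }
  });

  // Hard deadline: destroy whatever is still open and exit.
  // unref() keeps this timer from holding the process open after a clean drain.
  setTimeout(() => {
    for (const socket of sockets) socket.destroy();
    process.exit();
  }, 30_000).unref();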
}

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);

Production additions:

| Concern | Add |
| --- | --- |
| keep-alive drains | stop accepting, close idle, cap active deadline |
| queue workers | stop polling, finish or requeue active jobs |
| database | stop new transactions, close pool after requests drain |
| telemetry | flush spans and logs with short timeout |
| readiness | mark not ready before closing |
| second signal | force exit |
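
The readiness and second-signal rows can be expressed as a small state machine. The sketch below is one possible shape, not the note's implementation: `createShutdownCoordinator`, `onDrain`, and `onForce` are hypothetical names, and the callbacks stand in for the real drain and exit logic.

```javascript
// Minimal shutdown coordinator.
// First signal: flip readiness to false, then start the drain.
// Second signal: give up on the drain and force exit.
function createShutdownCoordinator({ onDrain, onForce }) {
  let state = 'running';
  return {
    // Readiness probes should report false as soon as the drain starts.
    isReady: () => state === 'running',
    state: () => state,
    handle(signal) {
      if (state === 'running') {
        state = 'draining'; // readiness turns false before anything is closed
        onDrain(signal);
      } else if (state === 'draining') {
        state = 'forced';
        onForce(signal);
      }
      return state;
    },
  };
}
```

Wiring it up could look like `process.on('SIGTERM', (signal) => coordinator.handle(signal))`, with `onDrain` closing the server and `onForce` calling `process.exit(1)`. A third or later signal is ignored in this sketch.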

Health endpoints

Separate liveness from readiness.

| Endpoint | Meaning | Should check |
| --- | --- | --- |
| /livez | process should be restarted if false | event loop wedged, fatal internal state |
| /readyz | process can receive traffic | startup complete, dependencies needed for requests available |
| /healthz | human summary if used | version, uptime, degraded dependencies |

Example:

import { monitorEventLoopDelay } from 'node:perf_hooks';

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

let ready = false;

export function markReady() {
  ready = true;
}

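export function healthHandler(req, res) {
  if (req.url === '/livez') {
    // percentile() reports nanoseconds; convert to milliseconds.
    const p99Ms = loopDelay.percentile(99) / 1e6;
    const live = p99Ms <= 5000;
    res.writeHead(live ? 200 : 500);
    res.end(live ? 'ok' : 'event loop stalled');
    return;
  }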

  if (req.url === '/readyz') {
    res.writeHead(ready ? 200 : 503);
    res.end(ready ? 'ready' : 'not ready');
    return;
  }

  res.writeHead(404);
  res.end();
}

Avoid deep dependency checks in liveness. A database outage should not cause every pod to restart.
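
A delay histogram that is never reset reports percentiles accumulated since it was enabled, so one old stall can influence liveness for a long time. The sketch below shows one way to evaluate a fresh window per probe. It assumes `histogram` behaves like the object returned by `monitorEventLoopDelay()` in `node:perf_hooks` (`percentile()` in nanoseconds, plus `reset()`); the `checkLoopDelay` name and the default threshold are illustrative.

```javascript
// Read p99 event loop delay for the window since the previous probe, then reset.
// `histogram` is expected to expose percentile(n) in nanoseconds and reset().
function checkLoopDelay(histogram, { thresholdMs = 5000 } = {}) {
  const p99Ms = histogram.percentile(99) / 1e6;
  histogram.reset(); // start a fresh window so old stalls do not linger
  return { p99Ms, live: p99Ms <= thresholdMs };
}
```

With windowed readings the probe period effectively becomes the measurement window, so the liveness `failureThreshold` decides how many consecutive bad windows trigger a restart.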

Kubernetes deployment posture

Kubernetes docs describe Pods as the unit that carries containers, and resource requests and limits as how CPU and memory needs are declared. For Node services, requests and limits must be aligned with V8 heap, native memory, buffers, and sidecars.

Example deployment shape:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: registry.example.com/api@sha256:example
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: production
            - name: NODE_OPTIONS
              value: --max-old-space-size=768
          readinessProbe:
            httpGet:
              path: /readyz
              port: 3000
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /livez
              port: 3000
            periodSeconds: 10
            failureThreshold: 3
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            capabilities:
              drop: ["ALL"]

Memory rule of thumb:

| Component | Counts toward container memory |
| --- | --- |
| V8 old space | yes |
| young generation | yes |
| Buffers and external memory | yes |
| native addon memory | yes |
| stacks and code pages | yes |
| OpenSSL allocations | yes |
| sidecar memory | separate container, same pod scheduling pressure |

Set --max-old-space-size below the container memory limit so there is headroom for external memory and native overhead. If the limit is 1 GiB, a 768 MiB old-space cap may still be too high for buffer-heavy services.
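
The headroom arithmetic can be made explicit. This is a rough sketch: the `suggestOldSpaceMiB` name and the 25% default headroom are assumptions chosen to reproduce the 1 GiB and 768 MiB figures above, not a recommendation, and buffer-heavy services may need a much larger fraction.

```javascript
// Suggest a --max-old-space-size value (MiB) from a container memory limit (MiB).
// headroomFraction is the share of the limit left for external memory,
// native allocations, stacks, and code pages. The 0.25 default is an assumption.
function suggestOldSpaceMiB(limitMiB, { headroomFraction = 0.25 } = {}) {
  if (!(limitMiB > 0)) throw new RangeError('limitMiB must be positive');
  if (!(headroomFraction > 0 && headroomFraction < 1)) {
    throw new RangeError('headroomFraction must be between 0 and 1');
  }
  return Math.floor(limitMiB * (1 - headroomFraction));
}
```

On recent Node.js versions `process.constrainedMemory()` may be able to supply the limit at runtime, but its availability and return value when no limit is set vary by version, so treat that as something to verify rather than rely on.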

Scaling

Node's single JavaScript thread makes the scaling axis explicit.

| Bottleneck | First scale action | Later action |
| --- | --- | --- |
| JavaScript CPU | more processes or pods | worker pool for isolated CPU path |
| event loop blocking | remove sync code | isolate heavy route or queue |
| DB latency | connection pool and query tuning | read replicas or cache |
| outbound API | timeout, retry, circuit breaker | queue, bulkhead, provider quota |
| memory | fix retention, cap caches | larger pod only after evidence |
| libuv threadpool | tune UV_THREADPOOL_SIZE for crypto/fs/zlib workload | move work to dedicated service |
| network sockets | keep-alive and agent limits | load balancer and kernel tuning |

Use horizontal scaling for independent request concurrency. Use worker threads for CPU-bound JavaScript. Use cluster or multiple processes when process isolation and multi-core use on one host matter. In Kubernetes, prefer multiple pods over in-process cluster unless there is a clear host-level reason.

Connection pools

| Pool | Common failure | Guardrail |
| --- | --- | --- |
| database | every pod opens max connections and overwhelms DB | calculate max pool per pod times replicas |
| HTTP client | unbounded sockets to upstream | configure agent or Undici dispatcher limits |
| Redis | reconnect storm | jittered backoff and max retry budget |
| worker threads | too many workers for CPU | fixed pool near available cores |
| libuv threadpool | crypto and fs starve each other | measure queueing before raising size |

Pool sizing example:

database max connections: 300
reserved for admin and migrations: 30
available for app: 270
max pods during rollout: 12
safe pool per pod: floor(270 / 12) = 22
chosen pool per pod: 20
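
The same arithmetic as a small function, so it can run as a deploy-time check. The function and parameter names are illustrative.

```javascript
// Upper bound on per-pod pool size so that the peak pod count during a rollout
// cannot exceed the database's connection budget.
function safePoolPerPod({ maxConnections, reserved, maxPods }) {
  const available = maxConnections - reserved;
  if (available <= 0 || maxPods <= 0) {
    throw new RangeError('no connections available for the app');
  }
  return Math.floor(available / maxPods);
}

const ceiling = safePoolPerPod({ maxConnections: 300, reserved: 30, maxPods: 12 });
console.log(ceiling); // 22, matching the worked example above
```

Choosing 20 instead of the ceiling of 22 leaves slack for connections the math does not see, such as ad hoc admin sessions.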

Timeouts, retries, and backpressure

Every outbound call should have a timeout. Every retry should have a budget. Every queue should have a max depth or admission policy.

| Layer | Control |
| --- | --- |
| inbound HTTP | header timeout, request timeout, body size limit |
| route handler | per-request deadline propagated by AbortSignal |
| outbound HTTP | connect and response timeout |
| database | statement timeout and pool acquisition timeout |
| queue | visibility timeout and dead-letter policy |
| retries | exponential backoff with jitter and max attempts |
| circuit breaker | fail fast during dependency outage |

Example fetch with deadline:

export async function fetchJson(url, { timeoutMs = 3000 } = {}) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(url, { signal: controller.signal });
    if (!response.ok) throw new Error(`upstream ${response.status}`);
    return await response.json();
  } finally {
    clearTimeout(timeout);
  }
}
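
The retries row in the table above calls for exponential backoff with jitter and a maximum attempt count. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names, default values, and the injectable `random` and `sleep` parameters are assumptions made for illustration and testability.

```javascript
// Full-jitter backoff: delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(attempt, { baseMs = 100, capMs = 5000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Default sleep helper based on a timer.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run an async operation with a bounded number of attempts.
// The last error is rethrown once the attempt budget is spent.
async function retry(operation, { maxAttempts = 3, sleep = defaultSleep, ...backoff } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Do not sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt, backoff));
    }
  }
  throw lastError;
}
```

This sketch retries every error. In practice only idempotent operations and retryable error classes should be retried, and the total retry time should fit inside the caller's deadline.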

Logging

Cluster logging should survive container crashes and node loss. Kubernetes docs describe cluster-level logging as requiring separate storage and a lifecycle independent of nodes, pods, or containers. Node services should write structured logs to stdout and stderr, then let the platform collect them.

Log fields:

| Field | Reason |
| --- | --- |
| timestamp | ordering across systems |
| level | filtering |
| message or event | human and machine meaning |
| request id | single request correlation |
| trace id | distributed tracing |
| user or tenant id | authorization and impact analysis |
| route or operation | aggregate by behavior |
| duration ms | latency |
| status code | outcome |
| version and pod | deploy correlation |

Avoid:

| Avoid | Why |
| --- | --- |
| secrets | logs are widely replicated |
| full request bodies | PII and cost |
| unbounded objects | circular refs, huge logs |
| multi-line stack blobs without structure | hard parsing |
| per-request debug logs by default | cost and noise |
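
A minimal sketch of a log formatter that follows the field and avoid lists above: one JSON object per line, secrets replaced, nested objects dropped, and errors kept structured. The `logLine` name, the redaction list, and the placeholder strings are assumptions; a production service would more likely use an established logging library.

```javascript
// Field names treated as secret; the list is an illustrative assumption.
const REDACTED_FIELDS = new Set(['password', 'token', 'authorization', 'secret']);

// Build one single-line JSON log record.
function logLine(level, event, fields = {}, now = new Date()) {
  const safe = {};
  for (const [key, value] of Object.entries(fields)) {
    if (REDACTED_FIELDS.has(key.toLowerCase())) {
      safe[key] = '[redacted]';
    } else if (value instanceof Error) {
      // Keep the stack inside a JSON string so the record stays on one line.
      safe[key] = { name: value.name, message: value.message, stack: value.stack };
    } else if (typeof value === 'object' && value !== null) {
      // Drop nested objects to avoid circular references and unbounded output.
      safe[key] = '[object omitted]';
    } else {
      safe[key] = value;
    }
  }
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...safe });
}
```

Typical use would be `console.log(logLine('info', 'request_done', { requestId, route, durationMs, statusCode }))`, with the platform collecting stdout.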

Deployment strategies

| Strategy | Use when | Risk |
| --- | --- | --- |
| rolling update | normal stateless service | slow error detection can affect many users |
| blue-green | fast rollback and immutable environments | double capacity and state compatibility |
| canary | high-risk behavior change | needs metrics and traffic routing |
| feature flag | decouple deploy from release | stale flags become complexity |
| shadow traffic | compare behavior without user impact | privacy and side effects must be controlled |

Pre-deploy checklist:

| Check | Evidence |
| --- | --- |
| migrations backward compatible | old and new app can run together |
| readiness gate works | new pod does not receive traffic before warmup |
| graceful shutdown works | old pod drains before kill |
| dashboards updated | new version, saturation, errors, latency |
| rollback command known | exact artifact digest or previous release |
| runbook linked | on-call can act without source archaeology |

Database migrations

Safe migration sequence:

expand schema
deploy code that writes old and new or tolerates both
backfill
switch reads
stop old writes
contract schema
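
The second step, code that writes old and new and tolerates both, might look like the sketch below for a column rename. Everything specific here is hypothetical: the `full_name` (old) and `display_name` (new) columns and the `toRow` and `fromRow` helpers exist only to illustrate the pattern.

```javascript
// Expand phase: write both columns so old and new pods can run side by side.
// full_name is the hypothetical old column, display_name the hypothetical new one.
function toRow(user) {
  return { id: user.id, full_name: user.name, display_name: user.name };
}

// Tolerant read: prefer the new column and fall back to the old one
// for rows the backfill has not reached yet.
function fromRow(row) {
  return { id: row.id, name: row.display_name ?? row.full_name };
}
```

Once the backfill is complete and reads have switched, the dual write and the fallback can be removed before the contract step drops the old column.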

Footguns:

| Footgun | Outage pattern |
| --- | --- |
| app deploy and destructive migration together | old pods crash during rolling update |
| long lock in migration | p99 latency spike and connection pileup |
| migration in every pod startup | race or repeated work |
| no migration timeout | stuck deploy blocks rollout |
| no rollback story | partial schema state traps the service |

Runbook template

Service:
Owner:
Dashboard:
Logs:
Trace query:
SLO:
Dependencies:
Rollback command:
Feature flags:
Recent deploy command:

Symptoms:
First checks:
Containment:
Diagnosis:
Recovery:
Verification:
Escalation:
Post-incident notes:

Every runbook should have commands, not only prose. Every command should state whether it is read-only or mutating.

Incident runbooks

High CPU

| Step | Action |
| --- | --- |
| confirm | CPU saturation, throttling, request rate, version |
| correlate | deploy, traffic mix, dependency outage |
| profile | CPU profile or sampling profiler on one replica |
| contain | scale horizontally if event loop still makes progress |
| mitigate | disable feature flag or roll back |
| verify | p99 latency, event loop delay, CPU per request |

Common causes:

| Cause | Evidence |
| --- | --- |
| sync JSON or regex path | hot function in profile |
| compression in app | zlib CPU and response route |
| log serialization | logger stack in profile |
| crypto hash burst | libuv and CPU pressure |
| bad retry loop | high outbound attempt count |
bad retry loophigh outbound attempt count

High memory or OOMKilled

| Step | Action |
| --- | --- |
| confirm | RSS, heapUsed, external, container limit |
| identify | V8 heap versus external memory |
| capture | heap snapshot only if safe for PII |
| contain | rollback or reduce traffic |
| tune | old-space cap only after identifying memory class |
| verify | stable RSS plateau and GC behavior |

Troubleshooting:

| Signal | Interpretation |
| --- | --- |
| heapUsed climbs with object count | JavaScript retention |
| RSS climbs but heap stable | Buffer, native, OpenSSL, allocator, fragmentation |
| OOM with low average memory | spike, burst, or limit too tight |
| GC pauses rise | allocation churn or heap pressure |
| old-space near cap | increase only if workload legitimately needs it |
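
The "identify" step can start from a single `process.memoryUsage()` sample. The sketch below is a crude first-pass heuristic, not a diagnosis: the `classifyMemory` name and the simple heap-versus-rest comparison are assumptions, and a trend over time says far more than one reading.

```javascript
// Split one memory sample into JavaScript heap versus everything else.
// `usage` has the shape returned by process.memoryUsage() (values in bytes).
function classifyMemory({ rss, heapUsed, external }, limitBytes) {
  const nonHeap = rss - heapUsed;
  return {
    heapShare: heapUsed / rss,          // fraction of RSS held by live JS heap
    externalShare: external / rss,      // fraction attributed to Buffers and other external memory
    limitShare: limitBytes ? rss / limitBytes : null, // how close RSS is to the container limit
    suspect: heapUsed >= nonHeap ? 'javascript-heap' : 'external-or-native',
  };
}
```

A call such as `classifyMemory(process.memoryUsage(), 1024 * 1024 * 1024)` logged periodically gives the first two rows of the table something concrete to compare against.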

Event loop latency

| Step | Action |
| --- | --- |
| confirm | monitorEventLoopDelay p99, request latency |
| inspect | CPU profile and route correlation |
| search | sync APIs, large JSON, regex, compression |
| contain | shift traffic or disable path |
| fix | move CPU work to worker, stream, or queue |

Dependency outage

| Step | Action |
| --- | --- |
| identify | which upstream, error class, timeout class |
| contain | circuit breaker, disable feature, serve cache |
| reduce | lower retry attempts and concurrency |
| communicate | status page or internal incident channel |
| recover | restore normal retry and cache after provider healthy |

Bad deploy

| Step | Action |
| --- | --- |
| detect | error rate, latency, readiness failures |
| stop | pause rollout |
| rollback | deploy previous digest |
| verify | compare SLO and logs to baseline |
| preserve | keep failed version logs and image digest |
| follow up | regression test and deploy guard |

Troubleshooting table

| Symptom | First checks | Likely fix |
| --- | --- | --- |
| pod never ready | config validation, dependency check, startup logs | separate readiness from deep dependency checks |
| pod restarts during deploy | liveness too aggressive, boot slower than probe | add startup probe or adjust thresholds |
| SIGTERM loses requests | app exits before drain | implement graceful shutdown and readiness flip |
| image works locally, fails in cluster | user, CA, DNS, env, read-only filesystem | reproduce with same image and env |
| p99 spikes after scale-out | DB pool multiplied by replicas | lower per-pod pool |
| CPU throttled despite low average | limit too low for bursts | adjust CPU requests and limits based on p99 |
| memory OOM before heap cap | external memory or native overhead | lower old-space cap or fix buffers |
| logs missing after crash | local file logs | write to stdout and use cluster logging |
| rollback fails | migration not backward compatible | expand-contract migration discipline |

Production acceptance checklist

| Area | Required evidence |
| --- | --- |
| build | frozen install, tests, artifact digest |
| runtime | LTS Node version or documented exception |
| image | non-root, direct Node command, no dev deps |
| config | startup validation and redaction |
| health | separate live and ready endpoints |
| signals | graceful shutdown tested |
| resources | memory and CPU requests and limits based on measurement |
| observability | logs, metrics, traces, version labels |
| security | secrets not in image, egress controlled, dependency scan |
| scaling | pool math accounts for max rollout replicas |
| runbook | high CPU, memory, dependency outage, bad deploy |