Observability Diagnostics Inspector Tracing Profiling and Core Dumps

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/nodejs-v8-runtime-engineering/14 Observability Diagnostics Inspector Tracing Profiling and Core Dumps.md.

Purpose: Build a production diagnostic playbook for Node.js services. It connects the companion notes Node.js V8 Runtime Engineering and 15 Performance Engineering Benchmarking Flamegraphs GC and Event Loop Latency with diagnostics channels, async context, inspector sessions, trace events, diagnostic reports, heap snapshots, CPU profiles, and native core dumps in one evidence-first workflow.

Operating model

Node.js diagnostics are strongest when each tool is treated as a different camera angle, not as a replacement for the others. Logs tell what application code believed was happening. Metrics tell how often and how severely it happened. Traces tell request causality across async boundaries. Inspector profiles tell where V8 and JavaScript spent CPU or retained heap. Trace events expose runtime timelines for libuv, V8, async hooks, and Node categories. Diagnostic reports snapshot process state. Core dumps preserve native memory after a hard abort.

Use this decision path:

| Symptom | First evidence | Deep evidence | Last resort |
|---|---|---|---|
| Slow endpoint | Request histogram, trace span, event loop delay | CPU profile, trace events | Native profile or core if process wedges |
| Memory growth | RSS, heap used, external memory, allocation rate | Heap snapshot, heap profile, diagnostic report | Core dump with llnode or native debugger |
| Crash | Error log, exit code, diagnostic report | Core dump, native stack, fatal error report | Reproduce under debug build |
| Context loss | Trace parent gaps, missing request ID | AsyncLocalStorage probes, diagnostics_channel spans | Async hooks investigation |
| Event loop stall | perf_hooks histogram, blocked request spans | CPU profile, sync I/O trace, flamegraph | Core if stuck in native code |
| Library black box | diagnostics_channel subscription | OpenTelemetry instrumentation wrapper | Inspector breakpoint in staging |

The field rule: start low overhead and always capture a timestamp, Node version, process ID, container ID, release SHA, command line, and clock source. A profile without the deployment identity is often only trivia.
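The identity fields in the field rule can be collected once and attached to every artifact. The following is a minimal sketch, not a standard API: the function name `captureIdentity` and the environment variable names `RELEASE_SHA` and `CONTAINER_ID` are assumptions chosen for illustration, so substitute whatever your deployment actually injects.

```javascript
import os from 'node:os';

// Collects the deployment identity that should accompany every diagnostic artifact.
// RELEASE_SHA and CONTAINER_ID are hypothetical variable names, not Node conventions.
export function captureIdentity(env = process.env) {
  return {
    captured_at: new Date().toISOString(),
    clock_source: 'system wall clock (UTC)',
    node_version: process.version,
    v8_version: process.versions.v8,
    arch: process.arch,
    pid: process.pid,
    hostname: os.hostname(),
    container_id: env.CONTAINER_ID ?? null,
    release_sha: env.RELEASE_SHA ?? null,
    // Full command line, including flags passed to the node binary itself.
    command_line: [process.execPath, ...process.execArgv, ...process.argv.slice(1)],
  };
}
```

Writing this object next to a profile, report, or snapshot (for example as a sidecar JSON file) keeps the artifact tied to the build and process that produced it.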

Instrumentation layers

| Layer | Node primitive | Production use | Risk |
|---|---|---|---|
| Correlation context | AsyncLocalStorage from node:async_hooks | Request IDs, tenant IDs, trace correlation | Leaking context with enterWith() or custom async boundaries |
| Library diagnostics | node:diagnostics_channel | Publish structured events without hard dependency on telemetry vendors | Synchronous subscriber cost |
| Manual telemetry | OpenTelemetry API | Spans, metrics, logs around business operations | SDK must initialize before instrumented modules |
| Runtime metrics | node:perf_hooks, process.memoryUsage(), v8.getHeapStatistics() | Service SLOs and saturation signals | High cardinality labels and noisy intervals |
| Inspector | node:inspector and DevTools protocol | CPU profiles, heap snapshots, protocol automation | Security exposure and runtime overhead |
| Trace events | node:trace_events or CLI flags | Runtime timelines visible in Chrome trace viewer | Large files and category noise |
| Reports | process.report and report CLI flags | Crash and hang snapshots | Sensitive environment and network metadata unless excluded |
| Core dumps | OS core plus Node flags | Native post-mortem when process aborts | Huge artifacts with secrets and memory contents |

Diagnostics channel as the library seam

diagnostics_channel is the right API when a library wants to publish diagnostic data without depending on OpenTelemetry, a logger, or a metrics backend. Channel names should be stable, documented, and namespaced by package or subsystem. Message shape is part of the library contract.

// payment-diagnostics.mjs
import diagnosticsChannel from 'node:diagnostics_channel';
import { performance } from 'node:perf_hooks';

const paymentAttempt = diagnosticsChannel.channel('acme.payment.attempt');

export async function chargeCard(input, gateway) {
  const start = performance.now();
  const message = {
    gateway: gateway.name,
    currency: input.currency,
    amount_minor: input.amountMinor,
  };

  if (paymentAttempt.hasSubscribers) {
    paymentAttempt.publish({ ...message, phase: 'start', time_ms: start });
  }

  try {
    const result = await gateway.charge(input);
    if (paymentAttempt.hasSubscribers) {
      paymentAttempt.publish({
        ...message,
        phase: 'end',
        status: result.status,
        duration_ms: performance.now() - start,
      });
    }
    return result;
  } catch (error) {
    if (paymentAttempt.hasSubscribers) {
      paymentAttempt.publish({
        ...message,
        phase: 'error',
        error_name: error.name,
        duration_ms: performance.now() - start,
      });
    }
    throw error;
  }
}

Subscriber pattern:

// telemetry-bootstrap.mjs
import diagnosticsChannel from 'node:diagnostics_channel';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-diagnostics');

diagnosticsChannel.subscribe('acme.payment.attempt', (message) => {
  if (message.phase !== 'end' && message.phase !== 'error') return;

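  const span = tracer.startSpan('payment.gateway.charge', {
    // Backdate the start so the span covers the measured operation
    // instead of collapsing to a zero-length span at publish time.
    startTime: new Date(Date.now() - message.duration_ms),
    attributes: {
      'payment.gateway': message.gateway,
      'payment.currency': message.currency,
      'payment.amount_minor': message.amount_minor,
      'payment.duration_ms': message.duration_ms,
    },
  });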

  if (message.phase === 'error') {
    span.setStatus({ code: SpanStatusCode.ERROR, message: message.error_name });
  }

  span.end();
});

Production guidance:

| Practice | Why it matters |
|---|---|
| Create channels once at module top level | Dynamic channel lookup in hot code adds avoidable overhead and weakens naming discipline |
| Check hasSubscribers before building expensive messages | It avoids allocation and serialization work when telemetry is disabled |
| Keep messages plain data | Subscribers may run in a different release or process shape than the publisher expects |
| Publish only bounded metadata | Do not put request bodies, tokens, SQL text with literals, or customer secrets into diagnostic messages |
| Document versioned message schemas | Telemetry consumers break when a field silently changes type or unit |

Footguns:

| Footgun | Failure mode | Fix |
|---|---|---|
| Subscriber throws | Subscribers run synchronously on the publisher's path, and in current Node releases an exception thrown by one is reported as an uncaught exception that can terminate the process | Wrap subscriber code and treat telemetry errors as telemetry failures |
| Heavy message assembly before hasSubscribers | Instrumentation changes latency even when unused | Guard before expensive work |
| Per-request channel names | Cardinality explosion and memory churn | Encode request data in messages, not channel names |
| Mixing units | Profiles show milliseconds while histograms use nanoseconds | Put unit suffixes in field names |
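The "wrap subscriber code" fix can be sketched as a small higher-order function. This is an illustrative pattern rather than anything Node provides: `safeSubscriber`, `getTelemetryErrorCount`, and the channel name `acme.demo.safe-subscriber` are hypothetical names introduced here.

```javascript
import diagnosticsChannel from 'node:diagnostics_channel';

let telemetryErrors = 0;

// Wraps a subscriber so that its failures are counted instead of escaping
// into the publisher's process.
export function safeSubscriber(handler) {
  return (message, name) => {
    try {
      handler(message, name);
    } catch {
      telemetryErrors += 1; // a telemetry failure, not an application failure
    }
  };
}

export function getTelemetryErrorCount() {
  return telemetryErrors;
}

// Demonstration with a hypothetical channel and a subscriber that always fails.
const demo = diagnosticsChannel.channel('acme.demo.safe-subscriber');
diagnosticsChannel.subscribe(
  'acme.demo.safe-subscriber',
  safeSubscriber(() => {
    throw new Error('exporter unavailable');
  }),
);
demo.publish({ phase: 'end', duration_ms: 12 });
```

In a real service the counter would more likely feed a metric, so that a broken subscriber is visible without being able to take the publisher down.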

TracingChannel for operation lifecycles

diagnostics_channel.tracingChannel(name) groups lifecycle channels for a single operation. It is useful for bridge instrumentation that wants start, end, async start, async end, and error semantics without hardcoding five independent names.

import diagnosticsChannel from 'node:diagnostics_channel';

const renderTrace = diagnosticsChannel.tracingChannel('acme.template.render');

export function renderTemplate(template, data) {
  return renderTrace.traceSync(() => {
    return template.render(data);
  }, {
    template_name: template.name,
    key_count: Object.keys(data).length,
  });
}

Use TracingChannel for reusable operation boundaries. Use plain channels for one-way events such as cache invalidation, retry scheduled, connection created, or feature flag evaluated.
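On the consuming side, a subscriber attaches handlers for the lifecycle events in one call. The sketch below assumes a Node version that ships `tracingChannel` (the API is still marked experimental); the `events` array and the `invoice` template name exist only for illustration.

```javascript
import diagnosticsChannel from 'node:diagnostics_channel';

const renderTrace = diagnosticsChannel.tracingChannel('acme.template.render');
const events = [];

// One subscribe() call registers handlers for all five lifecycle channels.
renderTrace.subscribe({
  start(context) { events.push(`start:${context.template_name}`); },
  end(context) { events.push(`end:${context.template_name}`); },
  asyncStart() {}, // unused for synchronous operations
  asyncEnd() {},   // unused for synchronous operations
  error(context) { events.push(`error:${context.error.message}`); },
});

// Successful call: start, then end.
const html = renderTrace.traceSync(() => '<p>ok</p>', { template_name: 'invoice' });

// Failing call: start, then error, then end, and the error is rethrown.
let rethrown = false;
try {
  renderTrace.traceSync(() => {
    throw new Error('missing partial');
  }, { template_name: 'invoice' });
} catch {
  rethrown = true;
}
```

The same context object is passed to every handler for one operation, which is what lets a bridge start a span in `start` and close it in `end`.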

Async context and trace correlation

Modern Node async context tracking should prefer the stable AsyncLocalStorage API for request-scoped values. Low-level async_hooks.createHook() remains powerful, but it is experimental, adds overhead to every async resource, and terminates the process if a hook callback throws. Use it for diagnostics experiments and framework internals, not as the default app-level context primitive.

// request-context.mjs
import { AsyncLocalStorage } from 'node:async_hooks';
import crypto from 'node:crypto';

export const requestContext = new AsyncLocalStorage();

export function withRequestContext(req, res, next) {
  const context = {
    request_id: req.headers['x-request-id'] || crypto.randomUUID(),
    route: req.route?.path || req.url,
    started_at_ms: Date.now(),
  };
  requestContext.run(context, next);
}

export function log(fields) {
  const context = requestContext.getStore();
  console.log(JSON.stringify({ ...context, ...fields }));
}

Context troubleshooting:

| Symptom | Likely cause | Probe |
|---|---|---|
| Request ID disappears after queue callback | Custom queue did not preserve async context | Wrap task execution in AsyncResource or capture with AsyncLocalStorage.snapshot() |
| All requests share one context | enterWith() used in a shared event handler | Prefer run() around a request boundary |
| Trace parent missing in manual span | SDK initialized too late or active context absent | Log active span before and after the boundary |
| Worker thread has no context | Async IDs and context are per thread | Propagate trace and request fields in worker messages |
| Context survives after request | Long-lived promise or timer retained request store | Cancel timers and avoid storing large objects in the context |
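The first row of the table, a custom queue that loses context, can be reproduced and fixed in a few lines. This is a sketch under simplifying assumptions: the in-memory `queue`, `enqueue`, and `drain` are hypothetical stand-ins for a real scheduler, and `AsyncResource.bind()` is used here as one of the two options the table names.

```javascript
import { AsyncLocalStorage, AsyncResource } from 'node:async_hooks';

const requestContext = new AsyncLocalStorage();
const queue = [];

// Binding at enqueue time captures the caller's async context, so the task
// later runs as if it were still inside the original requestContext.run().
function enqueue(task) {
  queue.push(AsyncResource.bind(task));
}

// Runs later, outside any request context.
function drain() {
  const results = [];
  while (queue.length > 0) {
    results.push(queue.shift()());
  }
  return results;
}

requestContext.run({ request_id: 'req-1' }, () => {
  enqueue(() => requestContext.getStore()?.request_id);
});
requestContext.run({ request_id: 'req-2' }, () => {
  enqueue(() => requestContext.getStore()?.request_id);
});

const outsideStore = requestContext.getStore(); // no active context here
const seen = drain(); // each task still sees its own request
```

Pushing the raw `task` instead of the bound one makes every task see whatever context happens to be active when `drain()` runs, which is the failure the table describes.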

OpenTelemetry context propagation depends on a context manager. In normal Node applications the SDK usually installs one, but custom setups must ensure a context manager is enabled before spans are created. Initialize instrumentation before application modules, commonly with node --import ./instrumentation.mjs app.js on supported Node versions.

OpenTelemetry bridge discipline

For applications, initialize the SDK. For libraries, depend only on the OpenTelemetry API and let the host application choose SDK, exporters, sampling, and resources.

// instrumentation.mjs
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

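process.on('SIGTERM', async () => {
  try {
    // Flush pending telemetry; a failed flush must not block shutdown.
    await sdk.shutdown();
  } catch (error) {
    console.error('OpenTelemetry shutdown failed', error);
  } finally {
    process.exit(0);
  }
});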

Operational checklist:

| Check | Expected state |
|---|---|
| Bootstrap order | Instrumentation file loads before app imports HTTP, database, queue, and RPC clients |
| Resource identity | service.name, service.version, deployment environment, instance ID, and region are set |
| Sampling policy | Head sampling is explicit for high-volume services |
| Propagation | Incoming and outgoing trace headers are tested at service boundaries |
| Shutdown | SDK flushes before process exit during rolling deploys |
| Cardinality | User ID, URL with IDs, SQL literals, and raw error messages are not metric labels |

Footgun: default all-span sampling is useful while learning but expensive at production scale. Set the sampling policy deliberately before a high-traffic rollout.
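The cardinality check is easier to enforce when URLs are normalized before they become labels. The following is a rough sketch, not part of OpenTelemetry: `normalizeRouteLabel` is a hypothetical helper, and it only recognizes numeric and UUID path segments. Framework route templates are a better source when they are available.

```javascript
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

// Replaces ID-like path segments with a placeholder and drops the query string,
// so that /users/1 and /users/2 produce the same metric label.
export function normalizeRouteLabel(url) {
  const pathname = url.split('?')[0];
  return pathname
    .split('/')
    .map((segment) => (UUID.test(segment) || NUMERIC.test(segment) ? ':id' : segment))
    .join('/');
}
```

For example, `/users/42/orders/9f1c2b3a-1111-2222-3333-444455556666?expand=items` becomes `/users/:id/orders/:id`.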

Inspector surfaces

The inspector exposes V8 and runtime state through the Chrome DevTools Protocol. It can be used interactively with --inspect or programmatically with node:inspector.

| Mode | Command or API | Use |
|---|---|---|
| Interactive debug | node --inspect app.js | Breakpoints, heap snapshots, live inspection |
| Break on start | node --inspect-brk app.js | Startup bugs and module initialization |
| Programmatic CPU profile | node:inspector/promises Profiler.start and Profiler.stop | Short profiles around a controlled workload |
| Programmatic heap snapshot | HeapProfiler.takeHeapSnapshot | Capture retainers during staged leak reproduction |
| Production emergency | Bind inspector only to loopback or an isolated debug endpoint | Short diagnostic window with access controls |

Programmatic CPU profile:

import { Session } from 'node:inspector/promises';
import fs from 'node:fs';

export async function captureCpuProfile(path, fn) {
  const session = new Session();
  session.connect();
  try {
    await session.post('Profiler.enable');
    await session.post('Profiler.start');
    const result = await fn();
    const { profile } = await session.post('Profiler.stop');
    fs.writeFileSync(path, JSON.stringify(profile));
    return result;
  } finally {
    session.disconnect();
  }
}
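A saved profile can be triaged without opening DevTools. The sketch below assumes the Chrome DevTools Protocol profile shape (`nodes`, `samples`, and `timeDeltas` in microseconds) and attributes each time delta to the function that was on top of the stack for that sample. That attribution is an approximation. `topSelfTime` is a hypothetical helper name.

```javascript
// Aggregates sampled self time per function from a parsed .cpuprofile object.
export function topSelfTime(profile, limit = 5) {
  const byId = new Map(profile.nodes.map((node) => [node.id, node]));
  const totals = new Map();

  profile.samples.forEach((nodeId, index) => {
    const frame = byId.get(nodeId)?.callFrame;
    if (!frame) return;
    // The same function can appear under several call paths; key by location
    // so those nodes are summed together. CDP line numbers are zero-based.
    const key = `${frame.functionName || '(anonymous)'} ${frame.url}:${frame.lineNumber + 1}`;
    const deltaMicros = profile.timeDeltas[index] ?? 0;
    totals.set(key, (totals.get(key) ?? 0) + deltaMicros);
  });

  return [...totals]
    .map(([frame, micros]) => ({ frame, self_ms: micros / 1000 }))
    .sort((a, b) => b.self_ms - a.self_ms)
    .slice(0, limit);
}
```

Feeding it `JSON.parse(fs.readFileSync(path, 'utf8'))` for a file written by `captureCpuProfile` gives a quick list of the hottest functions to compare across releases.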

Programmatic heap snapshot:

import { Session } from 'node:inspector/promises';
import fs from 'node:fs';

export async function writeInspectorHeapSnapshot(path) {
  const session = new Session();
  const fd = fs.openSync(path, 'w');

  session.connect();
  session.on('HeapProfiler.addHeapSnapshotChunk', (message) => {
    fs.writeSync(fd, message.params.chunk);
  });

  try {
    await session.post('HeapProfiler.takeHeapSnapshot');
  } finally {
    session.disconnect();
    fs.closeSync(fd);
  }
}

Security rules:

| Rule | Reason |
|---|---|
| Never expose inspector publicly | Inspector can evaluate code and inspect secrets |
| Avoid long-lived production inspector sessions | Profiles, heap snapshots, and console object retention can change process behavior |
| Capture artifacts to encrypted storage | Profiles and dumps may contain PII, tokens, SQL strings, and environment values |
| Prefer time-boxed debug pods or replicas | Keep normal fleet instances predictable |

Trace events

Trace events produce Chrome trace compatible timelines. They are valuable when a CPU profile alone cannot answer why work happened in a particular order.

CLI capture:

node \
  --trace-events-enabled \
  --trace-event-categories node.perf,node.async_hooks,v8 \
  --trace-event-file-pattern 'trace-${pid}-${rotation}.json' \
  server.mjs

Programmatic capture:

import { createTracing, getEnabledCategories } from 'node:trace_events';

const tracing = createTracing({
  categories: ['node.perf', 'node.async_hooks', 'v8'],
});

tracing.enable();
console.log(getEnabledCategories());

setTimeout(() => {
  tracing.disable();
}, 30_000);

Trace event use cases:

| Use case | Categories | Notes |
|---|---|---|
| Async causality | node.async_hooks | High volume in busy services |
| Runtime performance | node.perf | Useful with perf_hooks marks and measures |
| V8 behavior | v8 | Correlate GC and compilation with latency |
| Startup analysis | node.bootstrap, v8 | Compare cold starts across releases |

Troubleshooting:

| Problem | Cause | Fix |
|---|---|---|
| Trace file too large | Too many categories or too long a window | Use a short capture window and narrow categories |
| No useful app labels | Only runtime events were captured | Add perf marks, diagnostics_channel messages, or trace spans around app operations |
| Timeline hard to read | Multiple processes write overlapping files | Include ${pid} in the file pattern |
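For the "no useful app labels" row, user timing marks are the lightest fix, since the node.perf trace category is documented as including Performance API marks and measures. The helper below is a minimal sketch for synchronous operations only. `timed` and the `acme.render` label are hypothetical names.

```javascript
import { performance } from 'node:perf_hooks';

// Wraps a synchronous operation in two user timing marks and one measure.
// Mark names are derived from the label, so overlapping calls that share a
// label would collide; use unique labels for concurrent work.
export function timed(name, fn) {
  const startMark = `${name}:start`;
  const endMark = `${name}:end`;
  performance.mark(startMark);
  try {
    return fn();
  } finally {
    performance.mark(endMark);
    performance.measure(name, startMark, endMark);
    // Marks are no longer needed once the measure exists.
    performance.clearMarks(startMark);
    performance.clearMarks(endMark);
  }
}

const rendered = timed('acme.render', () => 'done');
```

Measures stay in the performance timeline until `performance.clearMeasures()` is called, so a long-running service should clear or observe them periodically.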

Diagnostic reports

Diagnostic reports are JSON snapshots of process state. They are useful during crashes, fatal errors, hangs, and manual incident capture.

Enable report generation:

node \
  --report-on-fatalerror \
  --report-uncaught-exception \
  --report-on-signal \
  --report-signal=SIGUSR2 \
  --report-dir=/var/log/node-reports \
  --report-exclude-env \
  --report-exclude-network \
  server.mjs

Manual capture:

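import fs from 'node:fs';
import path from 'node:path';

const REPORT_DIR = '/var/log/node-reports';

export function writeIncidentReport(reason) {
  const safeReason = reason.replace(/[^a-z0-9_.-]/gi, '_').slice(0, 80);
  // writeReport() treats the file name as relative to process.report.directory,
  // so set the directory explicitly instead of passing an absolute path.
  process.report.directory = REPORT_DIR;
  const fileName = `report-${process.pid}-${safeReason}.json`;
  process.report.writeReport(fileName);
  const reportPath = path.join(REPORT_DIR, fileName);
  fs.chmodSync(reportPath, 0o600);
  return reportPath;
}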

Report reading guide:

| Section | What to inspect |
|---|---|
| Header | Node version, command line, platform, report trigger |
| JavaScript stack | The active stack when the report was created |
| Native stack | C++ and runtime frames, useful for native add-ons |
| Heap statistics | Heap limit, used heap, available heap |
| Resource usage | CPU time, RSS, page faults, file descriptors |
| libuv handles | Active timers, TCP handles, async resources |
| Environment variables | Usually exclude in production artifacts |

Reports are not heap snapshots. They summarize memory and handles but do not show object retainer paths. If the problem is a JavaScript leak, move from report to heap snapshot. If the process aborts in native code, move from report to core dump.
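A report is plain JSON, so first-pass triage can be scripted. The sketch below reads a few fields by the names used in Node's report format (`header.nodejsVersion`, `header.trigger`, `javascriptHeap.usedMemory`, the `libuv` handle array). Treat those names as assumptions to verify against reports from your own Node version. `summarizeReport` is a hypothetical helper.

```javascript
// Reduces a diagnostic report to the fields most useful for first-pass triage.
// Pass a parsed report file, or omit the argument to inspect the live process.
export function summarizeReport(report = process.report.getReport()) {
  const handles = Array.isArray(report.libuv) ? report.libuv : [];
  const handleCounts = {};
  for (const handle of handles) {
    handleCounts[handle.type] = (handleCounts[handle.type] ?? 0) + 1;
  }
  return {
    node_version: report.header?.nodejsVersion,
    pid: report.header?.processId,
    trigger: report.header?.trigger,
    js_stack_message: report.javascriptStack?.message,
    heap_used_bytes: report.javascriptHeap?.usedMemory,
    handle_counts: handleCounts, // e.g. many timers or tcp handles during a hang
  };
}
```

A steadily growing count for one handle type across successive reports is a cheap leak signal that does not require a heap snapshot.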

V8 heap snapshots and heap profiles

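v8.writeHeapSnapshot() writes a heap snapshot for the current isolate. It is synchronous and blocks the event loop for a time that grows with heap size. Node's documentation also warns that creating a snapshot needs memory of about twice the heap size at that moment, so the call can itself get the process killed for exceeding its memory limit. That makes it dangerous under memory pressure.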

import v8 from 'node:v8';
import fs from 'node:fs';

export function captureHeapSnapshot(label) {
  const path = v8.writeHeapSnapshot(`/var/log/node-heap/${label}-${process.pid}.heapsnapshot`);
  fs.chmodSync(path, 0o600);
  return path;
}

Production guidance:

| Scenario | Safer approach |
|---|---|
| Suspected slow leak | Capture snapshots on a canary at low traffic before and after growth |
| Near OOM | Prefer diagnostic report and external memory metrics first |
| Worker thread leak | Trigger snapshots per worker, not only from the main thread |
| Sensitive tenant data | Encrypt artifact storage and enforce short retention |
| Huge heap | Use heap sampling or isolate a replica with a smaller workload |
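The "near OOM" row can be turned into a guard that runs before any snapshot is attempted. This is a sketch under stated assumptions: `snapshotRisk` is a hypothetical helper, the default headroom factor of 2 follows the rule of thumb in Node's documentation that a snapshot needs about twice the heap size, and `os.freemem()` reports host memory, so a container with a lower memory limit should pass its own `freeMemoryBytes`.

```javascript
import os from 'node:os';
import v8 from 'node:v8';

// Estimates whether there is enough free memory to write a heap snapshot.
export function snapshotRisk({
  usedHeapBytes = v8.getHeapStatistics().used_heap_size,
  freeMemoryBytes = os.freemem(), // host-level; override inside containers
  headroomFactor = 2,
} = {}) {
  const requiredBytes = usedHeapBytes * headroomFactor;
  return {
    required_bytes: requiredBytes,
    free_bytes: freeMemoryBytes,
    safe: freeMemoryBytes >= requiredBytes,
  };
}
```

When `safe` is false, falling back to a diagnostic report keeps the incident capture from becoming the cause of the crash.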

Heap snapshot workflow:

  1. Capture a baseline after warmup.
  2. Run a known workload.
  3. Capture a second snapshot.
  4. Compare retained size and object counts by constructor.
  5. Inspect dominators and retaining paths.
  6. Validate with a fixed build under the same workload.

Core dumps

Core dumps are for post-mortem debugging when JavaScript level artifacts are insufficient, especially fatal native crashes, aborts, C++ add-ons, corrupted heap, illegal instruction, or a process stuck below JavaScript.

Node flags and OS setup:

ulimit -c unlimited

node \
  --abort-on-uncaught-exception \
  --report-on-fatalerror \
  --report-dir=/var/log/node-reports \
  server.mjs

Container notes:

| Concern | Production answer |
|---|---|
| Core file location | Configure host or container runtime core pattern deliberately |
| File size | Expect large files, often near process virtual memory size |
| Secrets | Treat cores as highly sensitive artifacts |
| Symbols | Preserve Node binary, native add-on builds, source maps, and release metadata |
| Runtime limits | Ensure ulimit, security profile, and writable dump path allow capture |

Native post-mortem checklist:

  1. Record Node version, exact binary, container image digest, and architecture.
  2. Preserve the core, executable, native add-ons, and generated reports together.
  3. Inspect native stack with lldb, gdb, or platform debugger.
  4. If using llnode, match the Node and V8 version closely.
  5. Correlate crash time with OTel traces, logs, kernel messages, and deployment events.

Incident recipes

Recipe: unexplained p99 spike

  1. Confirm p99 increase in request histogram.
  2. Check event loop delay histogram and event loop utilization, as covered in the companion note 15 Performance Engineering Benchmarking Flamegraphs GC and Event Loop Latency.
  3. Capture a 30 second CPU profile on one hot replica.
  4. Capture trace events with node.perf and v8 if CPU profile shows runtime noise.
  5. Compare spans for slow and normal requests.
  6. Look for sync I/O, JSON serialization, regex backtracking, compression, crypto, large GC pauses, and hot promise churn.
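Step 2 of this recipe can be sketched with `perf_hooks` alone. The helper below samples event loop delay percentiles (reported by Node in nanoseconds) and event loop utilization over successive windows. `startLoopMonitor` and the field names are hypothetical, and the 20 ms default resolution is an arbitrary choice.

```javascript
import { monitorEventLoopDelay, performance } from 'node:perf_hooks';

// Starts an event loop monitor; each sample() call reports one window.
export function startLoopMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  let lastUtilization = performance.eventLoopUtilization();
  histogram.enable();

  return {
    sample() {
      const current = performance.eventLoopUtilization();
      // Passing two readings yields the utilization between them.
      const delta = performance.eventLoopUtilization(current, lastUtilization);
      lastUtilization = current;
      const reading = {
        delay_p50_ms: histogram.percentile(50) / 1e6, // nanoseconds to milliseconds
        delay_p99_ms: histogram.percentile(99) / 1e6,
        delay_max_ms: histogram.max / 1e6,
        utilization: delta.utilization, // 0 is idle, 1 is fully busy
      };
      histogram.reset(); // start a fresh window for the next sample
      return reading;
    },
    stop() {
      histogram.disable();
    },
  };
}
```

High utilization with a rising p99 delay points at CPU-bound JavaScript, while a p99 request spike with a quiet loop suggests the time is being spent in a downstream dependency.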

Recipe: production memory leak

  1. Split RSS, heap used, external, and array buffers.
  2. If heap grows, capture staged heap snapshots.
  3. If external grows, inspect buffers, native add-ons, compression, TLS, image processing, and WASM.
  4. Capture diagnostic report when memory crosses threshold.
  5. For fatal OOM, enable --report-on-fatalerror and consider core dump on a canary only.
  6. Validate fix with allocation rate, retained objects, and steady-state RSS after warmup.
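Step 1 of this recipe, splitting the memory components, can be sketched from two `process.memoryUsage()` readings. Node documents `arrayBuffers` as already included in `external`, so the helper subtracts it to avoid counting buffers twice. `classifyGrowth` and its field names are hypothetical.

```javascript
// Compares two process.memoryUsage() readings and names the component that
// grew the most, which decides whether steps 2 or 3 of the recipe apply.
export function classifyGrowth(before, after) {
  const delta = {
    heap_used: after.heapUsed - before.heapUsed,
    array_buffers: after.arrayBuffers - before.arrayBuffers,
    // external includes arrayBuffers, so remove them to isolate other native memory
    other_external:
      (after.external - after.arrayBuffers) - (before.external - before.arrayBuffers),
  };
  const [component, bytes] = Object.entries(delta).sort((a, b) => b[1] - a[1])[0];
  return { component, bytes, rss_delta: after.rss - before.rss, delta };
}
```

If `rss_delta` is much larger than the sum of the three components, the growth is outside what V8 tracks, which points toward native add-ons, allocator fragmentation, or thread stacks.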

Recipe: trace context breaks at queue boundary

  1. Log current trace ID before enqueue.
  2. Inject W3C trace context into the message metadata.
  3. Extract context in the consumer before creating the consumer span.
  4. Use AsyncLocalStorage for process-local request metadata.
  5. If using a custom in-process queue, preserve context with AsyncResource or AsyncLocalStorage.snapshot().
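Steps 2 and 3 can be illustrated with the W3C `traceparent` header format (`version-traceid-spanid-flags`). The code below is a dependency-free sketch for understanding the wire format. In a real service the OpenTelemetry propagation API should do this work. `injectTraceContext` and `extractTraceContext` are hypothetical helpers, and the parser only accepts the four-field layout.

```javascript
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: copy the active trace identity into the message metadata.
export function injectTraceContext(metadata, { traceId, spanId, sampled }) {
  const flags = sampled ? '01' : '00';
  return { ...metadata, traceparent: `00-${traceId}-${spanId}-${flags}` };
}

// Consumer side: recover the parent before creating the consumer span.
export function extractTraceContext(metadata) {
  const match = TRACEPARENT.exec(metadata?.traceparent ?? '');
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid in the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Returning `null` for a malformed header lets the consumer start a new trace instead of attaching spans to a corrupt parent.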

Recipe: process hangs but does not crash

  1. Send report signal if enabled.
  2. Capture active handles and resource usage from the report.
  3. Capture CPU profile if the event loop is still running.
  4. Use trace events around suspected categories if reproducible.
  5. If frozen below JavaScript, use OS debugger or core capture.

Production artifact policy

| Artifact | Sensitivity | Retention | Access |
|---|---|---|---|
| Logs | Medium to high | Short by default, longer for audit streams | Service and incident responders |
| Traces | Medium | Sampling and retention by SLO value | Engineers for owning service |
| CPU profiles | High | Incident window only | Limited responders |
| Heap snapshots | Very high | Short, encrypted | Explicit approval |
| Diagnostic reports | High | Incident window only | Limited responders |
| Core dumps | Critical | Short, encrypted, heavily audited | Small debug group |

Do not ship diagnostic artifacts to general log pipelines unless they have been scrubbed and size bounded. A heap snapshot can contain request bodies, tokens, session cookies, SQL strings, private keys, and user data.
