Observability Diagnostics Inspector Tracing Profiling and Core Dumps
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/nodejs-v8-runtime-engineering/14 Observability Diagnostics Inspector Tracing Profiling and Core Dumps.md.
Purpose: Build a production diagnostic playbook for Node.js services that connects Node.js V8 Runtime Engineering, 15 Performance Engineering Benchmarking Flamegraphs GC and Event Loop Latency, diagnostics channels, async context, inspector sessions, trace events, diagnostic reports, heap snapshots, CPU profiles, and native core dumps into one evidence-first workflow.
Operating model
Node.js diagnostics are strongest when each tool is treated as a different camera angle, not as a replacement for the others. Logs tell what application code believed was happening. Metrics tell how often and how severely it happened. Traces tell request causality across async boundaries. Inspector profiles tell where V8 and JavaScript spent CPU or retained heap. Trace events expose runtime timelines for libuv, V8, async hooks, and Node categories. Diagnostic reports snapshot process state. Core dumps preserve native memory after a hard abort.
Use this decision path:
| Symptom | First evidence | Deep evidence | Last resort |
|---|---|---|---|
| Slow endpoint | Request histogram, trace span, event loop delay | CPU profile, trace events | Native profile or core if process wedges |
| Memory growth | RSS, heap used, external memory, allocation rate | Heap snapshot, heap profile, diagnostic report | Core dump with llnode or native debugger |
| Crash | Error log, exit code, diagnostic report | Core dump, native stack, fatal error report | Reproduce under debug build |
| Context loss | Trace parent gaps, missing request ID | AsyncLocalStorage probes, diagnostics_channel spans | Async hooks investigation |
| Event loop stall | perf_hooks histogram, blocked request spans | CPU profile, sync I/O trace, flamegraph | Core if stuck in native code |
| Library black box | diagnostics_channel subscription | OpenTelemetry instrumentation wrapper | Inspector breakpoint in staging |
The field rule: start low overhead and always capture a timestamp, Node version, process ID, container ID, release SHA, command line, and clock source. A profile without the deployment identity is often only trivia.
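A minimal sketch of the field rule, assuming the release SHA and container ID are exposed through environment variables. The names `RELEASE_SHA` and `CONTAINER_ID` and the helper `captureArtifactIdentity` are hypothetical; substitute whatever your deploy tooling provides.

```javascript
import os from 'node:os';
import { performance } from 'node:perf_hooks';

// Builds the identity record to store next to every diagnostic artifact.
// RELEASE_SHA and CONTAINER_ID are assumed environment variable names.
function captureArtifactIdentity(env = process.env) {
  return {
    captured_at: new Date().toISOString(),
    node_version: process.version,
    pid: process.pid,
    hostname: os.hostname(),
    container_id: env.CONTAINER_ID ?? null,
    release_sha: env.RELEASE_SHA ?? null,
    command_line: [process.execPath, ...process.execArgv, ...process.argv.slice(1)],
    // Record both clocks so wall-clock logs and monotonic profiles can be aligned.
    clock: {
      wall_ms: Date.now(),
      monotonic_ms: performance.now(),
      time_origin_ms: performance.timeOrigin,
    },
  };
}
```

Writing this record as a sidecar file, or as the first line of the artifact's metadata, keeps the profile tied to a specific deployment.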
Instrumentation layers
| Layer | Node primitive | Production use | Risk |
|---|---|---|---|
| Correlation context | AsyncLocalStorage from node:async_hooks | Request IDs, tenant IDs, trace correlation | Leaking context with enterWith() or custom async boundaries |
| Library diagnostics | node:diagnostics_channel | Publish structured events without hard dependency on telemetry vendors | Synchronous subscriber cost |
| Manual telemetry | OpenTelemetry API | Spans, metrics, logs around business operations | SDK must initialize before instrumented modules |
| Runtime metrics | node:perf_hooks, process.memoryUsage(), v8.getHeapStatistics() | Service SLOs and saturation signals | High cardinality labels and noisy intervals |
| Inspector | node:inspector and DevTools protocol | CPU profiles, heap snapshots, protocol automation | Security exposure and runtime overhead |
| Trace events | node:trace_events or CLI flags | Runtime timelines visible in Chrome trace viewer | Large files and category noise |
| Reports | process.report and report CLI flags | Crash and hang snapshots | Sensitive environment and network metadata unless excluded |
| Core dumps | OS core plus Node flags | Native post-mortem when process aborts | Huge artifacts with secrets and memory contents |
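As an illustration of the runtime metrics layer, the sketch below samples memory and event loop delay using only built-in modules. The function name `sampleRuntimeMetrics` and the field names are assumptions; a real service would feed these values into its metrics client with low-cardinality labels.

```javascript
import { monitorEventLoopDelay } from 'node:perf_hooks';
import v8 from 'node:v8';

// The histogram samples loop delay in the background once enabled.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Returns one sample and resets the histogram so each interval stands alone.
function sampleRuntimeMetrics() {
  const mem = process.memoryUsage();
  const heap = v8.getHeapStatistics();
  const sample = {
    rss_bytes: mem.rss,
    heap_used_bytes: mem.heapUsed,
    external_bytes: mem.external,
    array_buffers_bytes: mem.arrayBuffers,
    heap_limit_bytes: heap.heap_size_limit,
    // Histogram values are in nanoseconds; convert to milliseconds.
    loop_delay_p99_ms: loopDelay.percentile(99) / 1e6,
    loop_delay_max_ms: loopDelay.max / 1e6,
  };
  loopDelay.reset();
  return sample;
}
```

Calling `sampleRuntimeMetrics()` on a fixed interval, for example every 10 seconds, gives saturation signals without per-request overhead.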
Diagnostics channel as the library seam
diagnostics_channel is the right API when a library wants to publish diagnostic data without depending on OpenTelemetry, a logger, or a metrics backend. Channel names should be stable, documented, and namespaced by package or subsystem. Message shape is part of the library contract.
// payment-diagnostics.mjs
import diagnosticsChannel from 'node:diagnostics_channel';
import { performance } from 'node:perf_hooks';
const paymentAttempt = diagnosticsChannel.channel('acme.payment.attempt');
export async function chargeCard(input, gateway) {
  const start = performance.now();
  const message = {
    gateway: gateway.name,
    currency: input.currency,
    amount_minor: input.amountMinor,
  };
  if (paymentAttempt.hasSubscribers) {
    paymentAttempt.publish({ ...message, phase: 'start', time_ms: start });
  }
  try {
    const result = await gateway.charge(input);
    if (paymentAttempt.hasSubscribers) {
      paymentAttempt.publish({
        ...message,
        phase: 'end',
        status: result.status,
        duration_ms: performance.now() - start,
      });
    }
    return result;
  } catch (error) {
    if (paymentAttempt.hasSubscribers) {
      paymentAttempt.publish({
        ...message,
        phase: 'error',
        error_name: error.name,
        duration_ms: performance.now() - start,
      });
    }
    throw error;
  }
}
Subscriber pattern:
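// telemetry-bootstrap.mjs
import diagnosticsChannel from 'node:diagnostics_channel';
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-diagnostics');
diagnosticsChannel.subscribe('acme.payment.attempt', (message) => {
  if (message.phase !== 'end' && message.phase !== 'error') return;
  // Subscribers run synchronously inside publish(), so contain telemetry failures here.
  try {
    const endTime = Date.now();
    // The span is created after the operation finished; backdate its start
    // by the measured duration so the span does not have zero length.
    const span = tracer.startSpan('payment.gateway.charge', {
      startTime: endTime - message.duration_ms,
      attributes: {
        'payment.gateway': message.gateway,
        'payment.currency': message.currency,
        'payment.amount_minor': message.amount_minor,
        'payment.duration_ms': message.duration_ms,
      },
    });
    if (message.phase === 'error') {
      span.setStatus({ code: SpanStatusCode.ERROR, message: message.error_name });
    }
    span.end(endTime);
  } catch (error) {
    console.error('payment telemetry subscriber failed', error);
  }
});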
Production guidance:
| Practice | Why it matters |
|---|---|
| Create channels once at module top level | Dynamic channel lookup in hot code adds avoidable overhead and weakens naming discipline |
| Check hasSubscribers before building expensive messages | It avoids allocation and serialization work when telemetry is disabled |
| Keep messages plain data | Subscribers may run in a different release or process shape than the publisher expects |
| Publish only bounded metadata | Do not put request bodies, tokens, SQL text with literals, or customer secrets into diagnostic messages |
| Document versioned message schemas | Telemetry consumers break when a field silently changes type or unit |
Footguns:
| Footgun | Failure mode | Fix |
|---|---|---|
| Subscriber throws | Subscribers run synchronously inside publish(); a thrown error surfaces as an uncaught exception and can take down the process | Wrap subscriber code and treat telemetry errors as telemetry failures |
| Heavy message assembly before hasSubscribers | Instrumentation changes latency even when unused | Guard before expensive work |
| Per-request channel names | Cardinality explosion and memory churn | Encode request data in messages, not channel names |
| Mixing units | Profiles show milliseconds while histograms use nanoseconds | Put unit suffixes in field names |
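The "subscriber throws" footgun can be contained with a small wrapper. This is a sketch only: `subscribeSafely` and the `telemetryFailures` counter are hypothetical names, and the counter stands in for whatever metric a service uses to track telemetry health. The Node.js documentation states that errors thrown in a message handler trigger `'uncaughtException'`, which is why the wrapper catches them.

```javascript
import diagnosticsChannel from 'node:diagnostics_channel';

// Counts subscriber failures instead of letting them escape (hypothetical metric).
let telemetryFailures = 0;

// Subscribes a handler whose exceptions never leave the telemetry path.
// Returns a function that removes the subscription again.
function subscribeSafely(channelName, handler) {
  const wrapped = (message, name) => {
    try {
      handler(message, name);
    } catch {
      telemetryFailures += 1;
    }
  };
  diagnosticsChannel.subscribe(channelName, wrapped);
  return () => diagnosticsChannel.unsubscribe(channelName, wrapped);
}
```

A failing exporter or a malformed message then shows up as a rising failure count rather than as an application crash.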
TracingChannel for operation lifecycles
diagnostics_channel.tracingChannel(name) groups lifecycle channels for a single operation. It is useful for bridge instrumentation that wants start, end, async start, async end, and error semantics without hardcoding five independent names.
import diagnosticsChannel from 'node:diagnostics_channel';
const renderTrace = diagnosticsChannel.tracingChannel('acme.template.render');
export function renderTemplate(template, data) {
  return renderTrace.traceSync(() => {
    return template.render(data);
  }, {
    template_name: template.name,
    key_count: Object.keys(data).length,
  });
}
Use TracingChannel for reusable operation boundaries. Use plain channels for one-way events such as cache invalidation, retry scheduled, connection created, or feature flag evaluated.
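The consuming side of a TracingChannel can be sketched as follows. The channel name `example.template.render`, the `observed` array, and the `render` helper are illustrative assumptions; a real bridge would start and end spans instead of recording tuples. TracingChannel is marked experimental in the Node.js docs, so check availability on your Node version.

```javascript
import diagnosticsChannel from 'node:diagnostics_channel';

const renderTrace = diagnosticsChannel.tracingChannel('example.template.render');
const observed = [];

// One subscribe() call attaches handlers to all lifecycle channels.
// The same context object is passed to every handler of one operation.
renderTrace.subscribe({
  start(context) { observed.push(['start', context.template_name]); },
  end(context) { observed.push(['end', context.template_name]); },
  asyncStart() { /* unused for synchronous operations */ },
  asyncEnd() { /* unused for synchronous operations */ },
  error(context) { observed.push(['error', context.error.message]); },
});

// Publisher side: wrap the synchronous operation in traceSync().
function render(name, fn) {
  return renderTrace.traceSync(fn, { template_name: name });
}
```

Because every handler receives the same context object, a subscriber can stash a span on it in `start` and finish it in `end` or `error`.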
Async context and trace correlation
Modern Node applications should prefer the stable AsyncLocalStorage API for request-scoped values. Low-level async_hooks.createHook() remains powerful but has safety and performance risks. Use it for diagnostics experiments and framework internals, not as the default app-level context primitive.
// request-context.mjs
import { AsyncLocalStorage } from 'node:async_hooks';
import crypto from 'node:crypto';
export const requestContext = new AsyncLocalStorage();
export function withRequestContext(req, res, next) {
  const context = {
    request_id: req.headers['x-request-id'] || crypto.randomUUID(),
    route: req.route?.path || req.url,
    started_at_ms: Date.now(),
  };
  requestContext.run(context, next);
}
export function log(fields) {
  const context = requestContext.getStore();
  console.log(JSON.stringify({ ...context, ...fields }));
}
Context troubleshooting:
| Symptom | Likely cause | Probe |
|---|---|---|
| Request ID disappears after queue callback | Custom queue did not preserve async context | Wrap task execution in AsyncResource or capture with AsyncLocalStorage.snapshot() |
| All requests share one context | enterWith() used in a shared event handler | Prefer run() around a request boundary |
| Trace parent missing in manual span | SDK initialized too late or active context absent | Log active span before and after the boundary |
| Worker thread has no context | Async IDs and context are per thread | Propagate trace and request fields in worker messages |
| Context survives after request | Long-lived promise or timer retained request store | Cancel timers and avoid storing large objects in the context |
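The first row of the table, a custom queue that drops context, can be reproduced and fixed in a few lines. This is a sketch with hypothetical names (`tasks`, `enqueue`, `drain`); it uses `AsyncResource.bind()` to capture the caller's context at enqueue time.

```javascript
import { AsyncLocalStorage, AsyncResource } from 'node:async_hooks';

const requestContext = new AsyncLocalStorage();
const tasks = [];

// Binding at enqueue time captures the async context of the caller.
function enqueue(task) {
  tasks.push(AsyncResource.bind(task));
}

// Runs later, typically from a timer or from another request's call stack.
function drain() {
  while (tasks.length > 0) {
    const task = tasks.shift();
    task();
  }
}
```

A task pushed into `tasks` directly, without `enqueue`, would see whatever store is active when `drain()` runs, which is usually none or the wrong request's.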
OpenTelemetry context propagation depends on a context manager. In normal Node applications the SDK usually installs one, but custom setups must ensure a context manager is enabled before spans are created. Initialize instrumentation before application modules, commonly with node --import ./instrumentation.mjs app.js on supported Node versions.
OpenTelemetry bridge discipline
For applications, initialize the SDK. For libraries, depend only on the OpenTelemetry API and let the host application choose SDK, exporters, sampling, and resources.
// instrumentation.mjs
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
process.on('SIGTERM', async () => {
  try {
    // Flush buffered spans before the process exits.
    await sdk.shutdown();
  } catch (error) {
    console.error('OpenTelemetry shutdown failed', error);
  } finally {
    // Exit even if the flush failed so a rolling deploy is not blocked.
    process.exit(0);
  }
});
Operational checklist:
| Check | Expected state |
|---|---|
| Bootstrap order | Instrumentation file loads before app imports HTTP, database, queue, and RPC clients |
| Resource identity | service.name, service.version, deployment environment, instance ID, and region are set |
| Sampling policy | Head sampling is explicit for high-volume services |
| Propagation | Incoming and outgoing trace headers are tested at service boundaries |
| Shutdown | SDK flushes before process exit during rolling deploys |
| Cardinality | User ID, URL with IDs, SQL literals, and raw error messages are not metric labels |
Footgun: the SDK's default sampler records every trace (parent-based, always on). That is useful while learning but expensive at production scale. Set the sampling policy deliberately before a high-traffic rollout.
Inspector surfaces
The inspector exposes V8 and runtime state through the Chrome DevTools Protocol. It can be used interactively with --inspect or programmatically with node:inspector.
| Mode | Command or API | Use |
|---|---|---|
| Interactive debug | node --inspect app.js | Breakpoints, heap snapshots, live inspection |
| Break on start | node --inspect-brk app.js | Startup bugs and module initialization |
| Programmatic CPU profile | node:inspector/promises Profiler.start and Profiler.stop | Short profiles around a controlled workload |
| Programmatic heap snapshot | HeapProfiler.takeHeapSnapshot | Capture retainers during staged leak reproduction |
| Production emergency | Bind inspector only to loopback or an isolated debug endpoint | Short diagnostic window with access controls |
Programmatic CPU profile:
import { Session } from 'node:inspector/promises';
import fs from 'node:fs';
export async function captureCpuProfile(path, fn) {
  const session = new Session();
  session.connect();
  try {
    await session.post('Profiler.enable');
    await session.post('Profiler.start');
    const result = await fn();
    const { profile } = await session.post('Profiler.stop');
    fs.writeFileSync(path, JSON.stringify(profile));
    return result;
  } finally {
    session.disconnect();
  }
}
Programmatic heap snapshot:
import { Session } from 'node:inspector/promises';
import fs from 'node:fs';
export async function writeInspectorHeapSnapshot(path) {
  const session = new Session();
  const fd = fs.openSync(path, 'w');
  session.connect();
  // The snapshot arrives as a stream of string chunks; append each one to the file.
  session.on('HeapProfiler.addHeapSnapshotChunk', (message) => {
    fs.writeSync(fd, message.params.chunk);
  });
  try {
    await session.post('HeapProfiler.takeHeapSnapshot');
  } finally {
    session.disconnect();
    fs.closeSync(fd);
  }
}
Security rules:
| Rule | Reason |
|---|---|
| Never expose inspector publicly | Inspector can evaluate code and inspect secrets |
| Avoid long-lived production inspector sessions | Profiles, heap snapshots, and console object retention can change process behavior |
| Capture artifacts to encrypted storage | Profiles and dumps may contain PII, tokens, SQL strings, and environment values |
| Prefer time-boxed debug pods or replicas | Keep normal fleet instances predictable |
Trace events
Trace events produce Chrome trace compatible timelines. They are valuable when a CPU profile alone cannot answer why work happened in a particular order.
CLI capture:
node \
  --trace-events-enabled \
  --trace-event-categories node.perf,node.async_hooks,v8 \
  --trace-event-file-pattern 'trace-${pid}-${rotation}.json' \
  server.mjs
Programmatic capture:
import { createTracing, getEnabledCategories } from 'node:trace_events';
const tracing = createTracing({
  categories: ['node.perf', 'node.async_hooks', 'v8'],
});
tracing.enable();
console.log(getEnabledCategories());
// Time-box the capture so the trace file stays bounded.
setTimeout(() => {
  tracing.disable();
}, 30_000);
Trace event use cases:
| Use case | Categories | Notes |
|---|---|---|
| Async causality | node.async_hooks | High volume in busy services |
| Runtime performance | node.perf | Useful with perf_hooks marks and measures |
| V8 behavior | v8 | Correlate GC and compilation with latency |
| Startup analysis | node.bootstrap, v8 | Compare cold starts across releases |
Troubleshooting:
| Problem | Cause | Fix |
|---|---|---|
| Trace file too large | Too many categories or too long a window | Use a short capture window and narrow categories |
| No useful app labels | Only runtime events were captured | Add perf marks, diagnostics_channel messages, or trace spans around app operations |
| Timeline hard to read | Multiple processes write overlapping files | Include ${pid} in the file pattern |
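For the "no useful app labels" row, application phases can be labelled with User Timing marks and measures, which the Node.js docs list under the `node.perf` trace category. The helper `timed` below is a hypothetical name and only covers synchronous functions.

```javascript
import { performance } from 'node:perf_hooks';

// Wraps a synchronous function in a named User Timing measure.
function timed(name, fn) {
  const startMark = `${name}:start`;
  const endMark = `${name}:end`;
  performance.mark(startMark);
  try {
    return fn();
  } finally {
    performance.mark(endMark);
    performance.measure(name, startMark, endMark);
    // Marks are only scaffolding for the measure; drop them to bound the buffer.
    performance.clearMarks(startMark);
    performance.clearMarks(endMark);
  }
}
```

Measures stay in the performance timeline until cleared, so a long-running service should consume them with a `PerformanceObserver` or call `performance.clearMeasures()` periodically.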
Diagnostic reports
Diagnostic reports are JSON snapshots of process state. They are useful during crashes, fatal errors, hangs, and manual incident capture.
Enable report generation:
node \
  --report-on-fatalerror \
  --report-uncaught-exception \
  --report-on-signal \
  --report-signal=SIGUSR2 \
  --report-dir=/var/log/node-reports \
  --report-exclude-env \
  --report-exclude-network \
  server.mjs
Manual capture:
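import fs from 'node:fs';
import path from 'node:path';
const REPORT_DIR = '/var/log/node-reports';
export function writeIncidentReport(reason) {
  const safeReason = reason.replace(/[^a-z0-9_.-]/gi, '_').slice(0, 80);
  // writeReport() appends the file name to process.report.directory,
  // so set the directory and pass only a relative file name.
  process.report.directory = REPORT_DIR;
  const fileName = `report-${process.pid}-${safeReason}.json`;
  process.report.writeReport(fileName);
  const reportPath = path.join(REPORT_DIR, fileName);
  fs.chmodSync(reportPath, 0o600);
  return reportPath;
}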
Report reading guide:
| Section | What to inspect |
|---|---|
| Header | Node version, command line, platform, report trigger |
| JavaScript stack | The active stack when the report was created |
| Native stack | C++ and runtime frames, useful for native add-ons |
| Heap statistics | Heap limit, used heap, available heap |
| Resource usage | CPU time, RSS, page faults, file descriptors |
| libuv handles | Active timers, TCP handles, async resources |
| Environment variables | Usually exclude in production artifacts |
Reports are not heap snapshots. They summarize memory and handles but do not show object retainer paths. If the problem is a JavaScript leak, move from report to heap snapshot. If the process aborts in native code, move from report to core dump.
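A report can also be read in-process with `process.report.getReport()`. The sketch below reduces one to the fields an incident responder usually checks first. `summarizeReport` is a hypothetical helper, and the field names follow the report layout shown in the Node.js documentation; verify them against a report from your own Node version.

```javascript
// Reduces a diagnostic report object to a small triage summary.
// Accepts a parsed report so it works for live and saved reports alike.
function summarizeReport(report = process.report.getReport()) {
  const handleCounts = {};
  for (const handle of report.libuv ?? []) {
    handleCounts[handle.type] = (handleCounts[handle.type] ?? 0) + 1;
  }
  return {
    node_version: report.header?.nodejsVersion,
    trigger: report.header?.trigger,
    heap_used_bytes: report.javascriptHeap?.usedMemory,
    heap_limit_bytes: report.javascriptHeap?.memoryLimit,
    handle_counts: handleCounts,
  };
}
```

A handle count that keeps rising between two reports, for example timers or TCP handles, points at a resource leak that a heap summary alone would not reveal.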
V8 heap snapshots and heap profiles
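v8.writeHeapSnapshot() writes a heap snapshot for the current isolate. It is synchronous and blocks the event loop for a time that grows with heap size. The Node.js documentation notes that creating a snapshot needs memory of about twice the heap size at that moment. That makes it dangerous under memory pressure, where the capture itself can trigger an OOM kill.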
import v8 from 'node:v8';
import fs from 'node:fs';
export function captureHeapSnapshot(label) {
  const path = v8.writeHeapSnapshot(`/var/log/node-heap/${label}-${process.pid}.heapsnapshot`);
  fs.chmodSync(path, 0o600);
  return path;
}
Production guidance:
| Scenario | Safer approach |
|---|---|
| Suspected slow leak | Capture snapshots on a canary at low traffic before and after growth |
| Near OOM | Prefer diagnostic report and external memory metrics first |
| Worker thread leak | Trigger snapshots per worker, not only from the main thread |
| Sensitive tenant data | Encrypt artifact storage and enforce short retention |
| Huge heap | Use heap sampling or isolate a replica with a smaller workload |
Heap snapshot workflow:
- Capture a baseline after warmup.
- Run a known workload.
- Capture a second snapshot.
- Compare retained size and object counts by constructor.
- Inspect dominators and retaining paths.
- Validate with a fixed build under the same workload.
Core dumps
Core dumps are for post-mortem debugging when JavaScript level artifacts are insufficient, especially fatal native crashes, aborts, C++ add-ons, corrupted heap, illegal instruction, or a process stuck below JavaScript.
Node flags and OS setup:
ulimit -c unlimited
node \
  --abort-on-uncaught-exception \
  --report-on-fatalerror \
  --report-dir=/var/log/node-reports \
  server.mjs
Container notes:
| Concern | Production answer |
|---|---|
| Core file location | Configure host or container runtime core pattern deliberately |
| File size | Expect large files, often near process virtual memory size |
| Secrets | Treat cores as highly sensitive artifacts |
| Symbols | Preserve Node binary, native add-on builds, source maps, and release metadata |
| Runtime limits | Ensure ulimit, security profile, and writable dump path allow capture |
Native post-mortem checklist:
- Record Node version, exact binary, container image digest, and architecture.
- Preserve the core, executable, native add-ons, and generated reports together.
- Inspect native stack with lldb, gdb, or platform debugger.
- If using llnode, match the Node and V8 version closely.
- Correlate crash time with OTel traces, logs, kernel messages, and deployment events.
Incident recipes
Recipe: unexplained p99 spike
- Confirm p99 increase in request histogram.
- Check event loop delay histogram and event loop utilization from 15 Performance Engineering Benchmarking Flamegraphs GC and Event Loop Latency.
- Capture a 30 second CPU profile on one hot replica.
- Capture trace events with node.perf and v8 if CPU profile shows runtime noise.
- Compare spans for slow and normal requests.
- Look for sync I/O, JSON serialization, regex backtracking, compression, crypto, large GC pauses, and hot promise churn.
Recipe: production memory leak
- Split RSS, heap used, external, and array buffers.
- If heap grows, capture staged heap snapshots.
- If external grows, inspect buffers, native add-ons, compression, TLS, image processing, and WASM.
- Capture diagnostic report when memory crosses threshold.
- For fatal OOM, run with --report-on-fatalerror and consider core dump on a canary only.
- Validate fix with allocation rate, retained objects, and steady-state RSS after warmup.
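The first three steps of this recipe amount to comparing two `process.memoryUsage()` samples and deciding which tool to reach for. A rough sketch follows; the 16 MiB threshold and the category labels are arbitrary assumptions, not recommended values.

```javascript
// Compares two process.memoryUsage() samples and names the fastest-growing area.
// thresholdBytes is an illustrative default, not a tuned value.
function classifyMemoryGrowth(before, after, thresholdBytes = 16 * 1024 * 1024) {
  const delta = {
    rss: after.rss - before.rss,
    heapUsed: after.heapUsed - before.heapUsed,
    external: after.external - before.external,
    arrayBuffers: after.arrayBuffers - before.arrayBuffers,
  };
  let suspect = 'none';
  if (delta.heapUsed >= thresholdBytes) {
    suspect = 'js-heap'; // next step: staged heap snapshots
  } else if (delta.arrayBuffers >= thresholdBytes) {
    suspect = 'array-buffers'; // arrayBuffers is included in external, so check it first
  } else if (delta.external >= thresholdBytes) {
    suspect = 'external'; // next step: native add-ons, compression, TLS
  } else if (delta.rss >= thresholdBytes) {
    suspect = 'native-or-fragmentation'; // growth that V8 does not account for
  }
  return { delta, suspect };
}
```

Take both samples after warmup and under comparable load. A single pair taken across a traffic spike mostly measures the spike.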
Recipe: trace context breaks at queue boundary
- Log current trace ID before enqueue.
- Inject W3C trace context into the message metadata.
- Extract context in the consumer before creating the consumer span.
- Use AsyncLocalStorage for process-local request metadata.
- If using a custom in-process queue, preserve context with AsyncResource or AsyncLocalStorage.snapshot().
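The inject and extract steps can be illustrated with a hand-rolled W3C `traceparent` header. In a real service the OpenTelemetry propagation API would normally do this. The sketch only shows what travels in the message metadata, and `injectTraceparent` and `extractTraceparent` are hypothetical helpers.

```javascript
// Version 00 traceparent: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: copy the active trace identity into message metadata.
function injectTraceparent(metadata, { traceId, spanId, sampled }) {
  const flags = sampled ? '01' : '00';
  return { ...metadata, traceparent: `00-${traceId}-${spanId}-${flags}` };
}

// Consumer side: returns the parent identity, or null if absent or malformed.
function extractTraceparent(metadata) {
  const match = TRACEPARENT.exec(metadata?.traceparent ?? '');
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid under the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A consumer that gets `null` back should start a new root span and log the gap, since that is exactly the "trace parent gap" symptom from the decision table.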
Recipe: process hangs but does not crash
- Send report signal if enabled.
- Capture active handles and resource usage from the report.
- Capture CPU profile if the event loop is still running.
- Use trace events around suspected categories if reproducible.
- If frozen below JavaScript, use OS debugger or core capture.
Production artifact policy
| Artifact | Sensitivity | Retention | Access |
|---|---|---|---|
| Logs | Medium to high | Short by default, longer for audit streams | Service and incident responders |
| Traces | Medium | Sampling and retention by SLO value | Engineers for owning service |
| CPU profiles | High | Incident window only | Limited responders |
| Heap snapshots | Very high | Short, encrypted | Explicit approval |
| Diagnostic reports | High | Incident window only | Limited responders |
| Core dumps | Critical | Short, encrypted, heavily audited | Small debug group |
Do not ship diagnostic artifacts to general log pipelines unless they have been scrubbed and size bounded. A heap snapshot can contain request bodies, tokens, session cookies, SQL strings, private keys, and user data.
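"Scrubbed and size bounded" can be made concrete for small structured artifacts such as a diagnostic report excerpt. The sketch below redacts by key name and truncates long strings and arrays. The key pattern, the limits, and the name `scrubForLogPipeline` are illustrative assumptions, and key-based redaction does not catch secrets embedded in free-form values, which is why heap snapshots and cores should stay out of log pipelines regardless.

```javascript
// Keys whose values are replaced wholesale (illustrative, not exhaustive).
const SENSITIVE_KEY = /authorization|cookie|password|secret|token|api[-_]?key/i;

// Returns a redacted, size-bounded copy of a JSON-like value.
function scrubForLogPipeline(value, maxStringLength = 256, depth = 0) {
  if (depth > 6) return '[truncated-depth]';
  if (typeof value === 'string') {
    return value.length > maxStringLength
      ? `${value.slice(0, maxStringLength)}[truncated]`
      : value;
  }
  if (Array.isArray(value)) {
    return value.slice(0, 50).map((item) => scrubForLogPipeline(item, maxStringLength, depth + 1));
  }
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key)
        ? '[redacted]'
        : scrubForLogPipeline(inner, maxStringLength, depth + 1);
    }
    return out;
  }
  return value;
}
```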
Cross-links
- Root map: Node.js V8 Runtime Engineering
- Performance playbook: 15 Performance Engineering Benchmarking Flamegraphs GC and Event Loop Latency
- This note feeds incident evidence into benchmarking, flamegraphs, GC analysis, and event loop latency work.
Official reference anchors checked
- Node.js diagnostics_channel API: https://nodejs.org/api/diagnostics_channel.html
- Node.js asynchronous context tracking and async_hooks docs: https://nodejs.org/api/async_context.html and https://nodejs.org/api/async_hooks.html
- Node.js perf_hooks API: https://nodejs.org/api/perf_hooks.html
- Node.js inspector API: https://nodejs.org/api/inspector.html
- Node.js trace events API: https://nodejs.org/api/tracing.html
- Node.js diagnostic report API: https://nodejs.org/api/report.html
- Node.js V8 API: https://nodejs.org/api/v8.html
- Node.js CLI diagnostic and profiling flags: https://nodejs.org/api/cli.html
- OpenTelemetry JavaScript instrumentation, context, propagation, resources, and sampling docs: https://opentelemetry.io/docs/languages/js/