Kernel Architecture, Modules, Drivers, And Device Model

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/linux-systems-engineering/13 Kernel Architecture Modules Drivers and Device Model.md.

Purpose: Build a production-grade mental model of Linux kernel architecture, modules, drivers, subsystem boundaries, concurrency primitives, crash signals, configuration, and build discipline without treating kernel patching as a default operations tool.

Related notes: Linux Systems Engineering, 01 Linux Mental Model User Space Kernel and Hardware, 02 Processes Threads Scheduling Signals and Jobs, 03 Memory Virtual Memory Paging Allocators and OOM, 04 Filesystems VFS Block IO Page Cache and Storage, 05 Linux Networking TCP IP Routing Firewalling and DNS, 06 System Calls ABI libc and User Kernel Boundaries

The Linux kernel is a privileged, shared, long-lived runtime. It is not a library linked into each application and it is not a microkernel that keeps most services in separate user-space servers. The practical model is a monolithic kernel with loadable modules: core subsystems, filesystems, protocol stacks, memory management, scheduling, security hooks, and drivers execute in one kernel address space, while modules can be inserted and removed at runtime if the build, policy, and current references allow it.

That model is powerful and unforgiving. A driver bug can corrupt unrelated kernel memory. A blocking call in atomic context can deadlock a host. A wrong lock order can freeze CPUs that have no relationship to the device being debugged. A local learning machine can tolerate experiments with custom kernels, unsigned modules, debug configs, and crash loops. A production host or cluster node cannot be treated that way: the node may be carrying customer traffic, persistent storage, CNI dataplane state, eBPF programs, kubelet responsibilities, hardware queues, and scheduler placement assumptions.

Architecture Map

The important boundary is not "kernel code" versus "driver code". Device drivers are kernel code. Filesystems are kernel code. Network protocols are kernel code. Many features that look like optional plugins are compiled into the kernel image or loaded as modules, but once executing they run with kernel privilege and kernel failure modes.

Monolithic Kernel Model

Linux keeps major operating-system services inside one kernel image and one kernel address space. System calls enter the kernel, perform work through internal subsystem interfaces, and return to user space. Internal kernel APIs are not stable in the same way user-kernel ABIs are stable. Distribution kernels may backport features, carry vendor patches, enable or disable config options, or expose module signing and lockdown policies that differ from upstream defaults.

| Property | Practical meaning | Field consequence |
| --- | --- | --- |
| Monolithic address space | Core kernel, built-in drivers, and loaded modules share privileged memory. | A bad module can corrupt state outside its apparent feature area. |
| Stable user-kernel ABI | Normal programs rely on documented syscalls, ioctls, procfs, sysfs, netlink, and device nodes. | Prefer user-space APIs over depending on internal symbols. |
| Unstable internal ABI | Internal structs, functions, and locking rules can change between kernel versions. | Out-of-tree modules carry ongoing maintenance risk. |
| Config-shaped behavior | CONFIG_* selections decide compiled code, debug features, hardening, and module support. | uname -r is not enough; inspect config and distro patch level. |
| Shared cluster kernel | Containers share the node kernel even when namespaces hide most global state. | A kernel bug or unsafe module is a node-wide and often cluster-wide risk. |

Local learning machine guidance: build kernels, toggle configs, crash VMs, load toy modules, read oops traces, and use debug options. Production guidance: prefer vendor-supported kernels, staged rollout, canaries, rollback entries, crash dump configuration, module allowlists, and observable success criteria before changing anything below user space.

Modules

A module is a relocatable object file (.ko) built for a specific kernel release and configuration. It can provide a driver, filesystem, protocol, security feature, tracing helper, or other extension. Loading a module links it into the running kernel. The loader resolves references against exported symbols, checks version metadata when configured, applies relocations, runs the module's init code, and records the module in kernel state.

Common module commands:

uname -r
lsmod
modinfo ixgbe
modprobe vcan
modprobe -r vcan
insmod ./example.ko
rmmod example
depmod -a
cat /proc/modules
find /lib/modules/"$(uname -r)" -type f -name '*.ko*' | head

modprobe understands module aliases, dependency metadata, blacklists, install rules, and /lib/modules/$(uname -r). insmod attempts to insert exactly the file you give it. In production, modprobe is usually the safer administrative tool because it uses the distribution's module dependency database. insmod belongs mostly in controlled development and break-glass debugging.

| Concern | Local learning machine | Production host or cluster |
| --- | --- | --- |
| Unsigned modules | Useful for learning module lifecycle. | Usually reject or tightly control through Secure Boot, lockdown, signing, and policy. |
| Out-of-tree modules | Good for understanding driver APIs. | Treat as kernel-adjacent supply chain and reliability risk. |
| Module unloading | Good for iteration. | Avoid unloading active storage, network, security, or filesystem modules unless a vendor runbook says it is safe. |
| Kernel taint | Acceptable during labs. | Capture taint state and expect upstream or vendor support to ask for reproduction without taint. |
| Version mismatch | Teaches vermagic and symbol versioning. | A deployment blocker, not something to force through. |

Loading And Unloading Path

Unloading is not symmetric with loading. A module may refuse to unload because objects still hold references. Forced removal is a lab maneuver and a production hazard. A driver that is servicing block IO, owning a network interface, holding timers, exposing sysfs files, or waiting for asynchronous work can leave live callbacks behind if its teardown is wrong.
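The reference rule can be checked before anyone reaches for rmmod. The following is a minimal sketch, assuming the usual /proc/modules column order (name, size, reference count, dependent modules, state, address); the function name modules_in_use is hypothetical and not part of any tool.

```shell
# List modules that are unlikely to unload cleanly: a non-zero reference
# count or at least one dependent module.
# Reads /proc/modules-format text on stdin.
# Assumed columns: name size refcount dependents state address
# (the refcount column shows "-" on kernels built without module unloading).
modules_in_use() {
    awk '$3 > 0 || $4 != "-" {
        deps = $4
        sub(/,$/, "", deps)            # the holder list ends with a comma
        if (deps == "-") deps = "none"
        printf "%s refs=%s used_by=%s\n", $1, $3, deps
    }'
}
```

Typical use is `modules_in_use < /proc/modules`. An empty holder list with a non-zero count usually means something other than a module holds the reference, such as an open device node or a mounted filesystem.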

Common mistakes:

| Mistake | Failure mode | Better practice |
| --- | --- | --- |
| Loading a module built for a different kernel release | Invalid module format, unresolved symbols, or subtle ABI breakage. | Build against the exact headers and config for the target kernel. |
| Ignoring module parameters | Driver defaults surprise network, storage, or interrupt behavior. | Read modinfo -p, distro docs, and current /sys/module/<name>/parameters. |
| Treating lsmod absence as feature absence | Feature may be built in. | Check /boot/config-*, /proc/config.gz, sysfs, and subsystem evidence. |
| Removing a module to "reset" hardware | Host loses disks, NICs, or security hooks. | Use device-specific reset, maintenance windows, and vendor guidance. |
| Using DKMS blindly on cluster nodes | Module build changes node behavior outside image promotion. | Bake, sign, test, and roll out modules through the same node lifecycle as kernels. |
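The "lsmod absence" mistake can be turned into a three-way check. This is a sketch under assumptions: it expects a /proc/modules-format file and a modules.builtin-format file (one kernel/.../name.ko path per line), and the helper name module_presence is hypothetical.

```shell
# Report whether a feature is "loaded", "built-in", or "absent".
# $1 = module name
# $2 = /proc/modules-format file
# $3 = modules.builtin-format file (paths ending in <name>.ko)
module_presence() {
    # Loaded modules are listed with underscores even when the .ko
    # file name uses hyphens, so normalize both sides.
    name=$(printf '%s' "$1" | tr '-' '_')
    if awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$2"; then
        echo loaded
    elif tr '-' '_' < "$3" | grep -q "/${name}\.ko\$"; then
        echo built-in
    else
        echo absent
    fi
}
```

On a real host the call would look like `module_presence ext4 /proc/modules "/lib/modules/$(uname -r)/modules.builtin"`. "absent" only means neither list mentions the name; the feature may still exist under a different module name or as non-modular built-in code, so the kernel config remains the stronger evidence.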

Device Model

The Linux device model connects devices, buses, drivers, classes, firmware descriptions, hotplug, sysfs, and user-space device management. The core idea is binding: a device appears on a bus, the driver core matches it to a driver using IDs or firmware data, the driver's probe function initializes it, and user space sees a representation through sysfs, devtmpfs, netlink events, and sometimes /dev nodes created or labeled by udev.

The device model is why the same NIC driver can bind when a PCI device appears at boot or when a hotplug device appears later. It is also why driver bugs often show as probe failures, deferred probes, missing firmware, DMA mapping errors, interrupt storms, or sysfs state that does not match operator expectation.
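Binding state is visible in sysfs without any special tooling. The sketch below assumes the conventional layout where a bound device directory contains a driver symlink and a modalias file; the function names are hypothetical, and the PCI address in the comment is only an example.

```shell
# Print the driver bound to a sysfs device directory, or "unbound".
# $1 = device directory, e.g. /sys/bus/pci/devices/0000:00:1f.2 (example address)
device_driver() {
    dev=$1
    if [ -L "$dev/driver" ]; then
        # The symlink target ends in .../drivers/<driver-name>.
        basename "$(readlink "$dev/driver")"
    else
        echo unbound
    fi
}

# Print the modalias string used for driver matching, if the device exposes one.
device_modalias() {
    if [ -r "$1/modalias" ]; then
        cat "$1/modalias"
    fi
}
```

An "unbound" result combined with a modalias that `modprobe --resolve-alias` cannot map is a hint that no installed driver claims the device. An "unbound" result with a resolvable alias points toward a probe failure, a blacklist, or policy, and the kernel log is the next place to look.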

Production guidance:

| Task | Evidence to collect first |
| --- | --- |
| Device missing | lspci -nn, lsusb, dmesg -T, /sys/bus/*/devices, firmware logs. |
| Driver not bound | modinfo, /sys/bus/*/drivers, modalias, blacklists, kernel config, Secure Boot policy. |
| Probe failed | Kernel logs around probe, missing firmware, resource conflicts, ACPI or Device Tree data, IOMMU errors. |
| Hotplug unreliable | udev rules, power management, link state, cable or slot events, bus reset logs. |
| Cluster node differs | Node image, kernel version, module set, firmware package, BIOS settings, PCI topology, CNI or CSI daemonset state. |

Character, Block, And Network Devices

Drivers expose different kernel interfaces depending on the resource being modeled.

| Device class | User-visible shape | Kernel path | Typical examples | Operational risk |
| --- | --- | --- | --- | --- |
| Character device | Byte stream or command interface under /dev. | file_operations, major/minor numbers, ioctl, read/write, mmap. | TTYs, serial ports, /dev/kmsg, GPUs, watchdogs, random devices, misc devices. | ioctls are often device-specific and can expose large privileged surfaces. |
| Block device | Random-access storage device with request queues. | Block layer, queue limits, bio, elevator or blk-mq, filesystem above. | NVMe, SCSI disks, loop devices, dm-crypt, LVM, mdraid. | Bugs can corrupt data, wedge IO, or trigger node-wide hangs in uninterruptible sleep. |
| Network device | Packet interface with link state and queues. | netdev, NAPI, qdisc, XDP, protocol stack, ethtool. | Ethernet NICs, veth, bridge, bond, VLAN, WireGuard, virtual CAN. | Driver or offload bugs can look like firewall, MTU, TLS, or application failures. |

Character devices are often narrow but sharp. A watchdog char device can reboot a host if mishandled. A GPU or accelerator char device may combine memory mapping, DMA, synchronization objects, and ioctls. A local lab can use toy char drivers to learn open, read, write, poll, and ioctl. In production, unknown char devices should be treated as privileged kernel-facing APIs.

Block devices sit below filesystems and page cache. They need correct queue limits, flush behavior, discard behavior, write barriers, error propagation, and hot removal handling. In clusters, a CSI stack, multipath setup, dm layer, cloud volume attachment, and node kernel all contribute to what looks like a single PVC symptom.
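When a storage path wedges, tasks pile up in uninterruptible sleep. A small filter makes that visible; this is a sketch that assumes ps output with the state in the first column, and the function name d_state_tasks is hypothetical.

```shell
# Keep only tasks in uninterruptible sleep (state starting with D).
# Reads rows on stdin as produced by:
#   ps -eo stat,pid,wchan:32,comm
# The header row is dropped because "STAT" does not start with D.
d_state_tasks() {
    awk '$1 ~ /^D/ { print }'
}
```

The wait channel column often names the kernel function a task is blocked in, which helps separate a slow device from a filesystem or dm-layer stall. A brief D state is normal under IO load; tasks that stay there across several samples are the ones worth investigating.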

Network devices sit between hardware or virtual dataplanes and the protocol stack. NAPI polling, interrupt moderation, GRO/GSO/TSO, RSS, qdisc, XDP, and driver offloads can make a bug appear only under load. Local tests with ip link, veth, bridge, and vcan are useful, but production clusters add CNI policy, kube-proxy or eBPF service handling, overlay encapsulation, MTU constraints, and host firewall rules.

Syscall Table Overview

The syscall table is the architecture-specific dispatch map from syscall numbers to kernel entry functions. User space does not call arbitrary kernel functions. It enters through an architecture ABI, passes a syscall number plus arguments, and the kernel dispatches to the implementation for that syscall number.
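The architecture dependence is easy to see from user space, because the C headers publish the numbers as `__NR_<name>` macros. The sketch below only parses preprocessor output; the function name syscall_number is hypothetical, and it assumes simple `#define __NR_name number` lines.

```shell
# Look up a syscall number from "#define __NR_<name> <number>" lines on stdin.
# Exits non-zero when the name is not present.
syscall_number() {
    awk -v want="__NR_$1" '
        $1 == "#define" && $2 == want { print $3; found = 1; exit }
        END { exit !found }'
}
```

With a C compiler and libc headers installed, `printf '#include <sys/syscall.h>\n' | cc -x c -E -dM - | syscall_number openat` prints the number for the build architecture. For example, openat is 257 on x86_64 and 56 on arm64, which is why seccomp filters and audit rules must be written per architecture.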

The syscall table is useful for understanding tracing, seccomp filters, audit events, and architecture differences. It is not a production extension point. Hooking syscall tables is associated with rootkits, brittle observability, and unsupported modules. Modern instrumentation should prefer tracepoints, kprobes with care, eBPF where appropriate, LSM hooks where intended, audit, perf, or documented subsystem APIs.

For boundary details, connect this note to 06 System Calls ABI libc and User Kernel Boundaries. The production lesson is simple: syscalls are stable contracts; internal syscall implementation details are not.

VFS Layer

The Virtual Filesystem layer gives Linux a common file model across ext4, XFS, tmpfs, procfs, sysfs, overlayfs, network filesystems, block-backed filesystems, and device-backed special files. VFS handles path lookup, dentries, inodes, superblocks, mount namespaces, open file descriptions, file operations, permissions, and delegation to filesystem-specific methods.

VFS is why "everything is a file" is useful but incomplete. A path may name a regular file, directory, procfs attribute, sysfs control, socket, FIFO, char device, block device, or overlay object. The same read syscall can hit page cache, a driver callback, a procfs generator, or a network filesystem.

Common mistakes:

| Mistake | Why it fails |
| --- | --- |
| Debugging an open path without checking mount namespace | The process may see a different mount tree from the shell. |
| Treating /proc values as stable file formats | Many are diagnostic interfaces, not strict application APIs. |
| Assuming block device health from filesystem symptoms only | Page cache, journal, IO scheduler, dm layer, and device driver may all be involved. |
| Ignoring dentry and inode caching | Deleted files can stay open and consume space until the last FD closes. |
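The mount namespace mistake has a cheap check: compare the ns/mnt links of the debugging shell and the target process. This is a minimal sketch; the function name same_mount_ns is hypothetical, and the optional third argument exists only so the proc root can be swapped out for testing.

```shell
# Succeed when two PIDs share a mount namespace.
# $1, $2 = PIDs; $3 = proc root (defaults to /proc).
# Returns 2 when a link cannot be read (no such PID or insufficient privilege).
same_mount_ns() {
    root=${3:-/proc}
    a=$(readlink "$root/$1/ns/mnt") || return 2
    b=$(readlink "$root/$2/ns/mnt") || return 2
    # Link targets look like "mnt:[4026531840]"; equal text means same namespace.
    [ "$a" = "$b" ]
}
```

A call such as `same_mount_ns $$ 1234 || echo "different mount tree"` (1234 being a placeholder PID) tells you whether paths seen from the shell are meaningful for the process. Reading another process's namespace links usually needs root or matching credentials.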

Network Stack

The network stack connects socket syscalls, protocol state, routing, neighbor discovery, netfilter, traffic control, device queues, NAPI, drivers, and hardware offloads. Packets may be processed by softirqs, ksoftirqd, qdisc code, XDP, tc, conntrack, tunnels, and virtual devices before an application sees a byte.

Production guidance:

| Symptom | Kernel-side angles |
| --- | --- |
| High packet loss | NIC ring drops, qdisc drops, conntrack table pressure, GRO/offload behavior, driver errors, MTU. |
| High CPU in softirq | NAPI budget, interrupt affinity, RSS, RPS/XPS, packet rate, firewall rules, encapsulation. |
| Latency spikes | Coalescing, qdisc backlog, ksoftirqd scheduling, lock contention, power management, CPU isolation mistakes. |
| Cluster service failure | CNI device graph, eBPF or iptables service rules, overlay MTU, node conntrack, host routing policy. |
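For the packet-loss row, the generic per-interface counters in sysfs are a reasonable first read before driver-specific ethtool statistics. The sketch below assumes the standard /sys/class/net/<iface>/statistics layout; the function name iface_drops is hypothetical, and the second argument exists only to allow a test fixture.

```shell
# Print selected drop and error counters for one interface as name=value lines.
# $1 = interface name; $2 = net class root (defaults to /sys/class/net).
iface_drops() {
    stats="${2:-/sys/class/net}/$1/statistics"
    for counter in rx_dropped tx_dropped rx_errors tx_errors rx_missed_errors; do
        if [ -r "$stats/$counter" ]; then
            printf '%s=%s\n' "$counter" "$(cat "$stats/$counter")"
        fi
    done
}
```

The counters are cumulative since the device was created, so a single reading says little. Take two readings a known interval apart and compare; a counter that only grows under load points at ring, queue, or offload behavior rather than configuration.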

On a local learning machine, virtual devices are excellent: ip link add veth0 type veth peer name veth1, ip link add br0 type bridge, modprobe vcan. On production nodes, virtual devices are owned by a runtime or CNI. Do not delete interfaces, flush qdiscs, or unload drivers without knowing which daemon will reconcile or break afterward.

Memory Manager

The memory manager owns virtual memory, page allocation, reclaim, swap, NUMA policy, memory cgroups, slab caches, huge pages, page tables, and user-kernel copy boundaries. Kernel code must be explicit about allocation context. GFP_KERNEL can sleep and reclaim; atomic contexts need non-sleeping flags such as GFP_ATOMIC and should keep allocations small, because those requests cannot wait for reclaim and fail more readily. User pointers are not kernel pointers and must go through the correct access helpers, such as copy_from_user() and copy_to_user().

Important production distinctions:

| Area | What it means | Failure mode |
| --- | --- | --- |
| Page allocator | Allocates physical pages under constraints. | Allocation failure, reclaim stalls, compaction stalls, OOM. |
| Slab or SLUB | Caches kernel objects. | Leaks, use-after-free, fragmentation, cache pressure. |
| vmalloc | Virtually contiguous kernel memory. | TLB overhead, limited address-space areas, not always suitable for DMA. |
| DMA memory | Device-visible memory with mapping rules. | Data corruption if cache coherency and IOMMU rules are wrong. |
| memcg | Per-cgroup accounting and limits. | Container OOM while host has free memory. |
| User copy | Controlled copy between user and kernel address spaces. | EFAULT, security bugs, sleeping in invalid context. |

For broader memory operations, see 03 Memory Virtual Memory Paging Allocators and OOM. Kernel development adds stricter context rules: before allocating or copying, ask whether the current code can sleep, whether it holds a spinlock, whether it runs in interrupt or softirq context, whether reclaim can recurse into this path, and whether the memory must be DMA-capable.
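From the operations side, the memcg failure mode ("container OOM while host has free memory") is visible in the cgroup's own event counters. This sketch assumes cgroup v2 and its memory.events file of key/value lines; the function name memcg_oom_kills is hypothetical.

```shell
# Print the oom_kill counter from a cgroup v2 memory.events file.
# $1 = cgroup directory, e.g. /sys/fs/cgroup/system.slice/example.service
#      (example.service is a placeholder unit name).
# Prints 0 when the file has no oom_kill line.
memcg_oom_kills() {
    awk '$1 == "oom_kill" { print $2; found = 1 }
         END { if (!found) print 0 }' "$1/memory.events"
}
```

A non-zero value alongside plenty of free host memory is the signature of a cgroup limit, not a host-wide shortage. The kernel log carries the matching OOM report with the cgroup path.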

Scheduler And Kernel Threads

The scheduler chooses runnable tasks for CPUs. In Linux, tasks include normal user-space threads and kernel threads. Kernel threads have comm names such as kworker/*, ksoftirqd/*, rcu*, migration/*, watchdog/*, jbd2/*, and driver-specific workers. They do not have a normal user-space memory image, but they are scheduled entities and can consume CPU, block, inherit priority rules, and show up in ps, top, perf, and traces.

| Kernel thread family | What it usually signals |
| --- | --- |
| kworker/* | Workqueue execution. High CPU means queued kernel work, not automatically a scheduler bug. |
| ksoftirqd/* | Softirq processing overflow or threaded softirq work under load. |
| rcu* | RCU callback, grace period, or offload activity. |
| jbd2/* | ext4 journal work. Can indicate storage latency or writeback pressure. |
| watchdog/* | Lockup detection. |
| irq/* | Threaded interrupt handlers. |
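Kernel threads can be separated from user tasks with a common heuristic: on a host (not inside a PID namespace), kthreadd is PID 2 and kernel threads are its children. The sketch below relies on that assumption, and the function name kernel_threads is hypothetical.

```shell
# Print the names of tasks that look like kernel threads:
# kthreadd itself (PID 2) or its children (PPID 2).
# Reads "PID PPID COMM" rows on stdin, e.g. from: ps -eo pid,ppid,comm
# Assumption: run in the host PID namespace; in a container PID 2 may be anything.
kernel_threads() {
    awk '$1 == 2 || $2 == 2 { print $3 }'
}
```

Piping `ps -eo pid,ppid,comm` through the filter and then `sort | uniq -c` on the name prefix gives a quick census of the families in the table above.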

Troubleshooting high kernel-thread CPU:

ps -eLo pcpu,pid,tid,psr,stat,wchan:24,comm --sort=-pcpu | head
cat /proc/softirqs
cat /proc/interrupts
perf top -g
cat /sys/kernel/debug/workqueue/* 2>/dev/null | head

Production caution: perf, ftrace, debugfs, and eBPF tracing may require elevated privileges and can expose sensitive data. Use bounded windows and avoid changing scheduler tunables on cluster nodes unless the workload owner and node owner both understand the blast radius.

RCU Overview

RCU means Read-Copy-Update. It is a synchronization family optimized for read-mostly data. Readers run with very low overhead inside RCU read-side critical sections. Updaters publish a new version, remove or replace old references carefully, then wait for a grace period before freeing memory that old readers might still hold.

RCU is not a general replacement for locks. It is excellent when reads dominate, readers need low overhead, and updates can tolerate copy-and-publish discipline. It is dangerous when developers forget object lifetime, mix RCU and normal locking without a clear ownership rule, or free memory before a grace period.

| Pattern | Use RCU? | Reason |
| --- | --- | --- |
| Read-mostly routing or lookup table | Often yes | Readers stay fast while updates publish new state. |
| Short mutable counter | Usually no | Atomic operations or per-CPU counters are simpler. |
| Complex multi-field mutation | Maybe with locks | RCU can protect lookup while another lock protects mutation. |
| Blocking operation inside read-side critical section | Be careful | Rules vary by RCU flavor and kernel config. |

Locking Primitives

Kernel locking starts with context. Can this code sleep? Can it run in hard IRQ, softirq, tasklet, timer, workqueue, syscall, or kernel-thread context? Can the same data be touched from another CPU? Does the lock protect data or merely surround code?

| Primitive | Sleeps? | Typical use | Bad fit |
| --- | --- | --- | --- |
| Spinlock | No | Short critical sections that may be used where sleeping is forbidden. | Long operations, memory allocation with sleep, user copy, calling unknown callbacks. |
| Mutex | Yes | Process context or kernel thread paths that may block. | Interrupt, softirq, tasklet, or spinlock-held paths. |
| Atomic operation | No | Simple counters, flags, reference transitions, bit operations. | Compound invariants that need multiple fields to change together. |
| RCU | Readers do not normally block | Read-mostly pointer-protected data with deferred freeing. | General mutual exclusion or write-heavy state. |
| rwsem/rwlock | Depends on type | Reader-writer protection when read sections justify complexity. | Short sections where simple spinlock or mutex is enough. |
| seqlock/seqcount | Writers serialize, readers retry | Small data read consistently, such as time-like state. | Data containing pointers that readers cannot safely retry around. |

Spinlocks are about atomic context and CPU-level exclusion. Holding a spinlock while doing expensive work is a latency bug. Holding a spinlock while calling code that might sleep is a correctness bug. spin_lock_bh() disables softirqs (bottom halves) on the local CPU before taking the lock. spin_lock_irqsave() saves the local CPU's interrupt state and disables local interrupts before locking, which is needed when the same lock can also be taken from interrupt context; otherwise an interrupt arriving on the lock-holding CPU would spin forever on a lock that its own CPU already holds.

Mutexes are sleeping locks. If a mutex is contended, the task can block and the scheduler can run something else. That makes mutexes efficient for longer process-context critical sections, but invalid from hard IRQ and softirq context. If code path context is uncertain, prove it before choosing a mutex.

Atomic operations are not tiny locks. They are useful when the invariant fits into the atomic operation. If a state transition needs "increment this, add to list, check flag, publish pointer", a single atomic counter does not protect the whole invariant.

Common locking mistakes:

| Mistake | Result |
| --- | --- |
| Taking the same non-recursive lock twice | Deadlock. |
| Holding a spinlock across copy_to_user or copy_from_user | Sleep-in-atomic bugs or lockups. |
| Calling arbitrary callbacks with a lock held | Lock-order inversion and subsystem deadlocks. |
| Protecting code instead of data | Future changes bypass the real invariant. |
| Assuming uniprocessor tests prove locking | CONFIG_SMP, CONFIG_PREEMPT, and real interrupt timing expose bugs. |
| Forgetting lockdep in development kernels | Missed lock-order bugs before production. |

Deferred Work: Workqueues, Softirqs, Tasklets, And Threads

Kernel work often cannot be done at the moment an event arrives. Interrupt handlers need to finish quickly. Some work must run soon but cannot sleep. Other work may sleep and should run in process context. Linux provides several deferred execution mechanisms.

| Mechanism | Context | Can sleep? | Typical use | Production note |
| --- | --- | --- | --- | --- |
| Softirq | Softirq context, per-CPU, high priority. | No | Networking receive/transmit, timers, block completions, RCU callbacks. | High softirq load can starve normal work until pushed into ksoftirqd. |
| Tasklet | Built on softirq semantics, serialized per tasklet. | No | Legacy deferred driver work. | The API is deprecated in current kernel headers; prefer modern alternatives for new work. |
| Workqueue | Kernel worker thread or BH workqueue depending on type. | Threaded work can sleep. | Driver and subsystem asynchronous work. | Use dedicated workqueues when producers can flood or reclaim paths need forward progress. |
| Threaded IRQ | IRQ work in a schedulable kernel thread. | Thread function can sleep subject to IRQ rules. | Drivers that need lower hard-IRQ latency or RT-friendly handling. | Prefer for many modern driver interrupt designs. |
| Kernel thread | Dedicated schedulable task. | Yes, in normal process context. | Long-lived subsystem loops, daemons, special scheduling needs. | More explicit lifecycle and resource ownership than generic workqueue items. |

Softirqs are fast and constrained. They are appropriate for high-frequency paths such as network packet processing, but they make latency and CPU accounting less intuitive. ksoftirqd/N running hot means the system is spending significant time on deferred kernel work for CPU N, not that ksoftirqd itself is the root cause.
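To find which softirq class is actually busy, sample /proc/softirqs twice and compare. The helper below is a sketch that sums one class across CPUs; it assumes the usual layout of a CPU header row followed by `NAME:` rows, and the function name softirq_total is hypothetical.

```shell
# Sum one softirq class (e.g. NET_RX, TIMER, BLOCK) across all CPUs.
# Reads /proc/softirqs-format text on stdin; exits non-zero if the class is missing.
softirq_total() {
    awk -v want="$1:" '
        $1 == want {
            for (i = 2; i <= NF; i++) sum += $i
            printf "%.0f\n", sum       # avoid exponent notation for large counts
            found = 1
        }
        END { exit !found }'
}
```

A rough rate comes from two samples: `a=$(softirq_total NET_RX < /proc/softirqs); sleep 1; b=$(softirq_total NET_RX < /proc/softirqs); echo $((b - a))`. A hot ksoftirqd/N paired with a steep NET_RX rate points at packet load or interrupt affinity, not at the thread itself.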

Tasklets are legacy. Existing drivers may still use them, but new designs should normally choose threaded interrupts, workqueues, timers, NAPI, or subsystem-specific mechanisms. The replacement depends on required context: choose threaded IRQs for interrupt work that should be schedulable, workqueues for sleepable asynchronous work, and softirq/NAPI only when the performance and context constraints justify it.

Workqueues are the common asynchronous execution API. Modern concurrency-managed workqueues share worker pools and regulate concurrency. Use WQ_MEM_RECLAIM when work participates in memory reclaim paths. Do not dump unbounded producer work into a system workqueue and assume fairness will save you.

Kernel Panics, Oops, Taint, And Logs

An oops is a kernel-detected fault where the kernel may kill the current task and continue. A panic is a decision that the system cannot safely continue. A warning is diagnostic evidence, not necessarily a crash. A tainted kernel is one whose state includes events that may affect debugging trust, such as proprietary modules, forced module operations, machine checks, or prior warnings depending on flags.

Kernel logs are emitted through printk and stored in the kernel log buffer, visible through /dev/kmsg, dmesg, and often journald. Console log level controls what reaches the console immediately; it does not decide whether a message exists in the ring buffer. Excessive logging in hot paths can create its own latency or lockup risks, especially with slow consoles.
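The console threshold lives in /proc/sys/kernel/printk as four numbers. The sketch below labels them and encodes the rule that a message reaches the console only when its severity number is lower than the console loglevel; both function names are hypothetical.

```shell
# Label the four fields of /proc/sys/kernel/printk-format text read from stdin.
printk_levels() {
    awk '{ printf "console=%s default_message=%s minimum_console=%s default_console=%s\n", $1, $2, $3, $4 }'
}

# Succeed when a message of severity $1 (0 = emerg ... 7 = debug) would be
# printed on the console given console loglevel $2.
reaches_console() {
    [ "$1" -lt "$2" ]
}
```

For example, `printk_levels < /proc/sys/kernel/printk` shows the current settings. With a console loglevel of 4, errors (severity 3) reach the console and warnings (severity 4) do not, yet both remain in the ring buffer for dmesg and journald.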

Commands for incident capture:

uname -a
cat /proc/sys/kernel/tainted
dmesg -T --level=err,warn,crit,alert,emerg
journalctl -k -b
cat /proc/modules
cat /proc/cmdline
zcat /proc/config.gz 2>/dev/null | head

Production guidance:

| Signal | What to do |
| --- | --- |
| Oops | Preserve full trace, modules list, taint state, kernel version, workload context, and recent hardware or driver changes. |
| Panic | Confirm crash dump or pstore capture, node reboot policy, fencing behavior, and whether the cluster rescheduled workload safely. |
| Taint | Decode flags and reproduce without taint when seeking upstream or vendor support. |
| Repeated WARN | Treat as a bug signal if correlated with workload or later failure. |
| Log flood | Rate-limit the source, reduce console verbosity, and capture bounded evidence before the log flood hides the first fault. |
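Decoding the taint value is mechanical: /proc/sys/kernel/tainted is a bitmask and each bit has a letter. The sketch below covers bits 0 through 17 using the letters from the kernel's tainted-kernels documentation; newer kernels may define further bits that it ignores, and the function name decode_taint is hypothetical.

```shell
# Decode a taint bitmask into flag letters, or print "untainted" for 0.
# Letter positions follow bit numbers 0..17:
#   P proprietary, F forced load, S out-of-spec, R forced unload, M machine check,
#   B bad page, U user request, D oops/died, A ACPI override, W warning,
#   C staging, I firmware workaround, O out-of-tree, E unsigned, L soft lockup,
#   K live patch, X auxiliary, T randstruct
decode_taint() {
    value=$1
    letters="PFSRMBUDAWCIOELKXT"
    bit=0
    out=""
    while [ "$bit" -lt 18 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="$out$(printf '%s' "$letters" | cut -c $((bit + 1)))"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "${out:-untainted}"
}
```

Run it as `decode_taint "$(cat /proc/sys/kernel/tainted)"`. A result such as OE (out-of-tree, unsigned) tells support staff immediately that a third-party module is loaded, while W or D shows that an earlier warning or oops already happened during this boot.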

On local learning machines, enable crash-friendly settings and practice reading traces. On production clusters, configure kdump or pstore before incidents, make node replacement routine, and avoid "debug by reboot loop" because it destroys first-failure evidence.

Kernel Config And Build Overview

Kernel behavior is selected at build time through Kconfig symbols, Makefiles, architecture support, compiler choices, and distribution patches. A config option may build code in, build it as a module, or omit it. The same apparent version can behave differently across distributions because of backports and config choices.

Useful config evidence:

uname -r
cat /boot/config-"$(uname -r)" 2>/dev/null | grep CONFIG_MODULES
zcat /proc/config.gz 2>/dev/null | grep CONFIG_PREEMPT
modinfo module_name
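A grep over the config answers "is the string present"; the operational question is usually "built in, module, or off". The following sketch classifies one symbol from config text on stdin. The function name config_state is hypothetical, and values that themselves contain "=" are shown only up to the second "=".

```shell
# Report how a Kconfig symbol is set: "built-in" (=y), "module" (=m),
# "value:<v>" for numeric or string settings, or "not set".
# $1 = symbol name without the CONFIG_ prefix; config text arrives on stdin.
config_state() {
    awk -v sym="CONFIG_$1" -F= '
        $1 == sym {
            if ($2 == "y") state = "built-in"
            else if ($2 == "m") state = "module"
            else state = "value:" $2
        }
        END { print (state == "" ? "not set" : state) }'
}
```

Usage depends on where the distribution keeps the config: `config_state MODULES < /boot/config-"$(uname -r)"` or `zcat /proc/config.gz | config_state PROVE_LOCKING`. The same check answers whether a development kernel actually has lockdep or KASAN enabled before a test run is trusted.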

High-level build flow for a local or lab kernel:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
make olddefconfig
make menuconfig
make -j"$(nproc)"
make modules
sudo make modules_install
sudo make install

External module builds require a prepared or built kernel tree with matching headers and configuration. modules_prepare prepares enough for many external module builds, but a full kernel build is needed for some versioning artifacts. Production module builds should be reproducible, signed when policy requires it, tied to the exact target kernel, and promoted with the node image or package pipeline.
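"Tied to the exact target kernel" can be enforced as a pipeline gate. A minimal sketch, assuming the first token of a module's vermagic string is the kernel release it was built for; the function name vermagic_matches and the release strings in the example are hypothetical.

```shell
# Succeed when a module's vermagic names the target kernel release.
# $1 = vermagic string, e.g. from: modinfo -F vermagic ./example.ko
# $2 = target release, e.g. from: uname -r on the target node
vermagic_matches() {
    built_for=${1%% *}        # first whitespace-separated token
    [ "$built_for" = "$2" ]
}
```

For instance, `vermagic_matches "$(modinfo -F vermagic ./example.ko)" "$(uname -r)" || echo "refusing to ship"`. A matching release is necessary but not sufficient: the rest of the vermagic string and, where enabled, symbol version checksums must also agree, which the kernel verifies at load time.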

| Build choice | Local learning machine | Production host or cluster |
| --- | --- | --- |
| Upstream vanilla kernel | Excellent for learning subsystem behavior. | Usually not the operational default unless your org owns kernel support. |
| Distro kernel source | Good for understanding real host behavior. | Preferred when support contract and patch stream matter. |
| Debug config | Useful for lockdep, KASAN, KCSAN, tracing labs. | Use canaries or dedicated debug nodes because overhead and behavior change. |
| Custom patch | Acceptable in a disposable VM. | Requires design review, security review, rollback, crash capture, performance testing, and owner signoff. |
| External module | Good for driver API learning. | Treat as production code with kernel-level blast radius. |

When Not To Patch The Kernel

Do not patch the kernel just because the kernel is where a symptom appears. Many production kernel symptoms are caused by user-space policy, firmware, hardware, config, limits, or workload shape.

Prefer these before a kernel patch:

| Need | Usually better first move |
| --- | --- |
| Observe application behavior | strace, perf, tracepoints, eBPF tracing, logs, metrics. |
| Enforce service policy | systemd unit settings, cgroups, namespaces, seccomp, LSM policy. |
| Filter traffic | nftables, tc, XDP/eBPF, CNI policy, load balancer config. |
| Add device support | Vendor-supported kernel, firmware update, existing driver, distro backport. |
| Fix performance | Tune workload, IRQ affinity, queue depth, sysctls, scheduler class, storage layout. |
| Work around kernel bug | Upgrade to supported kernel or apply vendor patch before carrying private code. |

Patch the kernel only when the required behavior truly belongs in kernel space, no supported interface can express it, the change can be tested across relevant hardware and workloads, and the team is prepared to own security, crash, performance, and upgrade consequences.

Troubleshooting Field Flow

Practical triage checklist:

| Question | Why it matters |
| --- | --- |
| Did the first kernel log line precede all later noise? | Later traces may be consequences. |
| Is the kernel tainted? | Supportability and root-cause confidence change. |
| Is the feature built in or loaded as a module? | lsmod is incomplete for built-in code. |
| Which context failed: syscall, workqueue, softirq, IRQ, kthread? | Determines whether sleeping, locking, and allocation behavior are valid. |
| Is this host a cluster node? | Kubelet, CNI, CSI, containerd, eBPF agents, and daemonsets may own the state. |
| Can the issue be reproduced on an untainted supported kernel? | This is often the threshold for vendor or upstream action. |
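Several checklist answers depend on evidence that disappears at reboot, so it helps to snapshot it in one step. The following is a best-effort sketch, not a replacement for kdump or pstore; the function name capture_kernel_evidence and the output file names are hypothetical.

```shell
# Copy first-failure evidence into a directory, skipping anything unreadable.
# $1 = output directory (created if missing).
capture_kernel_evidence() {
    out=$1
    mkdir -p "$out" || return 1
    uname -a > "$out/uname.txt"
    for src in /proc/sys/kernel/tainted /proc/modules /proc/cmdline; do
        if [ -r "$src" ]; then
            cat "$src" > "$out/$(basename "$src").txt" 2>/dev/null || true
        fi
    done
    # Reading the kernel log may need privileges; keep whatever is available.
    dmesg > "$out/dmesg.txt" 2>/dev/null || true
    ls "$out"
}
```

Calling `capture_kernel_evidence "/var/tmp/kernel-evidence-$(date +%s)"` before any restart or module reload preserves the taint value, module list, and boot command line that the questions above ask about. The output path is only an example; on cluster nodes, write to a location that survives node replacement.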

Primary References