eBPF Networking XDP TC Cilium and Service Dataplanes

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/linux-systems-engineering/15 eBPF Networking XDP TC Cilium and Service Dataplanes.md.

Purpose: Explain eBPF networking as programmable packet and socket handling across XDP, TC, socket hooks, cgroup hooks, service dataplanes, Cilium-style policy, and production debugging.

15 eBPF Networking XDP TC Cilium and Service Dataplanes

Related notes: Linux Systems Engineering, 05 Linux Networking TCP IP Routing Firewalling and DNS, 09 cgroups Namespaces Containers and Runtime Isolation, 14 eBPF Fundamentals Verifier Maps Programs and Helpers, 16 eBPF Observability Uprobes Kprobes Tracepoints and CO-RE, 17 Production Operations Troubleshooting and Runbooks, 18 Linux Ecosystem Tools and Learning Projects

eBPF networking programs run at packet, socket, and cgroup boundaries. They can drop, pass, redirect, classify, account, enforce policy, steer traffic, and implement service load balancing. The production value is early and precise decisions without sending every packet through a large user-space proxy. The production risk is equally direct: a bad return code, stale map, bad endpoint identity, incompatible kernel feature, or overloaded event path can become packet loss at line rate.

On a local learning machine, build a lab with network namespaces, veth pairs, a bridge, and throwaway XDP or TC programs. It is acceptable to drop all packets on a lab interface you can recreate. On production hosts and clusters, network eBPF is part of the dataplane. Roll it like firewall, routing, and CNI code: staged, observable, reversible, and tested on the same kernel and NIC mode that will run it.

Hook Selection

| Hook | Position | Strength | Cost and risk |
| --- | --- | --- | --- |
| XDP native | driver receive path before skb allocation | fastest drop and redirect, DDoS filters, L4 load balancing | driver support varies, less kernel context, packet parsing must be careful |
| XDP generic | fallback in kernel stack | easier local testing | not representative of native performance |
| TC ingress | after skb exists, before most higher-level handling | rich packet context, shaping and policy integration | later than XDP, skb cost already paid |
| TC egress | before transmit | egress policy, encapsulation, accounting | can break return traffic and service paths |
| socket filter | attached to sockets | classic filtering and capture | narrower scope than dataplane policy |
| sockmap and sockhash | socket redirection and verdicts | L7-ish acceleration patterns, socket steering | complex semantics and harder debugging |
| cgroup socket hooks | process group socket operations | workload boundary enforcement | depends on correct cgroup placement |

The right hook is the earliest hook that has enough context for a correct decision. If a decision needs process identity, cgroup, TLS metadata, or application state, raw XDP is probably too early. If a decision is "drop obvious garbage from these prefixes", XDP may be ideal.

XDP

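In native (driver) mode, XDP runs before skb allocation; generic mode runs after the skb already exists. The return actions are XDP_PASS (continue into the stack), XDP_DROP, XDP_ABORTED (drop and raise an exception tracepoint), XDP_TX (transmit back out the same interface), and XDP_REDIRECT (redirect through maps). XDP is used for fast packet filtering, L3/L4 load balancing, early DDoS mitigation, AF_XDP delivery to user space, and steering between devices or CPUs.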

XDP is not a full replacement for the network stack. It sees packet bytes and limited metadata early. If the policy depends on conntrack, socket state, route decisions, or application identity, use TC, socket, cgroup hooks, or a higher layer.

Packet parser discipline:

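/* Fragment from inside an XDP program; ctx is a struct xdp_md pointer. */
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

/* Prove the whole Ethernet header is in bounds before reading any field. */
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;

/* h_proto is in network byte order; anything that is not IPv4 goes to the stack. */
if (eth->h_proto != bpf_htons(ETH_P_IP))
    return XDP_PASS;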

Every header read must be preceded by a data_end proof. VLANs, IPv6 extension headers, fragmentation, tunnels, and short packets turn simple examples into real parsers.
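
To show how quickly the simple example grows, here is a minimal user-space sketch of the same discipline extended to VLAN tags and the IPv4 header. It is a model in plain C, not a loadable BPF program: the names parse_to_ipv4, parse_result, and the ET_ constants are hypothetical, and the cap of two VLAN tags is an assumption. Every read sits behind a length check, which is the same proof the verifier demands against data_end.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: a user-space model, not kernel code. */
enum parse_result { PARSE_PASS = 0, PARSE_IPV4 = 1 };

#define ET_IPV4       0x0800
#define ET_VLAN       0x8100 /* 802.1Q */
#define ET_QINQ       0x88A8 /* 802.1ad */
#define MAX_VLAN_TAGS 2      /* assumed cap; keeps the loop bounded */

/* Read a big-endian 16-bit field. The caller has already proven that
 * both bytes are inside the packet. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Walk Ethernet plus up to MAX_VLAN_TAGS tags. On PARSE_IPV4, *l3_off is
 * the offset of a complete minimum-size IPv4 header. */
static enum parse_result parse_to_ipv4(const uint8_t *data,
                                       const uint8_t *data_end,
                                       size_t *l3_off)
{
    size_t len = (size_t)(data_end - data);
    size_t off = 12; /* skip destination and source MAC */

    if (off + 2 > len)
        return PARSE_PASS; /* too short to hold an EtherType */
    uint16_t proto = read_be16(data + off);
    off += 2;

    for (int i = 0; i < MAX_VLAN_TAGS; i++) {
        if (proto != ET_VLAN && proto != ET_QINQ)
            break;
        if (off + 4 > len)
            return PARSE_PASS; /* truncated VLAN tag */
        proto = read_be16(data + off + 2); /* inner EtherType */
        off += 4;
    }

    if (proto != ET_IPV4)
        return PARSE_PASS; /* includes packets with more tags than the cap */
    if (off + 20 > len)
        return PARSE_PASS; /* truncated IPv4 header */
    if ((data[off] >> 4) != 4 || (data[off] & 0x0F) < 5)
        return PARSE_PASS; /* wrong version or impossible IHL */

    *l3_off = off;
    return PARSE_IPV4;
}
```

The sketch passes anything it cannot fully parse, which keeps a parser bug from becoming packet loss. A filter that must fail closed would return a drop instead, and that choice should be deliberate.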

TC Ingress and Egress

TC BPF programs run in the traffic-control layer. They operate on skb context, so they are later than XDP but have more integration with the network stack. TC is a common place for Kubernetes CNI dataplanes, policy enforcement, traffic accounting, encapsulation, and service translation.

Example operator inspection:

tc qdisc show dev eth0
tc filter show dev eth0 ingress
tc filter show dev eth0 egress
sudo bpftool net
sudo bpftool prog show

Tradeoffs:

| Choice | Use when | Avoid when |
| --- | --- | --- |
| XDP drop | unwanted traffic can be identified from early packet headers | decision needs conntrack, process, or cgroup context |
| TC ingress policy | pod or host policy needs skb context and CNI integration | packets should be discarded before skb allocation for survival |
| TC egress policy | outbound identity and destination control matter | NAT and routing interactions are not understood |
| netfilter/nftables | ruleset and conntrack semantics are enough | per-packet programmable maps or CNI dataplane integration is required |

Socket and Cgroup Hooks

Socket hooks work closer to the socket abstraction than raw packets. Cgroup socket hooks let programs evaluate operations made by processes in a cgroup, such as bind, connect, sendmsg, recvmsg, socket options, and address selection; which of these can be hooked depends on the kernel. This is useful for workload boundaries because cgroups already model containers and services.

Use cgroup hooks when the policy is "this workload may connect to these destinations" or "this service may bind these ports." Use TC or XDP when the policy is about packet paths regardless of process ancestry.
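
A policy of the form "this workload may connect to these destinations" reduces to a small allowlist check. The sketch below models that decision in user-space C. The allow_rule layout and the connect4_verdict name are hypothetical, addresses are kept in host byte order for readability, and port 0 meaning "any port" is an assumption of this sketch. The return convention (1 allows, 0 denies) follows what cgroup connect programs use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allow rule: destination prefix plus port (0 = any port). */
struct allow_rule {
    uint32_t net;        /* host byte order, for readability only */
    uint8_t  prefix_len; /* 0..32 */
    uint16_t port;
};

/* Build a netmask without shifting by 32, which is undefined in C. */
static uint32_t prefix_mask(uint8_t len)
{
    if (len == 0)
        return 0;
    if (len >= 32)
        return 0xFFFFFFFFu;
    return 0xFFFFFFFFu << (32 - len);
}

/* Returns 1 to allow the connect, 0 to deny it. */
static int connect4_verdict(const struct allow_rule *rules, size_t n,
                            uint32_t dst_ip, uint16_t dst_port)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(rules[i].prefix_len);

        if ((dst_ip & mask) != (rules[i].net & mask))
            continue; /* destination outside this prefix */
        if (rules[i].port != 0 && rules[i].port != dst_port)
            continue; /* prefix matches, port does not */
        return 1;
    }
    return 0; /* default deny for this workload */
}
```

In a real program the rules would live in a BPF map keyed per cgroup or per workload, and the default for processes outside any known workload slice would need its own decision.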

Production cgroup cautions:

  • verify the cgroup hierarchy is v2 or expected hybrid mode
  • confirm container runtime placement before enforcement
  • handle host-network pods separately
  • define behavior for system daemons outside workload slices
  • test restarts because cgroup paths and IDs can change

Packet Filtering

Packet filtering with eBPF should be map-driven. Hardcoded rules are useful for demos, but production filters need user-space control planes, atomic map updates, observability, and rollback.
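
One way to get atomic updates and rollback is to keep two rule tables and flip a single index between them. The sketch below models that in user-space C with C11 atomics. All names (ruleset, publish, rollback, should_drop) are hypothetical, and the comparison to swapping an inner map in a map-in-map is a loose analogy rather than a description of any specific tool.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RULES 8

struct rule { uint32_t net; uint32_t mask; }; /* drop if (ip & mask) == net */

struct ruleset {
    size_t n;
    struct rule rules[MAX_RULES];
    uint64_t hits; /* plain counter: this model is single-threaded */
};

static struct ruleset g_sets[2];
static atomic_int g_active; /* index of the live set, starts at 0 */

/* Packet path: one atomic load, then lookups against a stable table. */
static int should_drop(uint32_t src_ip)
{
    struct ruleset *rs = &g_sets[atomic_load(&g_active)];

    for (size_t i = 0; i < rs->n; i++) {
        if ((src_ip & rs->rules[i].mask) == rs->rules[i].net) {
            rs->hits++;
            return 1;
        }
    }
    return 0;
}

/* Control plane: fill the inactive set, then flip the index in one store.
 * Returns the previous index for rollback, or -1 if the rules do not fit. */
static int publish(const struct rule *rules, size_t n)
{
    int prev = atomic_load(&g_active);
    struct ruleset *next = &g_sets[1 - prev];

    if (n > MAX_RULES)
        return -1;
    next->n = n;
    next->hits = 0;
    for (size_t i = 0; i < n; i++)
        next->rules[i] = rules[i];
    atomic_store(&g_active, 1 - prev);
    return prev;
}

/* Rollback is the same flip in reverse; the old table was never modified. */
static void rollback(int prev)
{
    atomic_store(&g_active, prev);
}
```

The packet path never sees a half-written table, and rollback costs one store. The hit counter stands in for the observability requirement; a real dataplane would use per-CPU counters instead of a shared one.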

Common policy dimensions:

| Dimension | Example | Risk |
| --- | --- | --- |
| L2 | MAC, VLAN | spoofing, virtual switches, overlays |
| L3 | source or destination CIDR | NAT and tunnels change visible addresses |
| L4 | TCP or UDP ports | dynamic ports and protocols |
| identity | endpoint ID, cgroup, pod labels via maps | stale identity maps can over-allow or over-deny |
| state | connection tuple, reverse path | map eviction and conntrack disagreement |

Load Balancing and Service Dataplanes

eBPF service load balancers commonly replace a chain of iptables or IPVS decisions with map lookups at XDP or TC. A packet for a virtual service address is mapped to a backend endpoint, then rewritten or redirected. Reverse translation may be needed for return traffic.
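
The forward and reverse steps can be sketched in user-space C as follows. This is a model under stated assumptions, not how any particular dataplane implements it: the struct layouts and the lb_forward and lb_reverse names are hypothetical, backend selection uses an FNV-1a hash over the client address and port, and reverse translation matches on the backend address alone where a real dataplane would normally key on connection state.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BACKENDS 4

struct endpoint { uint32_t ip; uint16_t port; };

struct service {
    struct endpoint vip;                    /* virtual service address */
    size_t n_backends;
    struct endpoint backends[MAX_BACKENDS];
};

struct flow {
    uint32_t src_ip; uint16_t src_port;
    uint32_t dst_ip; uint16_t dst_port;
};

/* FNV-1a over the client side of the tuple, so one client flow always
 * lands on the same backend while the backend list is unchanged. */
static uint32_t flow_hash(uint32_t src_ip, uint16_t src_port)
{
    const uint8_t bytes[6] = {
        (uint8_t)(src_ip >> 24), (uint8_t)(src_ip >> 16),
        (uint8_t)(src_ip >> 8),  (uint8_t)src_ip,
        (uint8_t)(src_port >> 8), (uint8_t)src_port,
    };
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(bytes); i++) {
        h ^= bytes[i];
        h *= 16777619u;
    }
    return h;
}

/* Forward path: if the packet targets the VIP, rewrite the destination
 * to a backend and return 1. Otherwise leave it untouched and return 0. */
static int lb_forward(const struct service *svc, struct flow *pkt)
{
    if (pkt->dst_ip != svc->vip.ip || pkt->dst_port != svc->vip.port)
        return 0;
    if (svc->n_backends == 0)
        return 0; /* no ready backend; the caller decides what happens */

    size_t slot = flow_hash(pkt->src_ip, pkt->src_port) % svc->n_backends;
    pkt->dst_ip = svc->backends[slot].ip;
    pkt->dst_port = svc->backends[slot].port;
    return 1;
}

/* Reply path: restore the VIP as the source so the client sees the
 * address it originally connected to. */
static int lb_reverse(const struct service *svc, struct flow *reply)
{
    for (size_t i = 0; i < svc->n_backends; i++) {
        if (reply->src_ip == svc->backends[i].ip &&
            reply->src_port == svc->backends[i].port) {
            reply->src_ip = svc->vip.ip;
            reply->src_port = svc->vip.port;
            return 1;
        }
    }
    return 0;
}
```

Hashing modulo the backend count remaps existing flows whenever the list changes, which is one reason a plain lookup is not enough on its own.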

Service load balancing needs more than a hash map:

  • backend health and readiness
  • graceful termination handling
  • session affinity if required
  • external traffic policy semantics
  • source IP preservation decisions
  • NodePort, ClusterIP, LoadBalancer, and host-reachable service behavior
  • IPv4 and IPv6 parity
  • direct routing, tunneling, or hybrid routing mode

Kube-proxy Replacement Overview

kube-proxy traditionally programs iptables or IPVS to implement Kubernetes Service translation. An eBPF kube-proxy replacement moves service lookup and translation into BPF programs and maps, often attached at TC and sometimes XDP for selected paths. The expected win is less ruleset scaling pain and tighter integration with pod identity, policy, and observability.

The migration risk is dataplane ownership. During transition, iptables rules, conntrack state, node-local agents, CNI routes, kubelet behavior, load balancers, and BPF maps can disagree. In production clusters, change one node pool or node role at a time, drain workloads where required, and have a rollback path that includes dataplane state cleanup.

Cilium Overview

Cilium is a Kubernetes networking, security, and observability dataplane built on eBPF. It uses BPF programs and maps for service load balancing, network policy enforcement, endpoint identity, routing modes, and flow observability. It can run with kube-proxy or replace kube-proxy depending on configuration and environment support.

Do not reduce Cilium to "eBPF is faster." Its operational model includes an agent, operator, CNI integration, identity allocation, policy repository, endpoint regeneration, BPF map management, Hubble flow visibility, and cluster-specific routing decisions.

| Cilium area | eBPF role | Operator concern |
| --- | --- | --- |
| endpoint policy | enforce allow and deny decisions near pod traffic | identity correctness and policy rollout |
| service handling | BPF maps for service and backend selection | kube-proxy replacement mode, health, termination |
| observability | flow events similar to Hubble output | event volume, dropped flow records, privacy |
| routing | direct routing, tunneling, or cloud integration | MTU, routes, encapsulation, node capabilities |
| host firewall | policy at host boundary | avoid locking out control-plane and node agents |

Network Policy Enforcement

Network policy enforcement has three planes:

| Plane | Responsibility | Failure mode |
| --- | --- | --- |
| intent | Kubernetes NetworkPolicy or richer policy API | policy does not express the real dependency |
| identity | map labels, services, pods, nodes, cgroups to numeric identities | stale or missing identity |
| dataplane | enforce packets against maps at TC, XDP, or socket hooks | wrong hook, wrong map, wrong default |

Default deny is powerful and dangerous. On a local cluster, it is a good exercise. In production, introduce visibility first, then narrow policies by namespace or workload, then enforce. Watch DNS, health checks, metrics scraping, node-to-pod traffic, and storage plugins.
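
The "visibility first, then enforce" sequence can be modelled as one lookup with two modes. The sketch below is a user-space illustration only: the policy_key fields, the numeric identities, and the audit verdict are hypothetical and are not taken from any specific policy engine. In audit mode a miss is reported but the packet is still forwarded; in enforce mode the same miss becomes a deny.

```c
#include <stddef.h>
#include <stdint.h>

enum policy_mode { MODE_AUDIT, MODE_ENFORCE };

enum policy_result {
    POLICY_ALLOW, /* matched an allow entry */
    POLICY_AUDIT, /* no match, forwarded anyway, would be denied if enforced */
    POLICY_DENY   /* no match, dropped */
};

/* Hypothetical key: numeric identities stand in for labels and pods. */
struct policy_key {
    uint32_t src_id;
    uint32_t dst_id;
    uint16_t dport;
    uint8_t  proto; /* 6 = TCP, 17 = UDP */
};

static int key_eq(const struct policy_key *a, const struct policy_key *b)
{
    return a->src_id == b->src_id && a->dst_id == b->dst_id &&
           a->dport == b->dport && a->proto == b->proto;
}

/* Linear scan for clarity; a dataplane would use a hash map lookup. */
static enum policy_result policy_check(const struct policy_key *allowed,
                                       size_t n,
                                       const struct policy_key *pkt,
                                       enum policy_mode mode)
{
    for (size_t i = 0; i < n; i++) {
        if (key_eq(&allowed[i], pkt))
            return POLICY_ALLOW;
    }
    return mode == MODE_ENFORCE ? POLICY_DENY : POLICY_AUDIT;
}
```

Counting audit results per destination before switching modes is what surfaces forgotten dependencies such as DNS or metrics scraping.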

Hubble-Style Flow Visibility

Hubble-style flow records summarize network decisions: source identity, destination identity, verdict, protocol, service, TCP flags, DNS names when visible, and policy reason. They are not the same as full packet capture.

| Flow record | Packet capture |
| --- | --- |
| lower volume summary | full bytes or headers |
| policy and identity context | exact wire evidence |
| easier cluster-wide search | expensive at scale |
| may omit payload and timing details | privacy and storage risk |

Use flow visibility for "who talked to whom and was it allowed." Use packet capture for "what exact bytes were on the wire." Use application telemetry for "what request was this and why did it fail."

Packet Capture Tradeoffs

tcpdump reads packets through AF_PACKET sockets, which tap the stack after some kernel path decisions have already been made. Packets dropped by native XDP never appear in a normal host tcpdump because they are discarded before skb allocation. TC drops may or may not appear, depending on the capture point. Hardware offloads such as GRO and TSO can also make captured packets differ from what was on the wire.

Production guidance:

  • know whether the suspected drop is before or after skb allocation
  • capture on both host and pod namespace when possible
  • disable offloads only in controlled windows if they obscure evidence
  • sample captures and redact payloads
  • prefer counters and flow logs for continuous operation

DDoS Mitigation Overview

XDP can drop obvious abusive traffic before expensive allocation and conntrack work. This is useful for volumetric garbage with simple match criteria: invalid ports, known bad prefixes, malformed headers, impossible protocol combinations, or allowlist-only service exposure.
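
A minimal sketch of such simple match criteria, written as user-space C rather than a loadable XDP program. The FLAG_ constants mirror the TCP header flag bits; early_drop and the port allowlist are hypothetical. Treating a segment with no flags set, or SYN combined with FIN or RST, as never legitimate is an assumption to validate against real traffic before dropping on it.

```c
#include <stddef.h>
#include <stdint.h>

#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04

/* Flag combinations assumed to have no legitimate use. */
static int tcp_flags_impossible(uint8_t flags)
{
    if ((flags & 0x3F) == 0)
        return 1; /* no flags set at all */
    if ((flags & FLAG_SYN) && (flags & (FLAG_FIN | FLAG_RST)))
        return 1; /* opening and closing in the same segment */
    return 0;
}

/* Returns 1 to drop early, 0 to pass to the stack. */
static int early_drop(uint8_t tcp_flags, uint16_t dst_port,
                      const uint16_t *open_ports, size_t n)
{
    if (tcp_flags_impossible(tcp_flags))
        return 1;
    for (size_t i = 0; i < n; i++) {
        if (open_ports[i] == dst_port)
            return 0;
    }
    return 1; /* allowlist-only exposure: unknown ports are dropped */
}
```

Both checks need only L4 header fields, which is why they fit at XDP. Anything that needs connection history or payload inspection belongs at a later hook.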

Limits:

  • link saturation still happens before the host sees packets
  • complex L7 detection is not an XDP job
  • maps must update faster than attack shape changes
  • false positives at XDP are silent service denial
  • multi-node mitigation needs consistent policy and telemetry

For production DDoS response, combine upstream filtering, load balancer controls, rate limits, XDP filters, service autoscaling, and incident communication. XDP is one layer, not the whole defense.

Performance Tradeoffs

| Technique | Benefit | Cost |
| --- | --- | --- |
| early XDP drop | avoids skb allocation and later stack work | limited context and parser burden |
| per-CPU counters | low contention | higher memory and user-space aggregation |
| LRU flow maps | bounded memory under churn | evicts state, can break correlation |
| tail-call pipeline | splits logic and improves modularity | harder control-flow debugging |
| event sampling | keeps user-space drain feasible | incomplete forensic detail |
| redirect maps | fast steering | NIC, driver, and map semantics matter |
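
The per-CPU counter tradeoff can be illustrated with a small user-space model. Each CPU increments only its own slot, so the packet path has no contention, and the reader pays by summing across CPUs. The CPU count of 4 and the names below are hypothetical; real user-space code would ask libbpf for the number of possible CPUs and read one value per CPU from the map.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_CPUS  4 /* hypothetical; stands in for the possible-CPU count */
#define MODEL_SLOTS 2 /* for example: slot 0 = passed, slot 1 = dropped */

/* Memory cost is CPUs x slots x 8 bytes, versus slots x 8 when shared. */
static uint64_t percpu_counts[MODEL_CPUS][MODEL_SLOTS];

/* Packet path: touch only this CPU's row, so no atomics are needed. */
static void count_packet(unsigned cpu, unsigned slot)
{
    percpu_counts[cpu][slot]++;
}

/* Reader: aggregate across CPUs at read time. */
static uint64_t read_total(unsigned slot)
{
    uint64_t total = 0;

    for (unsigned cpu = 0; cpu < MODEL_CPUS; cpu++)
        total += percpu_counts[cpu][slot];
    return total;
}
```

The sum is not a consistent snapshot while packets are flowing, which is usually acceptable for rate counters and not acceptable for exact accounting.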

Measure with workload-specific traffic. Synthetic packets can prove parser behavior but rarely prove production latency, CPU, and drop behavior.

Safety Limits

Safety for eBPF networking is broader than verifier safety.

| Limit | Why it matters |
| --- | --- |
| verifier accepted | proves constrained memory safety, not correct networking policy |
| map capacity | determines whether flows, services, or identities can be represented |
| event buffer size | determines visibility under burst |
| CPU budget | packet path work multiplies by packet rate |
| feature support | native XDP, helpers, BTF, and attach types vary |
| offload behavior | driver and hardware paths may differ from generic mode |
| rollback path | pinned maps and links can survive processes |

Debugging eBPF Networking Programs

Start by locating the hook and owner.

sudo bpftool net
sudo bpftool prog show
sudo bpftool map show
sudo bpftool link show
ip link show
tc qdisc show dev eth0
tc filter show dev eth0 ingress
tc filter show dev eth0 egress

Then compare counters at each layer:

ip -s link show dev eth0
ethtool -S eth0 | grep -E 'drop|err|miss|xdp|rx|tx'
nstat -az | grep -E 'Tcp|Udp|Ip|Icmp'
ss -tin
conntrack -S 2>/dev/null

For clusters:

kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide
kubectl -n kube-system logs ds/cilium --tail=200
cilium status
cilium bpf map list
cilium monitor
hubble observe --verdict DROPPED

Use vendor tools when a CNI owns the dataplane. Manual tc or bpftool inspection is valuable, but deleting programs behind a controller can make state worse.

Troubleshooting Decision Table

| Observation | Likely layer | Next action |
| --- | --- | --- |
| tcpdump sees nothing, NIC counters increase | XDP or driver path | check XDP attach, XDP stats, driver mode |
| pod-to-pod fails only across nodes | routing, overlay, service map, MTU | compare direct routing or tunnel config, node routes, CNI maps |
| service IP fails but pod IP works | service load balancing | inspect service maps, kube-proxy replacement status, endpoint readiness |
| DNS blocked after policy rollout | network policy | inspect flow verdicts for kube-dns/CoreDNS and port 53 |
| only new connections fail | conntrack, service backend selection, policy | compare SYN path, conntrack table, service maps |
| high CPU in dataplane agent | map churn, endpoint regeneration, event volume | inspect agent logs, map pressure, flow export rate |
| verifier rejection after upgrade | kernel or compiler difference | capture verifier log and feature probe target kernel |

Local Labs

Local experiments that teach real concepts:

  • attach XDP generic to a veth and drop ICMP
  • count TCP SYN packets in a per-CPU array
  • redirect packets between veth peers with a devmap
  • attach TC ingress to classify DNS traffic
  • use a cgroup connect hook to deny one destination from a test process
  • run a local Cilium cluster on kind and compare kube-proxy mode with kube-proxy-free mode, in disposable environments only

Keep the lab separate from your daily network interface. A wrong XDP return value on Wi-Fi or the primary Ethernet device can cut off your session.

Production Guidance

Before rollout:

  • list exact program types, attach points, and maps
  • feature-probe kernel families and NIC driver modes
  • size maps from expected services, endpoints, identities, and flows
  • define default behavior when maps are missing or full
  • define rollback that detaches programs and cleans pinned maps
  • test upgrades and node reboots

During incidents:

  • avoid deleting unknown BPF programs until ownership is known
  • gather bpftool net, bpftool prog, bpftool map, tc, routes, and CNI status first
  • compare packet counters before and after the suspected hook
  • use sampled flow visibility before broad packet capture
  • treat control-plane nodes as higher risk

eBPF networking is most successful when it is operated as a dataplane with a control plane, not as clever packet code living on a host.