Linux Security Hardening Secrets and Incident Response

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/linux-systems-engineering/12 Linux Security Hardening Secrets and Incident Response.md.

Purpose: Build a production-focused Linux hardening and incident response manual that extends the permissions and LSM model into service hardening, secrets handling, SSH, auditd, patching, CVE response, and bounded incident commands.

Related notes: 08 Permissions Users Groups Capabilities and LSMs, 07 systemd Boot Init Units Timers Journald and Services, 09 cgroups Namespaces Containers and Runtime Isolation, 10 Observability Logs Metrics Tracing and Debugging, 11 Performance Engineering perf Flamegraphs and Capacity, 05 Linux Networking TCP IP Routing Firewalling and DNS, 17 Production Operations Troubleshooting and Runbooks

This note assumes the base model in 08 Permissions Users Groups Capabilities and LSMs: Linux security starts with users, groups, mode bits, capabilities, namespaces, seccomp, and LSMs. This note focuses on hardening and response. The practical goal is to reduce attacker freedom, preserve operator access, keep secrets out of places they do not belong, patch with discipline, and collect evidence without destroying the state needed for root cause.

On a local learning machine, it is acceptable to break SSH, experiment with audit rules, mount debug interfaces, run vulnerable services in throwaway labs, and practice incident commands. On production Linux hosts, hardening must be staged, reversible, observable, and compatible with recovery access. On production clusters, host security includes kubelet, container runtime, CNI, CSI, node credentials, service account tokens, image supply chain, admission policy, and the fact that every container shares the node kernel.

Security Posture by Environment

| Environment | Bias | Acceptable experiments | Production boundary |
| --- | --- | --- | --- |
| Local learning machine | learn by breaking and rebuilding | permissive audit rules, SSH lockout recovery, bpftrace, vulnerable labs | do not treat local root habits as fleet practice |
| Production Linux host | least privilege, recoverability, evidence | staged hardening, canary audit rules, controlled packet capture | avoid unreviewed privilege, broad debug surfaces, and destructive cleanup |
| Production cluster | node plus orchestrator security | policy dry runs, canary nodes, runtime profiles | containers share the kernel and cluster credentials expand blast radius |

Hardening is only useful when operators can still deploy, rotate, patch, recover, and investigate. A host that is locked down but impossible to patch or inspect will fail under real incidents.

Hardening Layers

| Layer | Control examples | Failure if ignored |
| --- | --- | --- |
| Identity | dedicated service users, sudo policy, PAM, MFA upstream | shared accounts and weak attribution |
| Filesystem | ownership, mode bits, mount options, immutable baselines | writable config, secret leakage, persistence |
| Process privilege | capabilities, NoNewPrivileges, seccomp, LSMs | root-equivalent service compromise |
| Service manager | systemd sandboxing, resource limits, restart policy | daemon escape, noisy failure, weak recovery |
| Network | SSH policy, firewall, bind addresses, segmentation | exposed admin planes and lateral movement |
| Secrets | external secret store, short lifetime, rotation, redaction | long-lived credentials in files and logs |
| Audit | auditd rules, journal, EDR, file integrity | no evidence after compromise |
| Patch | CVE triage, kernel and package updates, reboot process | known exploit window remains open |
| Incident response | containment, evidence, recovery, lessons | destructive panic and repeated compromise |

Service Hardening with systemd

systemd service hardening is a practical way to apply kernel controls per daemon. It does not replace application security, but it can reduce what a compromised process can read, write, execute, or ask the kernel to do.

Example service profile:

[Service]
User=example
Group=example
UMask=0077
NoNewPrivileges=yes
PrivateTmp=yes
PrivateDevices=yes
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/lib/example /var/log/example
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
SystemCallFilter=@system-service
SystemCallArchitectures=native
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectControlGroups=yes
RestrictSUIDSGID=yes

Tradeoffs:

| Control | Benefit | Risk |
| --- | --- | --- |
| User= and Group= | removes default root execution | file ownership and low-port binding need planning |
| NoNewPrivileges=yes | blocks privilege gain through exec | breaks programs that rely on setuid transitions |
| ProtectSystem=strict | makes most OS paths read-only | requires explicit writable paths |
| PrivateTmp=yes | isolates temporary files | breaks sharing through /tmp |
| CapabilityBoundingSet= | limits kernel privilege bits | wrong capability set breaks legitimate operations |
| SystemCallFilter= | reduces syscall surface | incomplete profiles fail at runtime |
| RestrictAddressFamilies= | narrows network protocol use | breaks DNS, Unix sockets, or IPv6 if omitted |
| ProtectKernelModules=yes | blocks module loading by service | not useful if service never had that privilege |

Production workflow:

  1. Run systemd-analyze security example.service for a rough exposure review.
  2. Add controls in small batches.
  3. Test under representative workload.
  4. Check journalctl -u example.service for sandbox denials or startup failures.
  5. Record why each exception exists.
  6. Keep a rollback drop-in ready.
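
The "small batches" and "rollback drop-in" steps above can be sketched as a helper that stages one batch of controls in a unit drop-in. This is a minimal sketch, not a definitive procedure: the function name write_hardening_dropin, the file name 10-hardening.conf, and the choice of controls in the first batch are assumptions made for illustration; example.service and /var/lib/example come from the profile above.

```shell
# Sketch: write one small batch of sandboxing controls as a systemd drop-in.
# Arguments: unit name, optional drop-in root (defaults to /etc/systemd/system).
write_hardening_dropin() {
  unit="$1"
  root="${2:-/etc/systemd/system}"
  dir="$root/$unit.d"
  mkdir -p "$dir" || return 1
  # Hypothetical first batch; record the reason for every exception here.
  cat > "$dir/10-hardening.conf" <<'EOF'
[Service]
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/example
EOF
  # Print the path so the operator can note it in the change record.
  echo "$dir/10-hardening.conf"
}

# Typical use on a canary host (run as root, then reload and restart):
#   write_hardening_dropin example.service
#   systemctl daemon-reload && systemctl restart example.service
# Rollback: delete the printed file, then daemon-reload and restart again.
```

Keeping each batch in its own numbered drop-in means a rollback removes one file rather than editing the packaged unit.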

Common mistake: copying a maximal hardening block into every service. A web server, backup agent, hardware monitor, database, and container runtime need different access. Harden from the service contract, not from a generic checklist.

Secrets Handling

A secret is any value that grants access or proves identity: passwords, API keys, private keys, tokens, cookies, database URLs, cloud credentials, kubeconfigs, signing keys, recovery codes, and session material.

Rules:

  • keep secrets out of Git, shell history, screenshots, tickets, and logs
  • prefer a managed secret store with audit, access policy, and rotation
  • use short-lived credentials where practical
  • scope secrets to the smallest service, tenant, environment, and action
  • rotate after exposure, role change, host compromise, or suspicious access
  • avoid putting secrets in command-line arguments because process listings can expose them
  • treat packet captures, heap dumps, core dumps, and debug logs as secret-bearing artifacts
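
As an illustration of the logging and artifact rules above, the following sketch masks values that follow a few common key names before text is pasted into a ticket or shipped to a log. The function name redact_secrets and the key list are assumptions, not a complete pattern set; real redaction usually belongs in the logging pipeline and will still miss secrets that do not follow a key=value shape.

```shell
# Sketch: mask values after well-known key names (hypothetical, incomplete list).
# Reads text on stdin and writes the redacted text to stdout.
redact_secrets() {
  sed -E 's/((password|passwd|token|secret|api_key)[=:][[:space:]]*)[^[:space:]&"]+/\1[REDACTED]/g'
}

# Example:
#   journalctl -u example.service --no-pager | redact_secrets > excerpt.txt
```

The match is case-sensitive and only covers the listed keys, so it reduces accidental leakage rather than guaranteeing a clean artifact.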

Storage tradeoffs:

| Location | Use | Risk |
| --- | --- | --- |
| environment variable | common for simple service config | inherited by children, visible in some process contexts, dumped in diagnostics |
| root-owned file 0600 | stable host secret | backup and file permission risk |
| systemd credential | better unit-scoped secret delivery on supporting systems | version and operational support vary |
| external secret manager | audit, rotation, centralized policy | availability and bootstrap dependency |
| Kubernetes Secret | integrates with cluster workloads | base64 is not encryption; node and RBAC access matter |
| command argument | almost never justified | visible in process listings and shell history |
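
For the root-owned 0600 file case, a small check can catch permission drift before a service starts or during a review. This is a sketch under assumptions: the function name check_secret_file is hypothetical, it relies on stat -c as provided by GNU coreutils or BusyBox, and it treats modes 600 and 400 as the only acceptable values.

```shell
# Sketch: verify that a secret file is owned by the expected UID and is not
# readable by group or others. Arguments: path, optional expected UID (default 0).
check_secret_file() {
  path="$1"
  want_uid="${2:-0}"
  [ -f "$path" ] || { echo "missing: $path" >&2; return 2; }
  mode="$(stat -c '%a' "$path")" || return 2
  uid="$(stat -c '%u' "$path")" || return 2
  case "$mode" in
    600|400) ;;
    *) echo "too permissive: $path mode=$mode" >&2; return 1 ;;
  esac
  if [ "$uid" != "$want_uid" ]; then
    echo "unexpected owner: $path uid=$uid want=$want_uid" >&2
    return 1
  fi
  return 0
}

# Example (the path is illustrative):
#   check_secret_file /etc/example/secret.key || echo "fix before deploy"
```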

Incident response for exposed secrets:

  1. Identify exact secret and scope.
  2. Revoke or rotate it.
  3. Search logs, repos, tickets, and artifacts for copies.
  4. Review access logs for use before and after exposure.
  5. Replace deployment paths that reintroduce the old value.
  6. Document blast radius and residual risk.

SSH Hardening

SSH is usually the highest-value administrative path on a Linux host. Harden it without locking out recovery.

Baseline directions:

PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitEmptyPasswords no
AllowUsers alice bob
X11Forwarding no
AllowTcpForwarding no
PermitTunnel no
ClientAliveInterval 300
ClientAliveCountMax 2
LogLevel VERBOSE

Validate the configuration with sshd -t before reloading (on Debian and Ubuntu the unit is named ssh rather than sshd):

sudo sshd -t
sudo systemctl reload sshd
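
A syntax check does not show which values are actually in effect once includes and defaults are merged. The sketch below compares the effective configuration printed by sshd -T (lowercase "keyword value" lines) against a few of the baseline directions listed earlier. The function name check_sshd_baseline and the chosen subset of directives are assumptions for illustration.

```shell
# Sketch: read `sshd -T` output on stdin and report baseline mismatches.
# Exit status is 0 when every checked directive has the wanted value.
check_sshd_baseline() {
  awk '
    BEGIN {
      want["permitrootlogin"] = "no"
      want["passwordauthentication"] = "no"
      want["permitemptypasswords"] = "no"
      want["x11forwarding"] = "no"
    }
    ($1 in want) {
      seen[$1] = 1
      if ($2 != want[$1]) {
        printf "MISMATCH %s: got %s, want %s\n", $1, $2, want[$1]
        bad = 1
      }
    }
    END {
      for (k in want) if (!(k in seen)) { printf "MISSING %s\n", k; bad = 1 }
      exit (bad ? 1 : 0)
    }
  '
}

# Example:
#   sudo sshd -T | check_sshd_baseline && sudo systemctl reload sshd
```

Without a connection specification, sshd -T reports the global settings; Match blocks may still override them for particular users or addresses, so this check complements rather than replaces a test login from a new session.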

Production workflow:

  • keep an existing root or admin session open while changing SSH
  • confirm console or out-of-band recovery
  • use drop-in config where the distribution supports it
  • deploy to canaries before fleet rollout
  • log authentication centrally
  • prefer hardware-backed or centrally managed keys for sensitive fleets
  • remove stale authorized keys
  • restrict bastion access and port forwarding

Common mistakes:

| Mistake | Consequence | Correction |
| --- | --- | --- |
| disabling passwords without key validation | lockout | test a new session before closing old one |
| allowing root login for convenience | larger brute-force and post-compromise impact | use named accounts plus sudo |
| unmanaged authorized_keys | stale access | central inventory and rotation |
| broad agent forwarding | credential theft path | avoid or restrict agent forwarding |
| SSH from every network | exposed admin surface | bind, firewall, VPN, bastion, or zero trust access |

In production clusters, SSH may be intentionally disabled on nodes. That is fine only if there is a supported node debug, console, or break-glass path.

auditd

auditd is the userspace component of the Linux Audit system. It writes audit records to disk, while rules are loaded into the kernel through auditctl or rule files. auditd is not a complete detection platform, but it is useful for high-value host evidence: identity changes, sudoers edits, secret file reads, module loads, time changes, audit config changes, and suspicious exec paths.

Commands:

sudo auditctl -s
sudo auditctl -l
sudo ausearch -m USER_LOGIN --success no -i
sudo ausearch -k identity -i
sudo aureport --summary

Example rules:

-w /etc/passwd -p wa -k identity
-w /etc/group -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/sudoers -p wa -k privilege
-w /etc/sudoers.d/ -p wa -k privilege
-w /etc/ssh/sshd_config -p wa -k ssh_config
-a always,exit -F arch=b64 -S init_module,finit_module,delete_module -k kernel_modules
-a always,exit -F arch=b64 -S adjtimex,settimeofday,clock_settime -k time_change

Tradeoffs:

| Audit choice | Benefit | Cost |
| --- | --- | --- |
| watch high-value files | clear evidence of config tampering | misses equivalent changes elsewhere |
| audit execve broadly | strong process evidence | high volume and sensitive arguments |
| immutable audit rules | harder attacker tampering | harder emergency changes |
| central forwarding | preserves evidence after host loss | network and collector dependency |

Production guidance:

  • test audit rules under load
  • watch for backlog drops
  • avoid broad path watches on high-churn trees
  • protect audit logs from local deletion through forwarding or immutable storage
  • include audit rule changes in change control
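
Watching for backlog drops can start with the status output of auditctl -s, which prints one "name value" pair per line, including a lost counter. The sketch below extracts that counter so it can be sampled and compared over time; the function name audit_lost_count is an assumption, and the alerting threshold is left to the operator.

```shell
# Sketch: print the "lost" counter from `auditctl -s` output read on stdin.
# Exit status is 2 if no "lost" line was found.
audit_lost_count() {
  awk '$1 == "lost" { print $2; found = 1 } END { exit (found ? 0 : 2) }'
}

# Example: sample during a load test and compare with an earlier value.
#   lost="$(sudo auditctl -s | audit_lost_count)" && echo "lost audit events: $lost"
```

A value that grows between samples suggests the backlog limit or the rule volume needs review before the rules are rolled out further.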

Patching and CVE Response

Patch response is operational risk management. Not every CVE has the same exposure, exploitability, or mitigation path. Linux distributions often backport fixes without changing upstream version numbers, so version checks must account for distro advisories and package changelogs.
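
Because of backporting, one practical check is whether the installed package's changelog mentions the CVE identifier, rather than comparing upstream version numbers. The following sketch is one way to do that, assuming the changelog text is piped in; the function name changelog_mentions_cve is hypothetical, and the distribution's security advisory remains the authoritative source.

```shell
# Sketch: succeed if the changelog on stdin mentions the given CVE ID as a
# whole word (so CVE-2099-0001 does not match CVE-2099-00012).
changelog_mentions_cve() {
  cve="$1"
  grep -q -F -w -- "$cve"
}

# Examples (package name and CVE ID are placeholders):
#   rpm -q --changelog openssl | changelog_mentions_cve CVE-XXXX-NNNN
#   zcat /usr/share/doc/openssl/changelog.Debian.gz | changelog_mentions_cve CVE-XXXX-NNNN
```

A missing mention does not prove the host is vulnerable, since some fixes arrive through rebases or are described only in advisories.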

CVE triage:

| Question | Why it matters |
| --- | --- |
| Is the affected package or kernel code present? | avoids false positives |
| Is the vulnerable feature enabled or reachable? | exposure depends on config and workload |
| Is exploitation local, remote, authenticated, or privileged? | drives urgency |
| Is there a public exploit or active exploitation? | changes response priority |
| Is a vendor fix available for this distro release? | determines patch path |
| Does mitigation exist before patching? | buys time but may reduce functionality |
| Does patching require restart or reboot? | affects maintenance and failover |

Commands:

uname -a
cat /etc/os-release
systemctl list-units --type=service --state=running --no-pager
rpm -qa --last 2>/dev/null | head
dpkg-query -W 2>/dev/null | head
needrestart -r l 2>/dev/null   # list only; -r a would restart services automatically

Production patch workflow:

  1. Confirm exposure using distro advisories, package state, kernel config, and feature use.
  2. Choose mitigation, patch, isolation, or shutdown.
  3. Patch canary hosts first.
  4. Validate service health and security control health.
  5. Roll through the fleet with monitoring.
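  6. Restart affected services after libc, OpenSSL, or container runtime updates so that no process keeps the old code mapped in memory; reboot when the kernel or another component that cannot be restarted in place requires it.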
  7. Record residual risk and exceptions.
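
The reboot decision in the workflow above can be supported by a small hint function. This sketch assumes two common conventions: Debian and Ubuntu create /var/run/reboot-required when a package requests a reboot, and needs-restarting -r on RPM-based systems exits non-zero when a reboot is advised. The function name reboot_hint and its output strings are assumptions.

```shell
# Sketch: print a coarse reboot hint. Optional argument: path of the flag file.
reboot_hint() {
  flag="${1:-/var/run/reboot-required}"
  if [ -e "$flag" ]; then
    echo "reboot-required"
  elif command -v needs-restarting >/dev/null 2>&1; then
    # needs-restarting -r exits 0 when no reboot is needed.
    if needs-restarting -r >/dev/null 2>&1; then
      echo "no-reboot-flagged"
    else
      echo "reboot-required"
    fi
  else
    echo "unknown"
  fi
}

# Example:
#   reboot_hint
```

An "unknown" or "no-reboot-flagged" result is only a hint; long-running processes may still hold old library versions until they are restarted.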

Kernel CVEs deserve special care because containers share the host kernel. A container escape or local privilege escalation on a node can become a cluster incident if node credentials, kubelet permissions, or cloud metadata are reachable.

Incident Response Principles

Security incidents need containment and evidence, not panic cleanup. The wrong command can destroy forensic state, rotate logs, kill the only process that shows the attack path, or tip off an attacker before containment.

Phases:

| Phase | Goal | Examples |
| --- | --- | --- |
| identify | determine whether abnormal activity is security-relevant | suspicious process, new user, odd network connection |
| contain | stop spread or damage | isolate host, revoke secret, block route |
| preserve | capture volatile evidence | process, network, logs, memory policy if available |
| eradicate | remove persistence and vulnerability | patch, rebuild, remove unauthorized access |
| recover | restore service safely | redeploy, rotate, monitor |
| learn | prevent repeat | controls, alerts, runbooks |

Production rule: prefer rebuild from trusted images over hand-cleaning a compromised host. Cleaning may be useful for learning, but recovery should assume the host is untrusted until reimaged or otherwise verified through an approved process.

Incident Commands

Use these as starting points. Record time, host, operator, command, and output destination.
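
One way to record time, host, operator, command, and output destination consistently is a thin wrapper around each command. The sketch below is an assumption-laden example rather than a forensic standard: the names run_logged, EVIDENCE_DIR, commands.tsv, and SHA256SUMS are hypothetical, and in a real incident the evidence directory should live on separate or remote storage so that collection does not overwrite state on the suspect disk.

```shell
# Sketch: run a command, save its output to a unique file, and append a
# tab-separated record (UTC time, host, operator, command, output path).
EVIDENCE_DIR="${EVIDENCE_DIR:-./evidence}"

run_logged() {
  mkdir -p "$EVIDENCE_DIR" || return 1
  ts="$(date -u +%Y%m%dT%H%M%SZ)"
  host="$(uname -n)"
  operator="${SUDO_USER:-$(id -un 2>/dev/null || id -u)}"
  out="$(mktemp "$EVIDENCE_DIR/${ts}_XXXXXX")" || return 1
  printf '%s\t%s\t%s\t%s\t%s\n' "$ts" "$host" "$operator" "$*" "$out" \
    >> "$EVIDENCE_DIR/commands.tsv"
  "$@" > "$out" 2>&1
  rc=$?
  # Hash the output so later tampering with the file is detectable.
  sha256sum "$out" >> "$EVIDENCE_DIR/SHA256SUMS"
  return "$rc"
}

# Example:
#   EVIDENCE_DIR=/mnt/evidence/host01 run_logged ss -tanp
```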

Host identity and time:

date -Is
hostnamectl
who -a
w
last -a | head -n 30
sudo lastb -a | head -n 30

Process and service state:

systemctl --failed --no-pager
systemctl list-units --type=service --state=running --no-pager
ps -eo pid,ppid,user,group,state,lstart,comm,args --sort=pid
pstree -ap

Network state:

ss -tulpen
ss -tanp
ip addr
ip route
ip rule
nft list ruleset 2>/dev/null
iptables-save 2>/dev/null

Persistence checks:

crontab -l 2>/dev/null
sudo ls -la /etc/cron* /var/spool/cron 2>/dev/null
systemctl list-timers --all --no-pager
find /etc/systemd/system -type f -mtime -30 -ls
find /usr/local/bin /usr/local/sbin -type f -mtime -30 -ls 2>/dev/null

Identity and privilege:

getent passwd
getent group
sudo -l -U example 2>/dev/null
find / -xdev \( -perm -4000 -o -perm -2000 \) -type f -printf '%m %u %g %p\n' 2>/dev/null
getcap -r / 2>/dev/null
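
The setuid and capability listings are easier to interpret when compared with a known-good inventory taken from a trusted image or an earlier snapshot. The sketch below prints entries present in the current listing but absent from the baseline; the function name new_since_baseline and the one-path-per-line file format are assumptions.

```shell
# Sketch: print lines that appear in the current inventory but not in the
# baseline. Arguments: baseline file, current file (one path per line).
new_since_baseline() {
  baseline="$1"
  current="$2"
  a="$(mktemp)" || return 1
  b="$(mktemp)" || { rm -f "$a"; return 1; }
  # comm needs sorted input; a fixed locale keeps sort and comm consistent.
  LC_ALL=C sort -u "$baseline" > "$a"
  LC_ALL=C sort -u "$current" > "$b"
  LC_ALL=C comm -13 "$a" "$b"
  rm -f "$a" "$b"
}

# Example:
#   find / -xdev \( -perm -4000 -o -perm -2000 \) -type f -print 2>/dev/null > current.txt
#   new_since_baseline baseline.txt current.txt
```

Entries reported here are leads to verify against package ownership, not proof of compromise, since package updates also add or move setuid files.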

Logs:

journalctl --since '24 hours ago' --no-pager
journalctl -p warning..alert --since '24 hours ago' --no-pager
journalctl -u sshd --since '24 hours ago' --no-pager
journalctl -k --since '24 hours ago' --no-pager
sudo ausearch --start today -i 2>/dev/null

Filesystem triage:

find /tmp /var/tmp /dev/shm -xdev -type f -mtime -7 -ls 2>/dev/null
find / -xdev -type f -mtime -1 -ls 2>/dev/null | head -n 200
find / -xdev -type f -perm -0002 -ls 2>/dev/null | head -n 200

Container and cluster node context:

crictl ps 2>/dev/null
crictl pods 2>/dev/null
ctr -n k8s.io containers list 2>/dev/null
systemctl status kubelet --no-pager 2>/dev/null
journalctl -u kubelet --since '2 hours ago' --no-pager 2>/dev/null

Do not run destructive cleanup commands such as deleting unknown files, killing unknown processes, clearing logs, flushing firewall state, or rotating all credentials until containment strategy is agreed.

Suspicious Findings

| Finding | Possible benign cause | Security concern |
| --- | --- | --- |
| unknown listening port | new service, debug server | backdoor, exposed admin API |
| new setuid file | package update | privilege escalation persistence |
| unexpected capability on binary | package feature | root-equivalent helper |
| shell history disabled | privacy config | anti-forensics |
| deleted executable still running | package upgrade | memory-only implant |
| SSH login from unusual ASN | travel, VPN change | credential compromise |
| audit backlog drops | load spike | evidence loss or evasion |
| unknown BPF program | observability agent | rootkit or policy bypass |

For unknown BPF programs:

sudo bpftool prog show
sudo bpftool map show
sudo bpftool link show

Do not unload BPF programs before identifying whether they belong to CNI, observability, security, or traffic control. In clusters, removing the wrong program can break networking.

Hardening Checklist

| Area | Production control |
| --- | --- |
| accounts | no shared human accounts, dedicated service users, reviewed privileged groups |
| sudo | least privilege rules, no writable scripts, logged elevation |
| SSH | no root login, key-based auth, tested reloads, restricted forwarding |
| services | non-root where possible, systemd sandboxing, resource limits |
| filesystem | correct ownership, no world-writable service paths, setuid inventory |
| secrets | secret store, scoped credentials, rotation, no command-line secrets |
| logs | persistent enough for incidents, centrally forwarded, redaction policy |
| audit | high-value file watches, module and time-change rules, backlog monitoring |
| patching | CVE triage, canary patches, reboot discipline |
| cluster nodes | restricted debug access, kubelet and runtime hardening, node credential protection |

Common Mistakes

| Mistake | Impact | Better practice |
| --- | --- | --- |
| hardening without recovery path | operator lockout | console, break-glass, canary rollout |
| storing secrets in .env files forever | credential sprawl | secret manager and rotation |
| treating auditd as magic detection | false confidence | targeted rules plus central analysis |
| patching only packages, never rebooting | vulnerable code remains in memory | restart or reboot based on affected component |
| clearing logs during incident | evidence destruction | preserve and restrict access |
| editing SSH on all hosts at once | fleet lockout | canary and test new sessions |
| giving services CAP_SYS_ADMIN | root-like power | redesign or use narrower capabilities |
| debugging containers as sandboxes | missed kernel and node risk | treat containers as processes sharing the host kernel |

Troubleshooting Hardening Failures

Service fails after hardening:

systemctl status example.service --no-pager
journalctl -u example.service -b --no-pager
systemd-analyze security example.service

Check:

  • missing writable path after ProtectSystem=strict
  • blocked syscall after SystemCallFilter
  • missing address family after RestrictAddressFamilies
  • missing capability after CapabilityBoundingSet
  • denied home access after ProtectHome
  • temp file sharing broken by PrivateTmp
  • LSM denial in audit or journal logs

SSH reload fails:

sudo sshd -t
sudo journalctl -u sshd -b --no-pager
sudo systemctl reload sshd

Check:

  • syntax error
  • unsupported directive for distro version
  • match block ordering
  • include directory precedence
  • PAM or authorized keys path issue

Audit volume too high:

sudo auditctl -s
sudo aureport --summary
sudo ausearch --start recent -i | head

Check:

  • broad exec rules
  • watched high-churn directories
  • missing filters by architecture, UID, or path
  • backlog limit and rate settings
  • central collector throughput

Production Guidance

Hardening changes should be deployed like reliability changes:

  • define expected behavior before rollout
  • stage on canary hosts
  • keep an active rollback path
  • monitor logs, audit backlog, service health, and support tickets
  • document exceptions
  • revisit exceptions after incidents and upgrades

Incident response should be practiced on local and staging systems. The first time an operator runs ausearch, bpftool, ss, journalctl, or recovery console steps should not be during a live compromise.

Reference Anchors

  • systemd.exec defines many service sandboxing and privilege controls.
  • sshd_config defines OpenSSH server authentication, login, forwarding, and access directives.
  • Linux audit man pages define auditd, auditctl, ausearch, and audit rule handling.
  • Linux kernel security bug documentation describes kernel security reporting and coordination.
  • Linux man pages for capabilities, seccomp, ptrace, and syscalls explain privilege and tracing boundaries.
  • systemd journal documentation supports incident log collection and time-bounded queries.
  • bpftool documentation supports inspection of BPF programs, maps, and links during security triage.