Hardening Linux servers running GPU inference and training workloads. Covers SSH lockdown, Docker rootless mode, NVIDIA driver security, systemd sandboxing, audit logging, and network segmentation for AI infrastructure.
Tyler McDaniel
AI Engineer & IBM Business Partner
GPU servers running inference workloads are some of the most valuable targets on any network. They have expensive hardware, they process sensitive data (the whole reason you're self-hosting instead of using an API), and they're often set up by ML engineers who optimize for CUDA drivers, not security baselines. I've audited GPU servers at three organizations and found the same pattern every time: root SSH with password auth, Docker running as root with the socket exposed to all users, no firewall rules beyond the cloud provider's security group, and NVIDIA drivers installed from a random .run file downloaded over HTTP.
Linux server hardening for AI workloads follows the same principles as any server hardening — least privilege, defense in depth, audit everything — but the GPU stack introduces specific attack surface that generic hardening guides miss.
Key-only SSH authentication is the floor, not the ceiling.
Edit /etc/ssh/sshd_config:
```
# Disable password and root login
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
UsePAM yes

# Restrict to specific users/groups
AllowGroups ssh-users

# Use only strong key exchange and ciphers
KexAlgorithms sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com

# Limit authentication attempts
MaxAuthTries 3
LoginGraceTime 30

# Disable X11, TCP, and agent forwarding by default
X11Forwarding no
AllowTcpForwarding no
AllowAgentForwarding no

# Log verbosely
LogLevel VERBOSE

# Close idle connections
ClientAliveInterval 300
ClientAliveCountMax 2
```
After editing:
```bash
# Validate config before restarting
sudo sshd -t

# Restart
sudo systemctl restart sshd
```
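If you manage more than a couple of hosts, a quick sanity check of these directives can be scripted. A minimal sketch in Python — the directive names are real OpenSSH settings, but the checker itself is illustrative and no substitute for `sshd -t`:

```python
# Illustrative audit of an sshd_config against a hardened baseline.
# Parses "Directive value" lines, skipping blanks and comments.

REQUIRED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
    "maxauthtries": "3",
}

def parse_sshd_config(text: str) -> dict:
    """Return {lowercased directive: value} for non-comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key.lower()] = value.strip()
    return settings

def audit(text: str) -> list[str]:
    """Return the directives that differ from the hardened baseline."""
    settings = parse_sshd_config(text)
    return [k for k, want in REQUIRED.items() if settings.get(k) != want]
```

Run it against `/etc/ssh/sshd_config` on each host; an empty list means the checked directives match the baseline.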
Certificate-based SSH is the next level. Instead of distributing public keys to every server, you sign short-lived certificates with a CA. The server trusts the CA; clients present certificates that expire in 8-24 hours. When someone leaves the team, you don't have to scrub their key from 50 servers — their certificates just stop being issued.
```bash
# On your CA machine:
# Sign a user's key for 8 hours
ssh-keygen -s /path/to/ca_key -I "tyler@company" -n tyler -V +8h ~/.ssh/id_ed25519.pub

# On the server: trust the CA
echo "TrustedUserCAKeys /etc/ssh/ca.pub" >> /etc/ssh/sshd_config
```
[Smallstep](https://smallstep.com/docs/ssh/) and [Teleport](https://goteleport.com/) automate this. For teams of 3+, certificate-based SSH is worth the setup cost.
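The operational win is that validity is a property of the certificate itself, not of key files scattered across servers. A toy illustration of the window arithmetic (the timestamps are made up):

```python
from datetime import datetime, timedelta

# Toy model of a short-lived cert: signed with -V +8h, it is only
# accepted inside an 8-hour window and expires on its own.

def is_valid(now: datetime, issued_at: datetime,
             validity: timedelta = timedelta(hours=8)) -> bool:
    return issued_at <= now <= issued_at + validity

issued = datetime(2025, 1, 6, 9, 0)                      # signed at 09:00
assert is_valid(datetime(2025, 1, 6, 15, 0), issued)     # same afternoon: ok
assert not is_valid(datetime(2025, 1, 7, 9, 0), issued)  # next morning: expired
```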
GPU servers typically expose an inference API (port 8000, 8080, 11434 for Ollama) plus SSH. Everything else should be blocked.
Using ufw (Uncomplicated Firewall):
```bash
# Reset to clean state
sudo ufw reset

# Default deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH from specific IP range only (your office/VPN)
sudo ufw allow from 10.0.0.0/24 to any port 22 proto tcp

# Allow inference API from internal network only
sudo ufw allow from 10.0.0.0/24 to any port 8000 proto tcp
sudo ufw allow from 10.0.0.0/24 to any port 8080 proto tcp

# If using Prometheus node exporter for monitoring
sudo ufw allow from 10.0.1.0/24 to any port 9100 proto tcp

# Enable
sudo ufw enable
sudo ufw status verbose
```
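The same subnet restriction can be repeated at the application layer as defense in depth — if a firewall rule is ever fat-fingered, the app still refuses. A sketch using Python's `ipaddress` module; the `ALLOWED_NETS` list mirrors the ufw rules above, and the function name is my own:

```python
import ipaddress

# App-level mirror of the firewall's source-subnet restriction.
# These networks match the ufw rules; adjust to your topology.
ALLOWED_NETS = [
    ipaddress.ip_network("10.0.0.0/24"),
    ipaddress.ip_network("127.0.0.0/8"),
]

def source_allowed(ip: str) -> bool:
    """True if the client IP falls inside an allowed subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_NETS)
```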
For production, use iptables or nftables directly for more control:
```
# /etc/iptables/rules.v4
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]

# Allow established connections
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow loopback
-A INPUT -i lo -j ACCEPT

# Allow SSH from VPN only
-A INPUT -p tcp --dport 22 -s 10.0.0.0/24 -j ACCEPT

# Allow inference API from internal network
-A INPUT -p tcp --dport 8000 -s 10.0.0.0/24 -j ACCEPT
-A INPUT -p tcp --dport 8080 -s 10.0.0.0/24 -j ACCEPT

# Allow ICMP (ping) from internal
-A INPUT -p icmp -s 10.0.0.0/24 -j ACCEPT

# Log and drop everything else
-A INPUT -j LOG --log-prefix "IPT_DROP: " --log-level 4
-A INPUT -j DROP
COMMIT
```
Never expose inference ports to the public internet. Even with API key authentication, an exposed LLM endpoint is an invitation for prompt injection attacks, resource exhaustion, and data exfiltration. Put it behind a reverse proxy (nginx, Caddy) with rate limiting, or behind a VPN.
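If you terminate requests in your own proxy code, rate limiting can be as simple as a token bucket per client. A hedged sketch — the class name and the limits are placeholders, not from any particular library:

```python
import time

class TokenBucket:
    """Simple token bucket: allow a burst of `capacity` requests,
    then refill at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: float = 10, refill_per_sec: float = 2):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice you would keep one bucket per API key or source IP and return HTTP 429 when `allow()` is False.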
Docker is how most ML engineers deploy inference. [The NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) (nvidia-ctk) exposes GPUs to containers. The default configuration has significant security gaps.
Docker traditionally requires root. A container escape with root Docker gives the attacker root on the host. [Rootless Docker](https://docs.docker.com/engine/security/rootless/) runs the daemon as a regular user:
```bash
# Install rootless Docker
dockerd-rootless-setuptool.sh install

# Configure NVIDIA runtime for rootless mode
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker
```
Rootless Docker with GPU access landed in NVIDIA Container Toolkit 1.14+. It works. Use it.
The Docker socket (/var/run/docker.sock) is equivalent to root access. Anyone who can write to it can start a privileged container that mounts the host filesystem. Don't add users to the docker group casually.
```bash
# Check who has Docker access
getent group docker

# If someone shouldn't be there
sudo gpasswd -d username docker
```
For CI/CD pipelines that need to build containers, use a Docker-in-Docker setup or a remote Docker host rather than mounting the socket.
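When auditing group membership across a fleet, parsing `getent` output is handy. A small helper sketch (the function name is illustrative); `getent group` prints `name:passwd:gid:member,member`:

```python
def group_members(getent_line: str) -> list[str]:
    """Parse one line of `getent group` output into a member list."""
    fields = getent_line.strip().split(":")
    members = fields[3] if len(fields) > 3 else ""
    return [m for m in members.split(",") if m]
```

Feed it the output of `getent group docker` from each host and diff the result against your approved list.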
```yaml
# docker-compose.yml hardening
services:
  inference:
    image: your-inference-image:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    read_only: true            # Read-only root filesystem
    tmpfs:                     # Writable temp directories as tmpfs
      - /tmp:size=1G
      - /var/tmp:size=512M
    security_opt:
      - no-new-privileges:true # Prevent privilege escalation
    cap_drop:
      - ALL                    # Drop all Linux capabilities
    cap_add:
      - SYS_NICE               # Needed for GPU priority scheduling
    user: "1000:1000"          # Run as non-root user
    networks:
      - internal               # Isolated network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  internal:
    internal: true             # No external access by default
```
Key settings:
- `read_only: true` prevents the container from writing to its filesystem. If an attacker gets code execution, they can't persist malware.
- `no-new-privileges` blocks setuid binaries and privilege escalation.
- `cap_drop: ALL` removes all Linux capabilities. GPU access doesn't need NET_RAW, SYS_ADMIN, or any of the other capabilities that enable container escapes.
- `SYS_NICE` is needed for CUDA's thread priority management. Without it, GPU scheduling can degrade.

The NVIDIA kernel module (nvidia.ko) runs with kernel privileges. A vulnerability in the driver is a direct path to root. This is not theoretical: NVIDIA has issued [security bulletins](https://www.nvidia.com/en-us/security/) for driver vulnerabilities that allow local privilege escalation.
Keep drivers updated. Check for security updates monthly:
```bash
# Check current driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# On Ubuntu, update via package manager (preferred over .run files)
sudo apt update
sudo apt install --only-upgrade nvidia-driver-550
```
Never install drivers from .run files in production. The runfile installer bypasses the package manager, doesn't receive security updates automatically, and can break on kernel updates. Use your distro's package manager or the [CUDA repository](https://developer.nvidia.com/cuda-downloads).
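The monthly check can be scripted by comparing the installed version (the `nvidia-smi` query above) against the minimum patched version from NVIDIA's latest bulletin. A sketch — any concrete version numbers should come from the actual bulletin, not from here:

```python
def version_tuple(v: str) -> tuple:
    """Parse a dotted driver version like '550.54.14' into ints."""
    return tuple(int(p) for p in v.strip().split("."))

def driver_up_to_date(installed: str, minimum: str) -> bool:
    """True if the installed driver meets the minimum patched version."""
    return version_tuple(installed) >= version_tuple(minimum)
```

Wire the first argument to the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and alert when it returns False.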
Restrict /dev/nvidia* access. By default, NVIDIA device files are world-readable. Limit access to users who need GPU:
```
# /etc/udev/rules.d/70-nvidia.rules
KERNEL=="nvidia*", GROUP="gpu-users", MODE="0660"
KERNEL=="nvidiactl", GROUP="gpu-users", MODE="0660"
KERNEL=="nvidia-uvm", GROUP="gpu-users", MODE="0660"
```

Then create the group and reload the rules:

```bash
sudo groupadd gpu-users
sudo usermod -aG gpu-users inference-user
sudo udevadm control --reload-rules
sudo udevadm trigger
```
Now only members of gpu-users can access the GPU. Your inference container runs as inference-user; other system users can't enumerate or use the GPU.
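You can verify the udev rule took effect by checking the device mode bits. A sketch of that check, taking the `st_mode` you'd get from `os.stat("/dev/nvidia0")` (the function name is mine):

```python
import stat

def nvidia_device_mode_ok(st_mode: int) -> bool:
    """True if the device is a character device with 0660 perms
    and no world read/write access."""
    is_char_dev = stat.S_ISCHR(st_mode)
    perms_ok = stat.S_IMODE(st_mode) == 0o660
    world_access = st_mode & (stat.S_IROTH | stat.S_IWOTH)
    return is_char_dev and perms_ok and not world_access
```

On the host: `nvidia_device_mode_ok(os.stat("/dev/nvidia0").st_mode)` should return True after the udev reload.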
If you run inference directly (without Docker), systemd unit hardening is critical:
```ini
# /etc/systemd/system/inference.service
[Unit]
Description=LLM Inference API
After=network-online.target nvidia-persistenced.service
Wants=network-online.target

[Service]
Type=exec
User=inference
Group=inference
ExecStart=/opt/inference/venv/bin/uvicorn main:app --host 127.0.0.1 --port 8000

# Filesystem restrictions
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/inference/cache
PrivateTmp=true
# PrivateDevices must stay false: the service needs GPU device access
PrivateDevices=false
DeviceAllow=/dev/nvidia0 rw
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw

# Network restrictions
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
IPAddressDeny=any
IPAddressAllow=10.0.0.0/24
IPAddressAllow=127.0.0.0/8

# Privilege restrictions
NoNewPrivileges=true
CapabilityBoundingSet=
AmbientCapabilities=
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectControlGroups=true

# Process restrictions
MemoryMax=32G
# CPUQuota=400% limits the service to 4 CPU cores
CPUQuota=400%
TasksMax=128

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=inference

[Install]
WantedBy=multi-user.target
```
The key directives:
- `ProtectSystem=strict` makes the entire filesystem read-only except explicitly allowed paths
- `DeviceAllow` whitelists only the specific NVIDIA devices needed
- `IPAddressDeny=any` + `IPAddressAllow` restricts network access at the systemd level (defense in depth with the firewall)
- `MemoryMax=32G` prevents the inference process from consuming all system RAM (important when GPU memory spills to system RAM)
- `NoNewPrivileges=true` prevents privilege escalation

Every inference request should be logged with enough context to answer: who asked what, when, and what was the response?
On the application side (your [FastAPI inference proxy](https://tostupidtooquit.com/blog/self-hosting-llms-fastapi)):
```python
import json
import logging
import time

audit_logger = logging.getLogger("audit")
audit_handler = logging.FileHandler("/var/log/inference/audit.jsonl")
audit_handler.setFormatter(logging.Formatter("%(message)s"))
audit_logger.addHandler(audit_handler)
audit_logger.setLevel(logging.INFO)


def log_inference_request(
    api_key_hash: str,
    model: str,
    message_count: int,
    input_tokens: int,
    output_tokens: int,
    duration_ms: float,
    source_ip: str,
):
    audit_logger.info(json.dumps({
        "timestamp": time.time(),
        "event": "inference_request",
        "api_key_hash": api_key_hash[:16],  # Truncated hash for identity
        "model": model,
        "message_count": message_count,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "duration_ms": round(duration_ms, 2),
        "source_ip": source_ip,
    }))
```
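Because the audit log is JSONL, summaries are one `json.loads` away. For example, total tokens per (truncated) API key hash, assuming the field names emitted above:

```python
import json
from collections import defaultdict

def tokens_per_key(lines) -> dict:
    """Aggregate input+output tokens per truncated API key hash
    from audit.jsonl lines."""
    totals = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        if event.get("event") != "inference_request":
            continue
        totals[event["api_key_hash"]] += (
            event["input_tokens"] + event["output_tokens"]
        )
    return dict(totals)
```

Run it over `open("/var/log/inference/audit.jsonl")` to spot keys with anomalous usage.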
Never log the actual prompt or response in the audit log by default. If you're self-hosting for data sovereignty, logging the content defeats the purpose. Log metadata only. Enable content logging only in debug mode with explicit opt-in.

On the system side, use auditd to track file access to model weights:
```
# /etc/audit/rules.d/inference.rules

# Log all access to model files
-w /opt/models/ -p rwa -k model_access

# Log all Docker/container commands
-w /usr/bin/docker -p x -k docker_commands
-w /usr/bin/containerd -p x -k container_runtime

# Log GPU device access
-w /dev/nvidia0 -p rw -k gpu_access
```

Load the rules and restart the daemon:

```bash
sudo augenrules --load
sudo systemctl restart auditd
```
Query audit logs:
```bash
# Who accessed model files in the last hour?
ausearch -k model_access -ts recent

# What Docker commands were run?
ausearch -k docker_commands -ts today
```
If you're running multiple GPU servers (a cluster for distributed inference or training), segment the network:
```
┌─────────────────────────────────────────────────┐
│                 Public Internet                 │
└──────────────────────┬──────────────────────────┘
                       │
              ┌────────▼────────┐
              │  Reverse Proxy  │  (nginx/Caddy)
              │    10.0.0.2     │  Port 443 only
              └────────┬────────┘
                       │
         ┌─────────────▼─────────────┐
         │    Application Network    │
         │       10.0.1.0/24         │
         │   (API servers, queues)   │
         └─────────────┬─────────────┘
                       │
         ┌─────────────▼─────────────┐
         │        GPU Network        │
         │       10.0.2.0/24         │
         │    (inference servers)    │
         │    NO internet access     │
         └───────────────────────────┘
```
The GPU network has no default route to the internet. It can only communicate with the application network for receiving inference requests and sending responses. Model weights are loaded during deployment (pulled from internal storage), not downloaded at runtime.
For multi-node inference (tensor parallelism across GPUs on different machines), the GPU-to-GPU communication uses NCCL over a dedicated high-speed network (InfiniBand or RoCE). This fabric should be on a completely separate VLAN with no routing to any other network.
| Framework | Scope | GPU-Specific Guidance | Automation | Best For |
|-----------|-------|---------------------|------------|----------|
| [CIS Benchmarks](https://www.cisecurity.org/cis-benchmarks) | OS-level hardening | No | CIS-CAT scanner | General Linux hardening baseline |
| [NIST SP 800-53](https://csf.tools/reference/nist-sp-800-53/) | Federal/compliance | No | OSCAP profiles | Regulated industries, compliance |
| [DISA STIG](https://public.cyber.mil/stigs/) | DoD requirements | No | STIG Viewer + scripts | Government/defense contractors |
| [Docker CIS Benchmark](https://www.cisecurity.org/benchmark/docker) | Container security | Partial | docker-bench-security | Docker deployments |
| [NVIDIA Security Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/security-best-practices.html) | GPU container security | Yes | Manual | GPU workloads specifically |
My recommendation: start with the CIS Benchmark for your distro (Ubuntu, RHEL, etc.), layer on the Docker CIS Benchmark if using containers, and add the NVIDIA security recommendations for GPU-specific hardening. NIST and STIG are necessary for compliance contexts but overkill for most startups.
Run [docker-bench-security](https://github.com/docker/docker-bench-security) to check your Docker configuration:
```bash
git clone https://github.com/docker/docker-bench-security.git
cd docker-bench-security
sudo sh docker-bench-security.sh
```
It will output a list of PASS/WARN/FAIL checks. Fix everything marked WARN or FAIL before exposing the server to production traffic.
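If you run the benchmark in CI, a quick tally of the result lines can gate deployment. A sketch that assumes the `[PASS]`/`[WARN]` line prefixes the tool prints (check counts and failure policy are yours to tune):

```python
import re
from collections import Counter

def tally(output: str) -> Counter:
    """Count [PASS]/[WARN]/[FAIL]/[INFO]/[NOTE] markers in the output."""
    return Counter(re.findall(r"\[(PASS|WARN|FAIL|INFO|NOTE)\]", output))

def ready_for_production(output: str) -> bool:
    """Gate: no WARN or FAIL findings allowed."""
    counts = tally(output)
    return counts["WARN"] == 0 and counts["FAIL"] == 0
```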
If you're running [self-hosted LLMs with FastAPI](https://tostupidtooquit.com/blog/self-hosting-llms-fastapi), everything in this post applies to that infrastructure. And for the agentic systems that build on top of local inference, [Agentic AI: Multi-Agent Systems](https://tostupidtooquit.com/blog/agentic-ai-multi-agent-systems) covers the orchestration layer — but none of it matters if the servers running the agents aren't locked down.