Hardening Linux servers running GPU inference and training workloads. Covers SSH lockdown, Docker rootless mode, NVIDIA driver security, systemd sandboxing, audit logging, and network segmentation for AI infrastructure.
Tyler McDaniel
AI Engineer & IBM Business Partner
GPU servers running inference workloads are some of the most valuable targets on any network. They have expensive hardware, they process sensitive data (the whole reason you're self-hosting instead of using an API), and they're often set up by ML engineers who optimize for CUDA drivers, not security baselines. I've audited GPU servers at three organizations and found the same pattern every time: root SSH with password auth, Docker running as root with the socket exposed to all users, no firewall rules beyond the cloud provider's security group, and NVIDIA drivers installed from a random .run file downloaded over HTTP.
Linux server hardening for AI workloads follows the same principles as any server hardening — least privilege, defense in depth, audit everything — but the GPU stack introduces specific attack surface that generic hardening guides miss.
Key-only SSH authentication is the floor, not the ceiling.
Edit /etc/ssh/sshd_config:
```
# Disable password and root login
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
UsePAM yes

# Restrict to specific users/groups
AllowGroups ssh-users

# Use only strong key exchange and ciphers
KexAlgorithms sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com

# Limit authentication attempts
MaxAuthTries 3
LoginGraceTime 30

# Disable X11, TCP, and agent forwarding by default
X11Forwarding no
AllowTcpForwarding no
AllowAgentForwarding no

# Log verbosely
LogLevel VERBOSE

# Close idle connections
ClientAliveInterval 300
ClientAliveCountMax 2
```
After editing:
```bash
# Validate config before restarting
sudo sshd -t

# Restart
sudo systemctl restart sshd
```
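If you manage more than a couple of hosts, a quick sanity check of these directives can be scripted. A minimal sketch in Python — the directive names are real OpenSSH settings, but the checker itself is illustrative and no substitute for `sshd -t`:

```python
# Illustrative audit of an sshd_config against a hardened baseline.
# Parses "Directive value" lines, skipping blanks and comments.

REQUIRED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
    "maxauthtries": "3",
}

def parse_sshd_config(text: str) -> dict:
    """Return {lowercased directive: value} for non-comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key.lower()] = value.strip()
    return settings

def audit(text: str) -> list[str]:
    """Return the directives that differ from the hardened baseline."""
    settings = parse_sshd_config(text)
    return [k for k, want in REQUIRED.items() if settings.get(k) != want]
```

Run it against `/etc/ssh/sshd_config` on each host; an empty list means the checked directives match the baseline.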
Certificate-based SSH is the next level. Instead of distributing public keys to every server, you sign short-lived certificates with a CA. The server trusts the CA; clients present certificates that expire in 8-24 hours. When someone leaves the team, you don't have to scrub their key from 50 servers — their certificates just stop being issued.
```bash
# On your CA machine:
# Sign a user's key for 8 hours
ssh-keygen -s /path/to/ca_key -I "tyler@company" -n tyler -V +8h ~/.ssh/id_ed25519.pub

# On the server: trust the CA
echo "TrustedUserCAKeys /etc/ssh/ca.pub" >> /etc/ssh/sshd_config
```
[Smallstep](https://smallstep.com/docs/ssh/) and [Teleport](https://goteleport.com/) automate this. For teams of 3+, certificate-based SSH is worth the setup cost.
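The operational win is that validity is a property of the certificate itself, not of key files scattered across servers. A toy illustration of the window arithmetic (the timestamps are made up):

```python
from datetime import datetime, timedelta

# Toy model of a short-lived cert: signed with -V +8h, it is only
# accepted inside an 8-hour window and expires on its own.

def is_valid(now: datetime, issued_at: datetime,
             validity: timedelta = timedelta(hours=8)) -> bool:
    return issued_at <= now <= issued_at + validity

issued = datetime(2025, 1, 6, 9, 0)                      # signed at 09:00
assert is_valid(datetime(2025, 1, 6, 15, 0), issued)     # same afternoon: ok
assert not is_valid(datetime(2025, 1, 7, 9, 0), issued)  # next morning: expired
```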
GPU servers typically expose an inference API (port 8000, 8080, 11434 for Ollama) plus SSH. Everything else should be blocked.
Using ufw (Uncomplicated Firewall):
```bash
# Reset to clean state
sudo ufw reset

# Default deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH from specific IP range only (your office/VPN)
sudo ufw allow from 10.0.0.0/24 to any port 22 proto tcp

# Allow inference API from internal network only
sudo ufw allow from 10.0.0.0/24 to any port 8000 proto tcp
sudo ufw allow from 10.0.0.0/24 to any port 8080 proto tcp

# If using Prometheus node exporter for monitoring
sudo ufw allow from 10.0.1.0/24 to any port 9100 proto tcp

# Enable
sudo ufw enable
sudo ufw status verbose
```
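The same subnet restriction can be repeated at the application layer as defense in depth — if a firewall rule is ever fat-fingered, the app still refuses. A sketch using Python's `ipaddress` module; the `ALLOWED_NETS` list mirrors the ufw rules above, and the function name is my own:

```python
import ipaddress

# App-level mirror of the firewall's source-subnet restriction.
# These networks match the ufw rules; adjust to your topology.
ALLOWED_NETS = [
    ipaddress.ip_network("10.0.0.0/24"),
    ipaddress.ip_network("127.0.0.0/8"),
]

def source_allowed(ip: str) -> bool:
    """True if the client IP falls inside an allowed subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_NETS)
```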
For production, use iptables or nftables directly for more control:
```
# /etc/iptables/rules.v4
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]

# Allow established connections
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow loopback
-A INPUT -i lo -j ACCEPT

# Allow SSH from VPN only
-A INPUT -p tcp --dport 22 -s 10.0.0.0/24 -j ACCEPT

# Allow inference API from internal network
-A INPUT -p tcp --dport 8000 -s 10.0.0.0/24 -j ACCEPT
-A INPUT -p tcp --dport 8080 -s 10.0.0.0/24 -j ACCEPT

# Allow ICMP (ping) from internal
-A INPUT -p icmp -s 10.0.0.0/24 -j ACCEPT

# Log and drop everything else
-A INPUT -j LOG --log-prefix "IPT_DROP: " --log-level 4
-A INPUT -j DROP
COMMIT
```
Never expose inference ports to the public internet. Even with API key authentication, an exposed LLM endpoint is an invitation for prompt injection attacks, resource exhaustion, and data exfiltration. Put it behind a reverse proxy (nginx, Caddy) with rate limiting, or behind a VPN.
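If you terminate requests in your own proxy code, rate limiting can be as simple as a token bucket per client. A hedged sketch — the class name and the limits are placeholders, not from any particular library:

```python
import time

class TokenBucket:
    """Simple token bucket: allow a burst of `capacity` requests,
    then refill at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: float = 10, refill_per_sec: float = 2):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice you would keep one bucket per API key or source IP and return HTTP 429 when `allow()` is False.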
Docker is how most ML engineers deploy inference. [The NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) (nvidia-ctk) exposes GPUs to containers. The default configuration has significant security gaps.
Docker traditionally requires root. A container escape with root Docker gives the attacker root on the host. [Rootless Docker](https://docs.docker.com/engine/security/rootless/) runs the daemon as a regular user:
```bash
# Install rootless Docker
dockerd-rootless-setuptool.sh install

# Configure NVIDIA runtime for rootless mode
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker
```
Rootless Docker with GPU access landed in NVIDIA Container Toolkit 1.14+. It works. Use it.
The Docker socket (/var/run/docker.sock) is equivalent to root access. Anyone who can write to it can start a privileged container that mounts the host filesystem. Don't add users to the docker group casually.
```bash
# Check who has Docker access
getent group docker

# If someone shouldn't be there
sudo gpasswd -d username docker
```
For CI/CD pipelines that need to build containers, use a Docker-in-Docker setup or a remote Docker host rather than mounting the socket.
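When auditing group membership across a fleet, parsing `getent` output is handy. A small helper sketch (the function name is illustrative); `getent group` prints `name:passwd:gid:member,member`:

```python
def group_members(getent_line: str) -> list[str]:
    """Parse one line of `getent group` output into a member list."""
    fields = getent_line.strip().split(":")
    members = fields[3] if len(fields) > 3 else ""
    return [m for m in members.split(",") if m]
```

Feed it the output of `getent group docker` from each host and diff the result against your approved list.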
```yaml
# docker-compose.yml hardening
services:
  inference:
    image: your-inference-image:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    read_only: true            # Read-only root filesystem
    tmpfs:                     # Writable temp directories as tmpfs
      - /tmp:size=1G
      - /var/tmp:size=512M
    security_opt:
      - no-new-privileges:true # Prevent privilege escalation
    cap_drop:
      - ALL                    # Drop all Linux capabilities
    cap_add:
      - SYS_NICE               # Needed for GPU priority scheduling
    user: "1000:1000"          # Run as non-root user
    networks:
      - internal               # Isolated network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  internal:
    internal: true             # No external access by default
```
Key settings:
- `read_only: true` prevents the container from writing to its filesystem. If an attacker gets code execution, they can't persist malware.
- `no-new-privileges` blocks setuid binaries and privilege escalation.
- `cap_drop: ALL` removes all Linux capabilities. GPU access doesn't need NET_RAW, SYS_ADMIN, or any of the other capabilities that enable container escapes.
- `SYS_NICE` is needed for CUDA's thread priority management. Without it, GPU scheduling can degrade.

The NVIDIA kernel module (nvidia.ko) runs with kernel privileges. A vulnerability in the driver is a direct path to root. This is not theoretical: NVIDIA has issued [security bulletins](https://www.nvidia.com/en-us/security/) for driver vulnerabilities that allow local privilege escalation.
Keep drivers updated. Check for security updates monthly:
```bash
# Check current driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# On Ubuntu, update via package manager (preferred over .run files)
sudo apt update
sudo apt install --only-upgrade nvidia-driver-550
```
Never install drivers from .run files in production. The runfile installer bypasses the package manager, doesn't receive security updates automatically, and can break on kernel updates. Use your distro's package manager or the [CUDA repository](https://developer.nvidia.com/cuda-downloads).
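The monthly check can be scripted by comparing the installed version (the `nvidia-smi` query above) against the minimum patched version from NVIDIA's latest bulletin. A sketch — any concrete version numbers should come from the actual bulletin, not from here:

```python
def version_tuple(v: str) -> tuple:
    """Parse a dotted driver version like '550.54.14' into ints."""
    return tuple(int(p) for p in v.strip().split("."))

def driver_up_to_date(installed: str, minimum: str) -> bool:
    """True if the installed driver meets the minimum patched version."""
    return version_tuple(installed) >= version_tuple(minimum)
```

Wire the first argument to the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and alert when it returns False.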
Restrict /dev/nvidia* access. By default, NVIDIA device files are world-readable. Limit access to users who need GPU:
```
# /etc/udev/rules.d/70-nvidia.rules
KERNEL=="nvidia*", GROUP="gpu-users", MODE="0660"
KERNEL=="nvidiactl", GROUP="gpu-users", MODE="0660"
KERNEL=="nvidia-uvm", GROUP="gpu-users", MODE="0660"
```

Then create the group and reload the rules:

```bash
sudo groupadd gpu-users
sudo usermod -aG gpu-users inference-user
sudo udevadm control --reload-rules
sudo udevadm trigger
```
Now only members of gpu-users can access the GPU. Your inference container runs as inference-user; other system users can't enumerate or use the GPU.
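You can verify the udev rule took effect by checking the device mode bits. A sketch of that check, taking the `st_mode` you'd get from `os.stat("/dev/nvidia0")` (the function name is mine):

```python
import stat

def nvidia_device_mode_ok(st_mode: int) -> bool:
    """True if the device is a character device with 0660 perms
    and no world read/write access."""
    is_char_dev = stat.S_ISCHR(st_mode)
    perms_ok = stat.S_IMODE(st_mode) == 0o660
    world_access = st_mode & (stat.S_IROTH | stat.S_IWOTH)
    return is_char_dev and perms_ok and not world_access
```

On the host: `nvidia_device_mode_ok(os.stat("/dev/nvidia0").st_mode)` should return True after the udev reload.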
If you run inference directly (without Docker), systemd unit hardening is critical:
```ini
# /etc/systemd/system/inference.service
[Unit]
Description=LLM Inference API
After=network-online.target nvidia-persistenced.service
Wants=network-online.target

[Service]
Type=exec
User=inference
Group=inference
ExecStart=/opt/inference/venv/bin/uvicorn main:app --host 127.0.0.1 --port 8000

# Filesystem restrictions
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/inference/cache
PrivateTmp=true
# PrivateDevices must stay false: the service needs GPU device access
PrivateDevices=false
DeviceAllow=/dev/nvidia0 rw
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw

# Network restrictions
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
IPAddressDeny=any
IPAddressAllow=10.0.0.0/24
IPAddressAllow=127.0.0.0/8

# Privilege restrictions
NoNewPrivileges=true
CapabilityBoundingSet=
AmbientCapabilities=
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectControlGroups=true

# Process restrictions
MemoryMax=32G
# CPUQuota=400% limits the service to 4 CPU cores
CPUQuota=400%
TasksMax=128

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=inference

[Install]
WantedBy=multi-user.target
```
The key directives:
- `ProtectSystem=strict` makes the entire filesystem read-only except explicitly allowed paths
- `DeviceAllow` whitelists only the specific NVIDIA devices needed
- `IPAddressDeny=any` + `IPAddressAllow` restricts network access at the systemd level (defense in depth with the firewall)
- `MemoryMax=32G` prevents the inference process from consuming all system RAM (important when GPU memory spills to system RAM)
- `NoNewPrivileges=true` prevents privilege escalation

Every inference request should be logged with enough context to answer: who asked what, when, and what was the response?
On the application side (your [FastAPI inference proxy](https://tostupidtooquit.com/blog/self-hosting-llms-fastapi)):
```python
import json
import logging
import time

audit_logger = logging.getLogger("audit")
audit_handler = logging.FileHandler("/var/log/inference/audit.jsonl")
audit_handler.setFormatter(logging.Formatter("%(message)s"))
audit_logger.addHandler(audit_handler)
audit_logger.setLevel(logging.INFO)


def log_inference_request(
    api_key_hash: str,
    model: str,
    message_count: int,
    input_tokens: int,
    output_tokens: int,
    duration_ms: float,
    source_ip: str,
):
    audit_logger.info(json.dumps({
        "timestamp": time.time(),
        "event": "inference_request",
        "api_key_hash": api_key_hash[:16],  # Truncated hash for identity
        "model": model,
        "message_count": message_count,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "duration_ms": round(duration_ms, 2),
        "source_ip": source_ip,
    }))
```
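Because the audit log is JSONL, summaries are one `json.loads` away. For example, total tokens per (truncated) API key hash, assuming the field names emitted above:

```python
import json
from collections import defaultdict

def tokens_per_key(lines) -> dict:
    """Aggregate input+output tokens per truncated API key hash
    from audit.jsonl lines."""
    totals = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        if event.get("event") != "inference_request":
            continue
        totals[event["api_key_hash"]] += (
            event["input_tokens"] + event["output_tokens"]
        )
    return dict(totals)
```

Run it over `open("/var/log/inference/audit.jsonl")` to spot keys with anomalous usage.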
Never log the actual prompt or response in the audit log by default. If you're self-hosting for data sovereignty, logging the content defeats the purpose. Log metadata only. Enable content logging only in debug mode with explicit opt-in.

On the system side, use auditd to track file access to model weights:
```
# /etc/audit/rules.d/inference.rules

# Log all access to model files
-w /opt/models/ -p rwa -k model_access

# Log all Docker/container commands
-w /usr/bin/docker -p x -k docker_commands
-w /usr/bin/containerd -p x -k container_runtime

# Log GPU device access
-w /dev/nvidia0 -p rw -k gpu_access
```

Load the rules and restart the daemon:

```bash
sudo augenrules --load
sudo systemctl restart auditd
```
Query audit logs:
```bash
# Who accessed model files in the last hour?
ausearch -k model_access -ts recent

# What Docker commands were run?
ausearch -k docker_commands -ts today
```
If you're running multiple GPU servers (a cluster for distributed inference or training), segment the network:
```
┌─────────────────────────────────────────────────┐
│                 Public Internet                 │
└──────────────────────┬──────────────────────────┘
                       │
              ┌────────▼────────┐
              │  Reverse Proxy  │  (nginx/Caddy)
              │    10.0.0.2     │  Port 443 only
              └────────┬────────┘
                       │
         ┌─────────────▼─────────────┐
         │    Application Network    │
         │       10.0.1.0/24         │
         │   (API servers, queues)   │
         └─────────────┬─────────────┘
                       │
         ┌─────────────▼─────────────┐
         │        GPU Network        │
         │       10.0.2.0/24         │
         │    (inference servers)    │
         │    NO internet access     │
         └───────────────────────────┘
```
The GPU network has no default route to the internet. It can only communicate with the application network for receiving inference requests and sending responses. Model weights are loaded during deployment (pulled from internal storage), not downloaded at runtime.
For multi-node inference (tensor parallelism across GPUs on different machines), the GPU-to-GPU communication uses NCCL over a dedicated high-speed network (InfiniBand or RoCE). This fabric should be on a completely separate VLAN with no routing to any other network.
| Framework | Scope | GPU-Specific Guidance | Automation | Best For |
|-----------|-------|---------------------|------------|----------|
| [CIS Benchmarks](https://www.cisecurity.org/cis-benchmarks) | OS-level hardening | No | CIS-CAT scanner | General Linux hardening baseline |
| [NIST SP 800-53](https://csf.tools/reference/nist-sp-800-53/) | Federal/compliance | No | OSCAP profiles | Regulated industries, compliance |
| [DISA STIG](https://public.cyber.mil/stigs/) | DoD requirements | No | STIG Viewer + scripts | Government/defense contractors |
| [Docker CIS Benchmark](https://www.cisecurity.org/benchmark/docker) | Container security | Partial | docker-bench-security | Docker deployments |
| [NVIDIA Security Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/security-best-practices.html) | GPU container security | Yes | Manual | GPU workloads specifically |
My recommendation: start with the CIS Benchmark for your distro (Ubuntu, RHEL, etc.), layer on the Docker CIS Benchmark if using containers, and add the NVIDIA security recommendations for GPU-specific hardening. NIST and STIG are necessary for compliance contexts but overkill for most startups.
Run [docker-bench-security](https://github.com/docker/docker-bench-security) to check your Docker configuration:
```bash
git clone https://github.com/docker/docker-bench-security.git
cd docker-bench-security
sudo sh docker-bench-security.sh
```
It will output a list of PASS/WARN/FAIL checks. Fix everything marked WARN or FAIL before exposing the server to production traffic.
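If you run the benchmark in CI, a quick tally of the result lines can gate deployment. A sketch that assumes the `[PASS]`/`[WARN]` line prefixes the tool prints (check counts and failure policy are yours to tune):

```python
import re
from collections import Counter

def tally(output: str) -> Counter:
    """Count [PASS]/[WARN]/[FAIL]/[INFO]/[NOTE] markers in the output."""
    return Counter(re.findall(r"\[(PASS|WARN|FAIL|INFO|NOTE)\]", output))

def ready_for_production(output: str) -> bool:
    """Gate: no WARN or FAIL findings allowed."""
    counts = tally(output)
    return counts["WARN"] == 0 and counts["FAIL"] == 0
```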
If you're running [self-hosted LLMs with FastAPI](https://tostupidtooquit.com/blog/self-hosting-llms-fastapi), everything in this post applies to that infrastructure. And for the agentic systems that build on top of local inference, [Agentic AI: Multi-Agent Systems](https://tostupidtooquit.com/blog/agentic-ai-multi-agent-systems) covers the orchestration layer — but none of it matters if the servers running the agents aren't locked down.