docs: update architecture to reflect Proxmox migration and correct network configurations

Revised multiple documents to align with the migration from Incus to Proxmox VE 8.4.10. Updated hypervisor, IP ranges, subnet details, and NAT configurations across all relevant files. Marked Incus sections as historical for clarity. Added AI-Stack setup guide for Proxmox LXC.
2026-03-06 13:50:56 +01:00
parent 09b0b1a462
commit 5ab0c9524e
26 changed files with 749 additions and 40 deletions
@@ -0,0 +1,431 @@
+---
+type: Guide
+status: ACTIVE
+owner: DevOps Engineer
+---
+# Installation Guide: Local AI Stack on Zora (MS-R1)
+
+> **Goal:** Ollama + Open WebUI as an isolated Proxmox LXC container on Zora.
+> Fully local, privacy-compliant, reachable via the Pangolin tunnel.
+> **Date:** 2026-03-06 | **Updated:** 2026-03-06 (Incus → Proxmox)
+
+---
+
+## Hardware Profile of the CIX P1 (CP8180): What's Inside?
+
+| Component | Detail | Relevance for AI |
+|---|---|---|
+| **CPU** | 12 cores: 4x Cortex-X4 (fast, ~2.6 GHz) + 4x A720 (medium) + 4x A520 (slow) | Tri-cluster design, so a generous core allocation makes sense |
+| **GPU** | Arm Immortalis-G720 MC10 | Vulkan 1.3; experimental GPU acceleration possible |
+| **RAM** | 64 GB LPDDR5 5500 MHz | Even 70B models (about 40 GB, see Phase 4) fit entirely in RAM |
+| **NPU** | CIX P1 integrated NPU | ⚠️ Currently no Ollama/llama.cpp support; a future option |
+| **OS** | Proxmox VE 8.4.10 | Debian-based hypervisor, ARM64-native |
+
+### CPU Cores in Detail
+
+```
+Cores  0– 3  →  Cortex-A520  (efficiency / slow)
+Cores  4– 7  →  Cortex-A720  (balanced / medium)
+Cores  8–11  →  Cortex-X4    (performance / fast)
+```
+
+The AI container gets 10 cores (2–11); Proxmox keeps cores 0–1 for host operation.
+Setting the CPU governor to `performance` (Phase 1.1 applies it to all 12 cores) maximizes inference throughput.
+
+---
+
+## Architecture Decision: Why a Separate Proxmox LXC Container?
+
+```
+Zora — Proxmox 8.4.10 (10.0.0.20)
+├── VM  102  gitea-runner    (10.0.0.23)  ← Gitea CI/CD Runner
+├── VM  110  meldestelle-host (10.0.0.50) ← Docker App-Stack
+├── LXC 100  pangolin-client              ← Pangolin Tunnel
+├── LXC 101  gitea           (10.0.0.22)  ← Gitea Server
+├── LXC 103  immich                       ← Immich
+└── LXC 111  ai-stack        (10.0.0.60)  ← Ollama + Open WebUI  ← NEW
+```
+
+**Rationale:** Models take 5–40 GB each, so disk and RAM usage grows quickly as models are added.
+An isolated LXC container with its own resource limits protects the app stack from the AI workload's RAM and CPU demands.
+Updates are independent: Ollama models live in the container volume, not in the Git repo.
+
+---
+
+## Phase 1: Prepare the Proxmox Host
+
+### 1.1 Set the CPU Governor to Performance (on the Proxmox node)
+
+```bash
+# SSH into the Proxmox node
+ssh root@10.0.0.20
+# or: ssh root@pve.mo-code.at
+
+# Install cpufrequtils
+apt-get install -y cpufrequtils
+
+# Set all 12 cores to the performance governor
+for i in $(seq 0 11); do
+    cpufreq-set -c "$i" -g performance
+done
+
+# Verify
+cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+# Expected output: 12x "performance"
+
+# Make the setting persistent across reboots
+tee /etc/systemd/system/cpu-performance.service > /dev/null <<'EOF'
+[Unit]
+Description=Set CPU Governor to Performance
+After=multi-user.target
+
+[Service]
+Type=oneshot
+ExecStart=/bin/bash -c 'for i in $(seq 0 11); do cpufreq-set -c $i -g performance; done'
+RemainAfterExit=yes
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable --now cpu-performance.service
+```
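The verification step above prints one governor per core and leaves the counting to the reader. A minimal sketch of an automated check follows; the function name `count_non_performance` and its optional directory argument are assumptions of this example, not part of the original setup.

```shell
# count_non_performance [ROOT]
# Prints how many cpufreq policies under ROOT are NOT set to "performance".
# ROOT defaults to the real sysfs CPU directory; the parameter exists only so
# the function can be tried against a test directory (hypothetical convenience).
count_non_performance() (
    root="${1:-/sys/devices/system/cpu}"
    count=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # skip an unmatched glob or unreadable entries
        gov=$(cat "$f")
        [ "$gov" = "performance" ] || count=$((count + 1))
    done
    echo "$count"
)
```

Run on the Proxmox node without arguments, it should print `0` once all cores use the `performance` governor.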
+
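+### 1.2 Enable Huge Pages (better RAM utilization for large models)
+
+```bash
+# On the Proxmox node:
+# Reserves 512 huge pages (1 GiB at a 2 MiB huge page size; check Hugepagesize in /proc/meminfo).
+# Note: statically reserved huge pages only help software that requests them explicitly.
+echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf
+sysctl -p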
+```
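`vm.nr_hugepages` is a page count, so the amount of RAM it takes out of the general pool depends on the kernel's huge page size. The following sketch computes that amount; the helper name `hugepages_reserved_mib` and its optional file argument are assumptions made for illustration.

```shell
# hugepages_reserved_mib PAGES [MEMINFO]
# Prints how many MiB PAGES huge pages occupy, based on the Hugepagesize
# line of MEMINFO (default: /proc/meminfo).
hugepages_reserved_mib() (
    pages="$1"
    meminfo="${2:-/proc/meminfo}"
    # Hugepagesize is reported in kB, e.g. "Hugepagesize:       2048 kB"
    size_kb=$(awk '/^Hugepagesize:/ {print $2}' "$meminfo")
    if [ -z "$size_kb" ]; then
        echo "Hugepagesize not found in $meminfo" >&2
        exit 1                           # leaves only the function's subshell
    fi
    echo $(( pages * size_kb / 1024 ))
)
```

For example, `hugepages_reserved_mib 512` shows how much of the 64 GB the setting above would reserve on this host.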
+
+---
+
+## Phase 2: Create the Proxmox LXC Container
+
+> ℹ️ **All commands run on the Proxmox node** (`ssh root@10.0.0.20`)
+> Alternatively: Proxmox web UI at `https://pve.mo-code.at:8006`
+
+### 2.1 Download the Debian 12 Template
+
+```bash
+# Refresh the template list
+pveam update
+
+# Find and download the Debian 12 ARM64 template
+pveam available --section system | grep -E 'debian-12.*arm64'
+pveam download local debian-12-standard_12.7-1_arm64.tar.zst
+```
+
+### 2.2 Create the Container (CT 111)
+
+```bash
+pct create 111 local:vztmpl/debian-12-standard_12.7-1_arm64.tar.zst \
+  --hostname ai-stack \
+  --arch arm64 \
+  --cores 10 \
+  --memory 49152 \
+  --swap 4096 \
+  --rootfs local-lvm:200 \
+  --net0 name=eth0,bridge=vmbr0,ip=10.0.0.60/24,gw=10.0.0.138,firewall=1 \
+  --nameserver 10.0.0.138 \
+  --searchdomain mo-code.at \
+  --unprivileged 1 \
+  --features nesting=1,keyctl=1 \
+  --password
+
+# Start the container
+pct start 111
+
+# Check the status
+pct status 111
+pct list
+```
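After creation, it can be useful to confirm that the container really carries the intended settings. `pct config 111` prints the configuration as `key: value` lines; the sketch below extracts a single value from such output. The helper name `conf_value` is an assumption of this example.

```shell
# conf_value KEY
# Reads "key: value" lines (the format printed by `pct config`) from stdin
# and prints the value of the first line whose key equals KEY.
conf_value() {
    awk -v key="$1" -F': ' '$1 == key { print substr($0, length(key) + 3); exit }'
}
```

On the Proxmox node, `pct config 111 | conf_value memory` should then print `49152`, and `pct config 111 | conf_value hostname` should print `ai-stack`.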
+
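+### 2.3 Configure CPU Pinning (performance cores for AI)
+
+```bash
+# Limit the container to 10 cores' worth of CPU time and set its scheduler weight
+pct set 111 --cpulimit 10 --cpuunits 1024
+
+# cpulimit/cpuunits do not pin cores. To bind the container to cores 2–11
+# (cores 8–11 = X4 performance cores), add a raw LXC cpuset entry:
+echo "lxc.cgroup2.cpuset.cpus: 2-11" >> /etc/pve/lxc/111.conf
+
+# Restart the container so the cpuset takes effect, then verify the assignment
+pct reboot 111
+pct cpusets
+
+# Alternatively, set the CPU limit via the web UI:
+# CT 111 → Resources → Cores → CPU limit: 10
+```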
+
+### 2.4 Log In to the Container and Do the Base Setup
+
+```bash
+# Directly via pct:
+pct enter 111
+
+# Inside the container:
+apt-get update && apt-get upgrade -y
+apt-get install -y curl wget git htop nano ca-certificates gnupg lsb-release
+
+# Install Docker (for Open WebUI)
+curl -fsSL https://get.docker.com | sh
+systemctl enable --now docker
+
+# Create a user
+useradd -m -s /bin/bash aiuser
+usermod -aG docker aiuser
+```
+
+---
+
+## Phase 3: Install and Tune Ollama
+
+### 3.1 Install Ollama
+
+```bash
+# Inside the container (pct enter 111):
+curl -fsSL https://ollama.com/install.sh | sh
+
+# ARM64 is detected automatically
+ollama --version
+```
+
+### 3.2 Configure the Ollama Service (performance tuning)
+
+```bash
+mkdir -p /etc/systemd/system/ollama.service.d/
+cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
+[Service]
+# Listen on all interfaces (not just localhost)
+Environment="OLLAMA_HOST=0.0.0.0:11434"
+
+# Note: the thread count is not an Ollama server environment variable. It is the
+# per-model option num_thread (Modelfile: PARAMETER num_thread 10, or API "options").
+
+# Keep models in RAM for 24h (avoids constant reloading)
+Environment="OLLAMA_KEEP_ALIVE=24h"
+
+# Keep up to 2 models loaded at the same time
+Environment="OLLAMA_MAX_LOADED_MODELS=2"
+
+# Enable Flash Attention (faster processing of long contexts)
+Environment="OLLAMA_FLASH_ATTENTION=1"
+
+# Raise the default context length for longer conversations
+Environment="OLLAMA_CONTEXT_LENGTH=8192"
+EOF
+
+systemctl daemon-reload
+systemctl restart ollama
+systemctl status ollama
+```
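Independently of the service settings above, Ollama's REST API accepts per-request `options`, and `num_thread` is one of them. The sketch below builds a request body for `POST /api/generate` that asks for a specific thread count. The function name `generate_payload` is an assumption of this example, and it does no JSON escaping, so it only suits simple prompts.

```shell
# generate_payload MODEL PROMPT THREADS
# Prints a JSON body for Ollama's /api/generate endpoint.
# Assumption: MODEL and PROMPT contain no double quotes, backslashes or
# newlines, because this sketch performs no JSON escaping.
generate_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_thread":%d}}\n' \
        "$1" "$2" "$3"
}

# Example call against the container from this guide (commented out because it
# needs a running Ollama instance with the model already pulled):
# curl -s http://10.0.0.60:11434/api/generate \
#      -d "$(generate_payload llama3.1:8b 'Say hello' 10)"
```

Comparing the response time for different `THREADS` values is one way to find out whether using all 10 assigned cores actually helps on the mixed-cluster CPU.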
+
+### 3.3 Vulkan GPU Acceleration (Immortalis-G720)
+
+> ⚠️ **Experimental:** the Vulkan drivers for the CIX P1 + Immortalis-G720 are not yet
+> fully in the mainline kernel. Start without Vulkan first and retrofit it later.
+
+```bash
+# Check whether a Vulkan device is visible
+apt-get install -y vulkan-tools
+vulkaninfo --summary 2>/dev/null | grep -i "GPU\|device\|driver"
+
+# If Vulkan is available, add to override.conf (experimental Vulkan backend in
+# recent Ollama releases; check the documentation for the installed version):
+# Environment="OLLAMA_VULKAN=1"
+# GPU layer offload is the per-model option num_gpu; OLLAMA_GPU_LAYERS is not an Ollama setting.
+# → Then restart ollama and test (ollama ps shows the CPU/GPU split)
+```
+
+---
+
+## Phase 4: Download Models
+
+### Recommended Models for 64 GB ARM64
+
+```bash
+# Inside the container:
+ollama pull llama3.1:8b           # 5 GB    fast all-rounder, decent German
+ollama pull qwen2.5-coder:14b     # 9 GB    best code quality for IDEA
+ollama pull nomic-embed-text      # 300 MB  required for RAG/embeddings
+
+# Optional (if you want more power):
+ollama pull llama3.1:70b          # 40 GB   maximum quality, fits in 64 GB RAM (container limit: 48 GB)
+ollama pull qwen2.5:32b           # 20 GB   good balance of quality and speed
+
+# Test (German prompt to check German-language output):
+ollama run llama3.1:8b "Erkläre mir Spring Boot in einem Satz auf Deutsch"
+```
+
+### Model Selection Guide
+
+| Model | RAM | Speed | Quality | Recommendation |
+|---|---|---|---|---|
+| `llama3.1:8b` | 5 GB | ⚡⚡⚡ | ★★★ | Daily chat, fast answers |
+| `qwen2.5-coder:14b` | 9 GB | ⚡⚡ | ★★★★ | **IDEA integration, Kotlin/Java code** |
+| `qwen2.5:32b` | 20 GB | ⚡ | ★★★★★ | In-depth analysis, architecture questions |
+| `llama3.1:70b` | 40 GB | 🐢 | ★★★★★ | Maximum quality, for requests that can wait |
+| `nomic-embed-text` | 300 MB | ⚡⚡⚡ | RAG | **Required for docs RAG** |
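The RAM column matters because the container is limited to 49152 MiB and Ollama may keep two models loaded at once. The following sketch checks whether a combination of models fits. The function name `fits_in_memory` and the 4096 MiB reserve for the OS, Docker and Open WebUI are assumptions of this example, not measured values, and 1 GB is treated as 1024 MiB as a rough approximation.

```shell
# fits_in_memory LIMIT_MIB SIZE_GB...
# Succeeds (exit 0) if the listed model sizes plus a reserve fit into LIMIT_MIB.
# RESERVE_MIB (default 4096) is a hypothetical allowance for everything
# that is not a model; sizes must be whole numbers of GB.
fits_in_memory() (
    limit_mib="$1"; shift
    total_mib="${RESERVE_MIB:-4096}"
    for gb in "$@"; do
        total_mib=$(( total_mib + gb * 1024 ))
    done
    echo "needed: ${total_mib} MiB, limit: ${limit_mib} MiB"
    [ "$total_mib" -le "$limit_mib" ]
)
```

With the sizes from the table, `fits_in_memory 49152 5 9` succeeds, while `fits_in_memory 49152 40 9` fails: under these assumptions the 70B model fits only when it is the sole model loaded.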
+
+---
+
+## Phase 5: Install Open WebUI
+
+```bash
+# Inside the container (pct enter 111):
+docker run -d \
+  --name open-webui \
+  --restart always \
+  -p 3001:8080 \
+  -v open-webui:/app/backend/data \
+  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
+  --add-host=host.docker.internal:10.0.0.60 \
+  ghcr.io/open-webui/open-webui:main
+
+# Verify:
+docker ps
+curl http://localhost:3001
+```
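Open WebUI can take a while to answer after the container starts, so a single `curl` right after `docker run` may fail even though nothing is wrong. A minimal retry helper is sketched below; the name `retry` and its argument order are assumptions of this example.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Returns non-zero if every attempt failed.
retry() (
    attempts="$1"; delay="$2"; shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            exit 1                       # give up; leaves only the function's subshell
        fi
        n=$((n + 1))
        sleep "$delay"
    done
)
```

Inside the container, `retry 30 2 curl -fsS -o /dev/null http://localhost:3001` would then wait up to roughly a minute for the web UI to respond.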
+
+### 5.1 Open WebUI RAG Configuration
+
+```
+Browser: http://10.0.0.60:3001
+
+1. Create the admin account (first login)
+2. Settings → Admin Panel → Connections
+   → Ollama URL: http://10.0.0.60:11434 ✓
+
+3. Settings → Admin Panel → Documents
+   → Embedding Model: nomic-embed-text
+   → Chunk Size: 1500
+   → Chunk Overlap: 150
+
+4. Workspace → Documents → Upload
+   → Upload your /docs/**/*.md files
+   → In particular: 04_Agents/Playbooks/, 01_Architecture/adr/
+```
+
+---
+
+## Phase 6: Configure the Pangolin Route
+
+| Route | Target | Port | Visibility |
+|---|---|---|---|
+| `ai.mo-code.at` | `10.0.0.60` | `3001` | Internal / VPN only |
+
+> 🔒 **Security:** NEVER expose Open WebUI publicly without authentication.
+> Restrict Pangolin access to VPN only, or protect it with Basic Auth.
+
+---
+
+## Phase 7: IntelliJ IDEA Integration
+
+### Option A: Continue.dev Plugin (recommended for code completion)
+
+```
+1. IDEA → Settings → Plugins → install "Continue"
+2. Open the Continue configuration (~/.continue/config.json):
+```
+
+```json
+{
+  "models": [
+    {
+      "title": "Zora-Coder (qwen2.5)",
+      "provider": "ollama",
+      "model": "qwen2.5-coder:14b",
+      "apiBase": "http://10.0.0.60:11434"
+    },
+    {
+      "title": "Zora-Chat (llama3.1)",
+      "provider": "ollama",
+      "model": "llama3.1:8b",
+      "apiBase": "http://10.0.0.60:11434"
+    }
+  ],
+  "tabAutocompleteModel": {
+    "title": "Zora-Autocomplete",
+    "provider": "ollama",
+    "model": "qwen2.5-coder:14b",
+    "apiBase": "http://10.0.0.60:11434"
+  },
+  "embeddingsProvider": {
+    "provider": "ollama",
+    "model": "nomic-embed-text",
+    "apiBase": "http://10.0.0.60:11434"
+  }
+}
+```
+
+```
+3. In IDEA:
+   - Ctrl+I → open the chat (inline questions in code)
+   - Ctrl+Shift+I → enable tab autocomplete
+   - All data stays on Zora, with no cloud contact
+```
+
+### Option B: JetBrains AI Assistant with a Local Model
+
+```
+Settings → Tools → AI Assistant
+→ "Use custom AI provider"
+→ Endpoint: http://10.0.0.60:11434/v1
+→ Model: qwen2.5-coder:14b
+→ API Key: ollama (any string)
+```
+
+---
+
+## Future: NPU Acceleration
+
+The CIX P1 has an integrated NPU that is currently **not supported by Ollama/llama.cpp**.
+
+**Roadmap:**
+- `llama.cpp` continues to develop its OpenCL/Vulkan backends → the Immortalis-G720 stands to benefit
+- CIX P1 NPU drivers would have to be released as open source by CIX Technology
+- **Recommendation:** put the system into operation without the NPU and retrofit NPU support once it is available
+
+**Monitoring:**
+- https://github.com/ollama/ollama/issues (Filter: "vulkan", "arm64")
+- https://github.com/ggml-org/llama.cpp/issues (Filter: "vulkan")
+
+---
+
+## Quick Reference: Important Commands
+
+```bash
+# Manage the container (on the Proxmox node: ssh root@10.0.0.20)
+pct start 111
+pct stop 111
+pct enter 111
+pct status 111
+
+# Manage models (inside the container)
+ollama list                        # Installed models
+ollama pull <model>                # Download a new model
+ollama rm <model>                  # Delete a model
+ollama ps                          # Running models
+
+# Logs
+journalctl -u ollama -f            # Ollama logs, live
+docker logs -f open-webui          # Open WebUI logs
+
+# Check performance
+htop                               # CPU/RAM usage
+ollama ps                          # Which model is loaded, and its RAM usage
+```
+
+---
+
+## Network Overview After This Setup
+
+```
+Zora — Proxmox 8.4.10 (10.0.0.20)
+├── VM  102  gitea-runner     10.0.0.23   Gitea CI/CD Runner
+├── VM  110  meldestelle-host 10.0.0.50   Docker App-Stack
+├── LXC 100  pangolin-client              Pangolin Tunnel
+├── LXC 101  gitea            10.0.0.22   Gitea Server
+├── LXC 103  immich                       Immich
+└── LXC 111  ai-stack         10.0.0.60   Ollama :11434 | Open WebUI :3001
+
+Pangolin-Tunnel:
+├── ai.mo-code.at    → 10.0.0.60:3001   (Open WebUI, internal/VPN only)
+├── api.mo-code.at   → 10.0.0.50:8081   (API Gateway)
+└── auth.mo-code.at  → 10.0.0.50:8180   (Keycloak)
+```