Sovereign Local AI Atlas — Runtime → RAG → Agents → Security

1) How to read this atlas

The atlas is intentionally structured as layers. The correct way to use it is to build “downward”: choose your inference engine → expose a local API → attach UIs → attach data stores → add RAG/memory → add agents/tools (MCP) → instrument → attack/test.

Principle: Local ≠ Safe

“Self-hosted” tools can still be exploited via misconfiguration, exposed ports, malicious tool servers, or supply chain. This atlas therefore includes security notes and scanner tooling inline, not as an appendix.

Trust tiers are operational constraints

Sovereign Core: safe defaults possible; strong foundation.
Hybrid: safe only with explicit configuration + network restrictions.
Synthetic-Edge: treated as a boundary — isolate credentials, memory, logs, and network.

2) Upstream “source feeds” (audited lists)

These are discovery surfaces we mined and then filtered. They are useful for breadth, but none are treated as authoritative without verification.

Awesome Local AI (ethicals7s)

Large “local AI” index; high recall; requires re-verification of locality claims.

Source Feed

GitHub

Awesome Local LLM (rafska)

Strong taxonomy across local LLM engines, UIs, RAG, agents, and more.

Source Feed

GitHub

Jan’s awesome-local-ai

Local tooling list with excellent coverage of inference engines and local-first apps.

Source Feed

GitHub Jan repo Jan site

Awesome Production Agentic Systems (EthicalML)

Ops + scaling + monitoring + memory + security tools for production agentic systems.

Source Feed

GitHub

Awesome MCP Servers (punkpeye)

Huge catalog of MCP tool servers; must be filtered hard by locality & blast radius.

Source Feed

GitHub

Awesome Code AI (Sourcegraph) — archived

Historic snapshot of coding tools; archived Feb 23, 2026 (read-only).

Source Feed

GitHub

3) Layer: Inference & Runtime

This layer answers: Where do tokens come from? You need at least one inference engine plus a stable interface that higher layers can consume. In practice that means:

Engine (llama.cpp / vLLM / SGLang / etc.)
API surface (LocalAI / vLLM OpenAI-compatible server / etc.)
Runtime governance (ports, auth, patching cadence, model provenance)

Inference Engines (run the model)

llama.cpp

Minimal C/C++ inference for GGUF models; runs across CPU/GPU; ideal baseline engine.

Sovereign Core Airgap-ready

GitHub Build guide

Why it’s here + how to use it

Role

Baseline local inference engine; also provides an OpenAI-compatible HTTP server in examples.

Best for

Laptops, workstations, minimal deployments, GGUF workflows, tight control.

Notes

Pin a known-good version and treat model files as artifacts with hashes.

vLLM

High-throughput, memory-efficient inference & serving engine; production GPU workhorse.

Sovereign Core OpenAI-compatible server

Site Docs GitHub

Why it’s here + how to use it

Use as your internal “model server” behind a reverse proxy/VPN.
Expose OpenAI-style endpoints where possible to maximize compatibility.
Pair with LocalAI or LiteLLM if you need a unified gateway across multiple backends.

SGLang

High-performance serving framework for LLMs + multimodal; strong for structured/tool outputs.

Sovereign Core Structured outputs

GitHub

Why it’s here + how to use it

Pick SGLang when agent tool-use and JSON contracts matter.
Works from single GPU to distributed clusters; treat it like vLLM (model server layer).

MLC-LLM

Compiler + deployment engine for running models natively across devices (including browser/edge).

Sovereign Core Edge

GitHub Docs/Site

Where it fits

Use when you want sovereignty to extend to phones/devices/browsers without centralized servers.

MLX (Apple Silicon)

Apple’s array framework for ML on Apple Silicon; pairs with MLX-LM for local LLM work.

Sovereign Core Apple

GitHub Docs MLX-LM

OpenAI-Compatible Local APIs (make everything plug in)

LocalAI

Open-source OpenAI-compatible REST API for local inference (text, images, audio).

Sovereign Core Airgap-ready

Site GitHub

Why it matters

Acts as your “drop-in OpenAI replacement” inside the sovereign perimeter.
Lets UIs, IDE plugins, RAG frameworks, and agents reuse the same API shape without cloud keys.

Ollama

Local model runner + API; convenient for desktops and small servers — but must be network-hardened.

Sovereign Core Sandbox-required

Site Docs Exposure report

Hardening checklist (read before deploying)

Do not expose to the public internet (large-scale misconfiguration incidents have been documented).
Bind to localhost or place behind VPN/reverse proxy with auth.
Keep tool-calling / file access out of the model server unless sandboxed.

LiteLLM

Proxy/gateway: routes many model providers through a unified OpenAI-style interface (local + cloud).

Hybrid Cloud gravity

GitHub Docs

Sovereign usage pattern

Use LiteLLM as an internal router where local endpoints are default.
Disable cloud backends unless you explicitly create a “Synthetic-Edge” profile.

Optional performance stacks (use only if you need them)

TensorRT-LLM (NVIDIA)

High-performance inference optimizations for LLMs on NVIDIA GPUs.

Hybrid Vendor gravity

GitHub Quickstart

Triton Inference Server (NVIDIA)

General-purpose inference server for many frameworks (TensorRT, PyTorch, ONNX, etc.).

Hybrid Patch discipline

GitHub Docs

Security note

Treat Triton as high-value infra. Keep it internal-only and patched; enterprise-grade servers attract enterprise-grade attacks.

4) Layer: Human Interfaces (UIs)

UIs are deceptively dangerous: they hold tokens/keys, render untrusted content, and often provide “direct connections” to external model servers. Treat UIs like you would treat your admin panels.

General chat/workbench UIs

Open WebUI

Self-hosted AI platform designed to operate offline; supports Ollama + OpenAI-compatible APIs.

Sovereign Core Patch-required

Site Docs GitHub CVE

Security-critical note (read)

CVE-2025-64496 describes a Direct Connections code-injection vulnerability via SSE events that can lead to token theft, account takeover, and—when chained—backend RCE. Patch + reduce attack surface.

Advisory detail: GitHub Advisory
Deep dive: Cato CTRL analysis
Operational: pin patched versions; gate/disable direct external model connections.

Text Generation WebUI (oobabooga)

Feature-rich local AI web UI; supports GGUF via llama.cpp and many backends.

Sovereign Core Airgap-ready

GitHub

Usage note

Prefer local-only bindings; avoid exposing the UI publicly.
Use as a “lab bench” UI when you need deep model controls (LoRAs, extensions, etc.).

AnythingLLM

Full-stack “private ChatGPT” for chatting with docs; supports many local LLMs + vector DBs.

Sovereign Core RAG suite

Site GitHub Docs repo

Where it fits

Use AnythingLLM when you want an integrated RAG workspace with minimal wiring. Treat it as a UI layer over your own local endpoints.

Jan (desktop)

Open-source ChatGPT alternative aimed at fully local/offline usage with strong privacy posture.

Sovereign Core Airgap-ready

Site GitHub Data folder

LM Studio

Local model UI + local server; can operate entirely offline (closed-source app).

Local-Closed Offline

Offline docs Docs

Sovereign stance

Useful as a convenience interface on individual machines. Do not make it a single point of failure or the only ingress to your models.

Developer/coding UIs

Continue (IDE extension)

Open-source IDE assistant; point it at LocalAI/vLLM/Ollama for local-first coding workflows.

Sovereign Core

GitHub Docs Site

How to keep it local

Configure model provider endpoints to your internal OpenAI-compatible URL (LocalAI/vLLM/etc.).
Disable “cloud models” in profiles unless explicitly using a Synthetic-Edge environment.

Aider (CLI pair programmer)

Terminal-based AI pair programming; edits your repo through git-friendly patches.

Sovereign Core Repo-scoped

GitHub Docs

UI security notes (non-optional reading)

Open WebUI — Direct Connections CVE (code injection → takeover)

Direct Connections lets Open WebUI connect to external model servers. CVE-2025-64496 documents a code injection route via SSE events that can steal auth tokens and lead to takeover and backend RCE when chained.

NVD: CVE-2025-64496
Cato analysis: CVE deep dive

Ollama — massive exposure incidents (misconfiguration)

Security researchers documented ~175,000 publicly exposed Ollama hosts due to unsafe network binding. Treat model servers as internal services; never expose raw endpoints.

The Hacker News: Investigation summary
Ollama docs: Official documentation

5) Layer: Data, Vectors, RAG

This layer answers: Where does context come from? You need (a) storage (structured/unstructured), (b) embeddings/vector search, (c) retrieval pipelines, and (d) document ingestion.

Vector stores (local-first)

Qdrant

Open-source vector DB + similarity search (Rust); self-hostable, scalable.

Sovereign Core

Site Quickstart GitHub

Milvus

Open-source, scalable vector DB; runs from laptop to distributed systems.

Sovereign Core

Docs Local quickstart GitHub

pgvector (PostgreSQL extension)

Vector similarity search inside Postgres; keep vectors with the rest of your data.

Sovereign Core Single DB

GitHub

DuckDB

In-process SQL OLAP DB (MIT); ideal for local analytics and embedded pipelines.

Sovereign Core Embedded

Site GitHub

LanceDB

OSS embedded retrieval + vector search for multimodal AI data; SQL + vectors.

Sovereign Core Embedded

GitHub Site

Pinecone

Fully managed, serverless vector DB (cloud). Included for boundary awareness only.

Synthetic-Edge

Site Docs

Why it’s here

It’s widely used in RAG stacks; we include it only to explicitly mark “cloud vector DB” as a Synthetic boundary option.

RAG frameworks (local-capable, but must be constrained)

LangChain

Popular orchestration + RAG framework; huge integrations ecosystem (local + cloud).

Hybrid Disable hosted services

Docs GitHub

Sovereign usage pattern

Route model calls to LocalAI / vLLM / SGLang / Ollama endpoints only.
Use local vector stores (pgvector/Qdrant/Milvus/LanceDB).
Avoid hosted tracing or cloud default configs unless explicitly segmented.

LangGraph

Stateful agent/workflow graphs (LangChain ecosystem); ideal for long-running, controllable flows.

Hybrid Graph workflows

Docs GitHub

LlamaIndex

RAG & data framework; strong indexing abstractions; supports LocalAI via OpenAI-compatible interface.

Hybrid

LocalAI integration

Haystack

Open-source orchestration framework for RAG and agents; modular and transparent pipelines.

Sovereign Core

Docs GitHub

PrivateGPT (Zylon)

Production-ready private document Q&A that can run without internet; “no data leaves environment”.

Sovereign Core Offline-first

GitHub

6) Layer: Memory (agent persistence)

Memory is not “chat history.” It’s a system: what gets stored, how it’s retrieved, how it’s updated, and how it’s prevented from becoming an injection vector.

Canonical memory index

The most comprehensive resource index we audited: IAAR-Shanghai / Awesome-AI-Memory. Use it as the research+pattern map; keep implementations inside your own perimeter.

Mem0

“Universal memory layer” for agents; stores and recalls personalized context across sessions.

Hybrid Self-host review

GitHub Site

Sovereign stance

Use only when fully self-hosted with local DBs and controlled egress.
Keep memory writes gated; run red-team tests against memory recall paths.

MemOS

Memory Operating System: unified store/retrieve/manage for long-term agent memory.

Sovereign Core System-level memory

GitHub Paper (PDF)

Zep

Context engineering & memory platform; assembles relevant context from histories and data sources.

Hybrid Self-host required

GitHub Site

Graphiti (knowledge-graph memory)

Temporally-aware knowledge graphs for agents in dynamic environments (memory beyond vectors).

Sovereign Core Graph memory

GitHub

Memory threat model (always apply)

Prompt injection via stored memory: malicious content can persist and re-trigger later.
Cross-tenant contamination: if multi-user, enforce hard boundaries (DB row-level policies, separate indices, separate keys).
Unbounded growth: require summarization/compression/expiry policies.

7) Layer: Agents & Orchestration

Agents are where systems become dangerous: they decide, call tools, write files, run code, and mutate state. The job of this layer is to make that power bounded, observable, and attack-tested.

LangGraph

Stateful workflow graphs for long-running agents; good for explicit control and failure boundaries.

Hybrid Local-capable

Docs GitHub

CrewAI

Lean multi-agent framework; role-based orchestration with high-level + low-level control.

Hybrid Cloud examples exist

GitHub Docs

Sovereign use

Point all LLM calls to your local API (LocalAI/vLLM/SGLang).
Tooling via MCP should be sandboxed; do not grant host filesystem by default.

AutoGen (Microsoft)

Multi-agent framework for autonomous or human-in-the-loop workflows.

Hybrid Local-capable

GitHub Docs

Agent rule: no direct host access

Agents should operate in containers/microVMs with scoped filesystem mounts and default-deny egress. If you must allow tool access, do it via MCP servers that enforce least privilege (see next layer).

8) Layer: MCP (Tools) + Sandbox

MCP (Model Context Protocol) is the “tool port” that connects models to resources and actions. This layer exists because the moment an agent can use tools, it can exfiltrate, corrupt, and escalate. Treat MCP servers like plugins with code execution risk.

Official MCP specification

Start here for protocol-level grounding: modelcontextprotocol.io and the specification.

MCP Inspector

Visual testing + proxy tool to run and debug MCP servers safely during development.

Sovereign Core Dev tooling

GitHub

FastMCP (Python)

Pythonic way to build MCP servers; widely used in MCP ecosystems.

Sovereign Core

GitHub

Microsandbox

Self-hosted microVM sandbox for running untrusted workloads fast with strong isolation.

Sovereign Core Sandbox-required

GitHub

Why it matters

Sandboxing is your “blast radius limiter” for agent code execution, browser tools, file conversion, and data analysis.

mcp-scan

Security scanner for MCP servers (prompt injection, tool poisoning, escalation patterns).

Sovereign Core Security

GitHub

MCP policy you can actually enforce

Local-first MCP: filesystem + DB + internal services only.
Synthetic-edge MCP: cloud SaaS servers live in separate profiles/VMs; no shared memory/logging.
Always scan new MCP servers (mcp-scan) before enabling.
Always sandbox servers that execute code, browse web, or parse untrusted documents.

Discovery feed (MCP servers)

Massive catalog: punkpeye/awesome-mcp-servers. Use it only after applying the policy above.

9) Layer: Observability

Observability is not optional: agents are multi-step systems. If you cannot trace calls, tool usage, retrieval, and state transitions, you cannot prove what happened — and you cannot harden.

Langfuse

Open-source LLM observability + prompt management + eval; self-hostable.

Sovereign Core Telemetry default

Self-hosting Telemetry FAQ Docs

Important note: disable telemetry

Langfuse states that by default it reports basic usage statistics of self-hosted instances to PostHog. For sovereign deployments, explicitly disable analytics and enforce with network egress controls.

Discussion example: Disable PostHog errors

Phoenix (Arize)

Open-source AI observability: tracing, evaluation, debugging for LLM apps and agents.

Sovereign Core Tracing

GitHub Docs LLM tracing

PostHog (self-hosted)

Open-source analytics platform; can be self-hosted and used as an internal metrics sink.

Sovereign Core Self-hosted

Self-host docs

Why it’s here

If you use analytics at all, run it inside your own perimeter. Do not leak LLM traces or prompts to third-party SaaS by accident.

Production deployment helpers (optional)

When you need reproducible serving and scale, consider these production frameworks (self-hosted):

BentoML (model serving framework)
Ray Serve (scalable model serving)

10) Layer: Evaluation & Security

This layer exists to answer: Can the system be tricked? You need continuous red-teaming, scanner tooling, and repeatable evaluation harnesses.

promptfoo

Evaluate prompts/agents/RAG; includes red teaming and vulnerability scanning workflows.

Sovereign Core Red team

GitHub Red team docs

How to use it in a sovereign stack

Attack your own prompts, tools, and RAG pipelines before attackers do.
Run in CI (PR checks) and scheduled jobs (regression tests).

Agentic Radar

Security scanner for agent workflows; generates reports on vulnerabilities and operational risks.

Sovereign Core

GitHub

AI-Infra-Guard

Full-stack AI red teaming: infra scan, MCP scan, agent skills scan, jailbreak evaluation.

Sovereign Core

GitHub Docs/Site

MCP-Scan

Scanner designed to audit installed MCP servers for prompt injection, poisoning, escalation patterns.

Sovereign Core

GitHub

Security is a stack (not a checkbox)

UI security: patch CVEs; render untrusted content safely.
Model server security: no public exposure; auth; internal networks.
MCP security: sandbox + scan tool servers; least privilege; isolate cloud MCP.
Memory security: guard writes and retrieval; prevent injection persistence; partition tenants.
Continuous evaluation: promptfoo + workflow scanners; regression tests before releases.

11) Layer: Code AI (local copilot)

Coding AI is a special case: the data (your codebase) is extremely sensitive, and the tool has direct write access to production systems. This atlas therefore treats local, self-hostable code AI as default.

Tabby

Self-hosted AI coding assistant; open-source alternative to GitHub Copilot.

Sovereign Core

GitHub Site

Typical wiring

Tabby runs as your completion backend.
Continue is the IDE front-end.
LocalAI/vLLM can be used for chat/refactor agents when desired.

Continue

IDE front-end; pair it with Tabby + LocalAI for a fully local “copilot stack”.

Sovereign Core

Docs GitHub

Aider

CLI refactor agent; keeps work repo-scoped and git-aware (diffs/commits).

Sovereign Core

Docs GitHub

Rule for code AI

If you use any cloud code assistant at all, do it in a separate, quarantined environment (no secrets, no prod keys, no canonical repos). Keep the sovereign codebase local-first.

Historic catalog (archived): sourcegraph/awesome-code-ai.

12) Layer: Multimodal & Speech

Sovereignty requires voice and vision to remain local as well. These tools provide offline speech-to-text, text-to-speech, and vision-language capability.

whisper.cpp

C/C++ port of Whisper for local speech-to-text; runs offline.

Sovereign Core Offline STT

GitHub

Coqui TTS

Open-source deep learning TTS toolkit; inference + training/fine-tuning.

Sovereign Core TTS

GitHub

LLaVA

Vision-language model project (NeurIPS 2023); local VLM building block.

Sovereign Core VLM

GitHub Project site

Moondream

Small vision-language model designed for edge efficiency and broad device support.

Sovereign Core Edge VLM

GitHub

13) Reference Profiles (Minimal / Dev / Lab)

These profiles help you pick an appropriate complexity level. Each is a “small set of parts” drawn from the layers above.

Profile A — Minimal Airgapped Node

Highest safety, lowest complexity. No agents, no MCP, minimal moving parts.

Sovereign Core Airgap-ready

Components

Inference: llama.cpp (or vLLM if GPU)
API: LocalAI
UI: Jan or Text Generation WebUI
RAG: optional PrivateGPT
Security: periodic promptfoo runs

Profile B — Sovereign Dev Workstation/Homelab

Daily driver: local models, RAG, observability, local coding assistant.

Sovereign Core

Components

Inference: vLLM or SGLang
API: LocalAI (optional LiteLLM router)
UI: Open WebUI (patched + locked down) or AnythingLLM
Vectors: pgvector or Qdrant
RAG: Haystack or constrained LangChain
Observability: Phoenix or Langfuse (telemetry disabled)
Code AI: Tabby + Continue + Aider

Profile C — Quarantined Agentic Lab

Maximum capability, strictly sandboxed. MCP + agents + scanners are mandatory.

Hybrid Sandbox-required

Components

Everything from Profile B, plus:
Agents: LangGraph / CrewAI / AutoGen
MCP: MCP Inspector + FastMCP + strict server allowlists
Sandbox: Microsandbox for code execution, browsing, document parsing
Security: mcp-scan, Agentic Radar, AI-Infra-Guard, plus promptfoo red-teaming

Save & publish

Save this file as sovereign-local-ai-atlas.html and open it in any browser. If you want to host it, you can drop it into any static site host (GitHub Pages, Netlify, your own server).

Optional enhancements (still single-file)

Add a small JS search box to filter tool cards.
Add “copy link” buttons for section anchors.
Add printable styles with @media print.

End of document. Back to top ↑

1) How to read this atlas

Principle: Local ≠ Safe

Trust tiers are operational constraints

Recommended reading approach

2) Upstream “source feeds” (audited lists)

3) Layer: Inference & Runtime

Inference Engines (run the model)

OpenAI-Compatible Local APIs (make everything plug in)

Optional performance stacks (use only if you need them)

4) Layer: Human Interfaces (UIs)

General chat/workbench UIs

Developer/coding UIs

UI security notes (non-optional reading)

Open WebUI — Direct Connections CVE (code injection → takeover)

Ollama — massive exposure incidents (misconfiguration)

5) Layer: Data, Vectors, RAG

Vector stores (local-first)

RAG frameworks (local-capable, but must be constrained)

6) Layer: Memory (agent persistence)

Canonical memory index

Memory threat model (always apply)

7) Layer: Agents & Orchestration

Agent rule: no direct host access

8) Layer: MCP (Tools) + Sandbox

Official MCP specification

MCP policy you can actually enforce

Discovery feed (MCP servers)

9) Layer: Observability

Production deployment helpers (optional)

10) Layer: Evaluation & Security

Security is a stack (not a checkbox)

11) Layer: Code AI (local copilot)

Rule for code AI

12) Layer: Multimodal & Speech

13) Reference Profiles (Minimal / Dev / Lab)

Save & publish

Optional enhancements (still single-file)