Awesome Local AI (ethicals7s)
Large “local AI” index; high recall; requires re-verification of locality claims.
Sovereign / Local-First AI • Atlas • Guide • Library
A practical, readable map of the modern local-AI stack — from inference engines to UIs, RAG + memory, agents + MCP tooling, and security + evaluation. Every tool includes inline links where it appears, plus hardening notes for high-risk surfaces.
The atlas is intentionally structured as layers. The correct way to use it is to build “downward”: choose your inference engine → expose a local API → attach UIs → attach data stores → add RAG/memory → add agents/tools (MCP) → instrument → attack/test.
“Self-hosted” tools can still be exploited via misconfiguration, exposed ports, malicious tool servers, or supply chain. This atlas therefore includes security notes and scanner tooling inline, not as an appendix.
These are discovery surfaces we mined and then filtered. They are useful for breadth, but none are treated as authoritative without verification.
Large “local AI” index; high recall; requires re-verification of locality claims.
Strong taxonomy across local LLM engines, UIs, RAG, agents, and more.
Local tooling list with excellent coverage of inference engines and local-first apps.
Ops + scaling + monitoring + memory + security tools for production agentic systems.
Huge catalog of MCP tool servers; must be filtered hard by locality & blast radius.
Historic snapshot of coding tools; archived Feb 23, 2026 (read-only).
This layer answers: Where do tokens come from? You need at least one inference engine plus a stable interface that higher layers can consume. In practice that means:
Minimal C/C++ inference for GGUF models; runs across CPU/GPU; ideal baseline engine.
High-throughput, memory-efficient inference & serving engine; production GPU workhorse.
High-performance serving framework for LLMs + multimodal; strong for structured/tool outputs.
Compiler + deployment engine for running models natively across devices (including browser/edge).
Use when you want sovereignty to extend to phones/devices/browsers without centralized servers.
Apple’s array framework for ML on Apple Silicon; pairs with MLX-LM for local LLM work.
Open-source OpenAI-compatible REST API for local inference (text, images, audio).
Local model runner + API; convenient for desktops and small servers — but must be network-hardened.
Proxy/gateway: routes many model providers through a unified OpenAI-style interface (local + cloud).
High-performance inference optimizations for LLMs on NVIDIA GPUs.
General-purpose inference server for many frameworks (TensorRT, PyTorch, ONNX, etc.).
Treat Triton as high-value infra. Keep it internal-only and patched; enterprise-grade servers attract enterprise-grade attacks.
UIs are deceptively dangerous: they hold tokens/keys, render untrusted content, and often provide “direct connections” to external model servers. Treat UIs like you would treat your admin panels.
Self-hosted AI platform designed to operate offline; supports Ollama + OpenAI-compatible APIs.
CVE-2025-64496 describes a Direct Connections code-injection vulnerability via SSE events that can lead to token theft, account takeover, and—when chained—backend RCE. Patch + reduce attack surface.
Feature-rich local AI web UI; supports GGUF via llama.cpp and many backends.
Full-stack “private ChatGPT” for chatting with docs; supports many local LLMs + vector DBs.
Use AnythingLLM when you want an integrated RAG workspace with minimal wiring. Treat it as a UI layer over your own local endpoints.
Open-source ChatGPT alternative aimed at fully local/offline usage with strong privacy posture.
Local model UI + local server; can operate entirely offline (closed-source app).
Useful as a convenience interface on individual machines. Do not make it a single point of failure or the only ingress to your models.
Open-source IDE assistant; point it at LocalAI/vLLM/Ollama for local-first coding workflows.
Terminal-based AI pair programming; edits your repo through git-friendly patches.
Direct Connections lets Open WebUI connect to external model servers. CVE-2025-64496 documents a code injection route via SSE events that can steal auth tokens and lead to takeover and backend RCE when chained.
Security researchers documented ~175,000 publicly exposed Ollama hosts due to unsafe network binding. Treat model servers as internal services; never expose raw endpoints.
This layer answers: Where does context come from? You need (a) storage (structured/unstructured), (b) embeddings/vector search, (c) retrieval pipelines, and (d) document ingestion.
Open-source vector DB + similarity search (Rust); self-hostable, scalable.
Open-source, scalable vector DB; runs from laptop to distributed systems.
Vector similarity search inside Postgres; keep vectors with the rest of your data.
In-process SQL OLAP DB (MIT); ideal for local analytics and embedded pipelines.
OSS embedded retrieval + vector search for multimodal AI data; SQL + vectors.
Fully managed, serverless vector DB (cloud). Included for boundary awareness only.
It’s widely used in RAG stacks; we include it only to explicitly mark “cloud vector DB” as a Synthetic boundary option.
Popular orchestration + RAG framework; huge integrations ecosystem (local + cloud).
Stateful agent/workflow graphs (LangChain ecosystem); ideal for long-running, controllable flows.
RAG & data framework; strong indexing abstractions; supports LocalAI via OpenAI-compatible interface.
Open-source orchestration framework for RAG and agents; modular and transparent pipelines.
Production-ready private document Q&A that can run without internet; “no data leaves environment”.
Memory is not “chat history.” It’s a system: what gets stored, how it’s retrieved, how it’s updated, and how it’s prevented from becoming an injection vector.
The most comprehensive resource index we audited: IAAR-Shanghai / Awesome-AI-Memory. Use it as the research+pattern map; keep implementations inside your own perimeter.
“Universal memory layer” for agents; stores and recalls personalized context across sessions.
Memory Operating System: unified store/retrieve/manage for long-term agent memory.
Context engineering & memory platform; assembles relevant context from histories and data sources.
Temporally-aware knowledge graphs for agents in dynamic environments (memory beyond vectors).
Agents are where systems become dangerous: they decide, call tools, write files, run code, and mutate state. The job of this layer is to make that power bounded, observable, and attack-tested.
Stateful workflow graphs for long-running agents; good for explicit control and failure boundaries.
Lean multi-agent framework; role-based orchestration with high-level + low-level control.
Multi-agent framework for autonomous or human-in-the-loop workflows.
Agents should operate in containers/microVMs with scoped filesystem mounts and default-deny egress. If you must allow tool access, do it via MCP servers that enforce least privilege (see next layer).
MCP (Model Context Protocol) is the “tool port” that connects models to resources and actions. This layer exists because the moment an agent can use tools, it can exfiltrate, corrupt, and escalate. Treat MCP servers like plugins with code execution risk.
Start here for protocol-level grounding: modelcontextprotocol.io and the specification.
Visual testing + proxy tool to run and debug MCP servers safely during development.
Pythonic way to build MCP servers; widely used in MCP ecosystems.
Self-hosted microVM sandbox for running untrusted workloads fast with strong isolation.
Sandboxing is your “blast radius limiter” for agent code execution, browser tools, file conversion, and data analysis.
Security scanner for MCP servers (prompt injection, tool poisoning, escalation patterns).
Massive catalog: punkpeye/awesome-mcp-servers. Use it only after applying the policy above.
Observability is not optional: agents are multi-step systems. If you cannot trace calls, tool usage, retrieval, and state transitions, you cannot prove what happened — and you cannot harden.
Open-source LLM observability + prompt management + eval; self-hostable.
Langfuse states that by default it reports basic usage statistics of self-hosted instances to PostHog. For sovereign deployments, explicitly disable analytics and enforce with network egress controls.
Open-source AI observability: tracing, evaluation, debugging for LLM apps and agents.
Open-source analytics platform; can be self-hosted and used as an internal metrics sink.
If you use analytics at all, run it inside your own perimeter. Do not leak LLM traces or prompts to third-party SaaS by accident.
This layer exists to answer: Can the system be tricked? You need continuous red-teaming, scanner tooling, and repeatable evaluation harnesses.
Evaluate prompts/agents/RAG; includes red teaming and vulnerability scanning workflows.
Security scanner for agent workflows; generates reports on vulnerabilities and operational risks.
Full-stack AI red teaming: infra scan, MCP scan, agent skills scan, jailbreak evaluation.
Scanner designed to audit installed MCP servers for prompt injection, poisoning, escalation patterns.
Coding AI is a special case: the data (your codebase) is extremely sensitive, and the tool has direct write access to production systems. This atlas therefore treats local, self-hostable code AI as default.
Self-hosted AI coding assistant; open-source alternative to GitHub Copilot.
IDE front-end; pair it with Tabby + LocalAI for a fully local “copilot stack”.
CLI refactor agent; keeps work repo-scoped and git-aware (diffs/commits).
If you use any cloud code assistant at all, do it in a separate, quarantined environment (no secrets, no prod keys, no canonical repos). Keep the sovereign codebase local-first.
Historic catalog (archived): sourcegraph/awesome-code-ai.
Sovereignty requires voice and vision to remain local as well. These tools provide offline speech-to-text, text-to-speech, and vision-language capability.
C/C++ port of Whisper for local speech-to-text; runs offline.
Open-source deep learning TTS toolkit; inference + training/fine-tuning.
Vision-language model project (NeurIPS 2023); local VLM building block.
Small vision-language model designed for edge efficiency and broad device support.
These profiles help you pick an appropriate complexity level. Each is a “small set of parts” drawn from the layers above.
Highest safety, lowest complexity. No agents, no MCP, minimal moving parts.
Daily driver: local models, RAG, observability, local coding assistant.
Maximum capability, strictly sandboxed. MCP + agents + scanners are mandatory.
Save this file as sovereign-local-ai-atlas.html and open it in any browser.
If you want to host it, you can drop it into any static site host (GitHub Pages, Netlify, your own server).
@media print.End of document. Back to top ↑