LLMKube

Kubernetes operator for llama.cpp-native LLM inference with GPU scheduling, Apple Silicon Metal support, and an OpenAI-compatible API.

LLMKube is a Kubernetes operator that automates the deployment of LLM inference services built on llama.cpp and compatible runtimes. Users define a Model and an InferenceService in YAML; the operator handles model downloading and caching, health checks, scaling, and GPU scheduling across nodes, while exposing an OpenAI‑compatible HTTP API that works with existing SDKs and libraries such as LangChain, as sketched below.
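
As a rough sketch of that declarative workflow, the manifests below pair a Model with an InferenceService. The API group, field names (source, modelRef, replicas, resources), and values are illustrative assumptions, not the operator's confirmed schema.

```yaml
# Hypothetical manifests -- field names are illustrative placeholders,
# not LLMKube's confirmed CRD schema. A Model describes what to download
# and cache; an InferenceService describes how to serve it.
apiVersion: llmkube.example.dev/v1alpha1   # placeholder API group
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: https://example.com/models/llama-3-8b-q4.gguf  # GGUF, llama.cpp's format
  cache: true                 # keep the download on the node for reuse
---
apiVersion: llmkube.example.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-svc
spec:
  modelRef: llama-3-8b        # binds the service to the Model above
  replicas: 2                 # the operator scales and health-checks pods
  resources:
    cpu: "4"
    memory: 16Gi
    gpu: 1                    # scheduled onto a CUDA or Metal node
```

Once both objects are applied, the operator's reconciliation loop takes over the lifecycle steps listed above.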

The operator supports both NVIDIA CUDA GPUs and Apple Silicon Metal devices via a dedicated Metal Agent, allowing mixed‑hardware clusters to serve models without custom scripting. A CLI and Helm chart simplify installation on any Kubernetes cluster, and resources can be allocated with simple flags for CPU, memory, and GPU count. The project is open source under Apache‑2.0, self‑hostable, and free of subscription requirements.
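
Because the exposed endpoint speaks the OpenAI wire format, existing SDKs can point at it unchanged. Below is a minimal sketch using the official openai Python package; the service URL and model name are placeholders for whatever the operator actually exposes in a given cluster.

```python
# Minimal sketch: calling an LLMKube-served model with the standard
# openai SDK. The base_url and model id are placeholder assumptions;
# point them at the Service or Ingress the operator creates.
from openai import OpenAI

client = OpenAI(
    base_url="http://llama-3-8b-svc.default.svc.cluster.local:8080/v1",  # placeholder
    api_key="unused",  # the SDK requires a value even for self-hosted endpoints
)

response = client.chat.completions.create(
    model="llama-3-8b",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize Kubernetes operators in one line."}],
)
print(response.choices[0].message.content)
```

LangChain's OpenAI integrations accept the same kind of base-URL override, which is how the compatibility claim above typically plays out in practice.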

Target users are teams that need to run private or cost‑controlled LLM workloads on their own infrastructure, including air‑gapped or multi‑GPU environments. LLMKube removes the need for manual Docker Compose setups, providing a declarative, Kubernetes‑native workflow for model versioning, scaling, and API exposure.
