Ollama v0.19
Massive local model speedup on Apple Silicon with MLX
Ollama v0.19 runs large language models locally on macOS through Apple's MLX framework, which exploits the unified memory architecture of M‑series chips and provides direct access to the GPU, including the Neural Accelerators in newer M‑series GPU cores. The integration improves both time‑to‑first‑token and token‑generation rate, with reported benchmarks of roughly 1,850 tokens per second during prefill and 134 tokens per second during decode under int4 quantization.
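As a rough way to observe those two metrics on a local install, the following Python sketch times time‑to‑first‑token and decode throughput against Ollama's streaming /api/generate endpoint. The model name "llama3.2" and the prompt are assumptions; substitute whatever model is installed, and note that stream chunks approximate, but do not exactly equal, single tokens.

```python
# Hypothetical benchmark sketch against a local Ollama server
# (http://localhost:11434). Measures time-to-first-token (TTFT)
# and decode throughput from the streaming response.
import json
import time

import requests

def benchmark(model: str = "llama3.2",  # assumed model name
              prompt: str = "Explain unified memory in one paragraph."):
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    # /api/generate streams newline-delimited JSON objects,
    # roughly one per generated token chunk.
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            if first_token_at is None:
                first_token_at = time.perf_counter()  # TTFT marker
            token_count += 1

    if first_token_at is None:
        print("no tokens generated")
        return
    end = time.perf_counter()
    ttft = first_token_at - start
    decode = (token_count - 1) / (end - first_token_at) if token_count > 1 else 0.0
    print(f"TTFT: {ttft:.3f}s  decode: {decode:.1f} tok/s over {token_count} chunks")

if __name__ == "__main__":
    benchmark()
```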
The release also supports NVIDIA's NVFP4 model format, a 4‑bit floating‑point quantization scheme that preserves accuracy while cutting memory‑bandwidth and storage demands. This lets users run models optimized with NVIDIA's tooling and obtain results comparable to production deployments.
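For intuition about why a block‑scaled 4‑bit float format can preserve accuracy, here is a simplified Python model of NVFP4‑style quantization, based on the public format description (16‑element blocks of E2M1 values sharing a scale). It omits details such as rounding the block scale to FP8 (E4M3) and the per‑tensor FP32 scale, and it is not Ollama's decoder.

```python
# Illustrative sketch of block-scaled 4-bit float quantization in the
# style of NVFP4. Simplified: block scales are kept in full precision
# rather than rounded to FP8, and the per-tensor scale is omitted.
import numpy as np

# The 16 values representable by an E2M1 4-bit float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

BLOCK = 16  # elements sharing one scale

def quantize(x: np.ndarray):
    """Quantize a 1-D tensor (length divisible by 16) to FP4 codes + scales."""
    blocks = x.reshape(-1, BLOCK)
    # Per-block scale so the largest magnitude maps near E2M1's max (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0
    # Nearest-value lookup into the positive half of the E2M1 table.
    codes = np.argmin(np.abs(np.abs(blocks / scales)[..., None] - E2M1[:8]), axis=-1)
    codes = np.where(blocks < 0, codes + 8, codes)  # set the sign bit
    return codes.astype(np.uint8), scales.squeeze(1)

def dequantize(codes, scales):
    return (E2M1[codes] * scales[:, None]).reshape(-1)

x = np.random.randn(64).astype(np.float32)
codes, scales = quantize(x)
print("max abs error:", np.abs(x - dequantize(codes, scales)).max())
```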
The caching system has been revised to reuse memory across conversations, place checkpoints intelligently, and apply smarter eviction policies. Together these changes lower overall memory usage and improve responsiveness for coding assistants and other agentic tasks that branch prompts or share a common system prompt.
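A minimal sketch of the prefix‑reuse idea follows, assuming a toy string‑keyed cache with LRU eviction; Ollama's real cache manages model KV state and checkpoints, not strings.

```python
# Toy illustration of prefix reuse with LRU eviction: two conversations
# that share a system prompt can reuse the cached state for that prefix.
# Purely illustrative; not Ollama's implementation.
from collections import OrderedDict

class PrefixCache:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self._entries: "OrderedDict[str, object]" = OrderedDict()

    def longest_prefix(self, prompt: str):
        """Return (prefix, state) for the longest cached prefix of `prompt`."""
        best = ""
        for prefix in self._entries:
            if prompt.startswith(prefix) and len(prefix) > len(best):
                best = prefix
        if not best:
            return "", None
        self._entries.move_to_end(best)  # mark as recently used
        return best, self._entries[best]

    def store(self, prefix: str, state) -> None:
        self._entries[prefix] = state
        self._entries.move_to_end(prefix)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

cache = PrefixCache()
cache.store("SYSTEM: You are a coding assistant.\n", state={"kv": "..."})
hit, state = cache.longest_prefix(
    "SYSTEM: You are a coding assistant.\nUSER: fix this bug")
print("reused prefix chars:", len(hit))
```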
Similar apps
Osaurus (AI Coding Agents): LLM server built on MLX
Kimi K2.6 (AI Coding Agents): Open-source SOTA for long-horizon coding and agent swarms
Msty (AI Coding Agents): Run LLMs locally
Ollamac (AI Agents & Automation): Interact with Ollama models
Swama (STEM Tools & Simulations): Machine-learning runtime
LM Studio (AI Coding Agents): Discover, download, and run local LLMs