VibeHunt
Ollama v0.19

Massive local model speedup on Apple Silicon with MLX

Ollama runs large language models locally on macOS through Apple’s MLX framework, which exploits the unified memory architecture and the GPU Neural Accelerators on M‑series chips. The integration improves both time‑to‑first‑token and token‑generation rates: with int4 quantization, benchmarks show roughly 1,850 tokens per second during prefill and 134 tokens per second during decode.
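To make those rates concrete, a small back-of-the-envelope calculator (the helper name and default rates are illustrative; real throughput varies with model size, context length, and hardware):

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=1850.0, decode_tps=134.0):
    """Estimate end-to-end generation time from the quoted int4 rates.

    Returns (time_to_first_token, total_time) in seconds. Hypothetical
    helper for illustration; defaults are the benchmark figures above.
    """
    ttft = prompt_tokens / prefill_tps     # prompt processing (prefill)
    decode = output_tokens / decode_tps    # steady-state generation
    return ttft, ttft + decode

# A 2,000-token prompt with a 500-token reply:
ttft, total = estimate_latency(2000, 500)
print(f"TTFT ≈ {ttft:.2f}s, total ≈ {total:.2f}s")  # ≈ 1.08s / 4.81s
```

At these rates, prefill dominates only for very long prompts; most of the wall-clock time goes to decode.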

It also supports NVIDIA’s NVFP4 model format, which preserves high accuracy while cutting memory bandwidth and storage requirements. Users can run models optimized with NVIDIA’s tooling and expect accuracy comparable to production deployments.
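The storage savings follow from the format itself: NVFP4 stores 4-bit values in small blocks, each sharing an FP8 scale (per NVIDIA’s published description; the sketch below ignores the per-tensor scale and any layers left unquantized):

```python
def nvfp4_bytes(n_params, block=16):
    """Approximate NVFP4 weight footprint: 4 bits per value plus one
    8-bit (FP8) scale per `block` elements. Rough estimate only --
    omits the per-tensor scale and non-quantized layers."""
    bits = n_params * (4 + 8 / block)   # 4.5 bits/weight at block=16
    return bits / 8

fp16_bytes = 7e9 * 2            # a 7B-parameter model at FP16: ~14 GB
fp4_bytes = nvfp4_bytes(7e9)    # ~3.94 GB
print(fp16_bytes / fp4_bytes)   # ≈ 3.6x smaller than FP16
```

The block scales are what let NVFP4 keep accuracy close to higher-precision formats at roughly a quarter of the memory traffic.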

The caching system has been revised to reuse memory across conversations, employ intelligent checkpoint placement, and apply smarter eviction policies. These changes lower overall memory usage and improve responsiveness for coding assistants and other agentic tasks that involve branching prompts or shared system prompts.
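The idea behind reuse across conversations can be sketched with a toy prefix cache (this is an illustration of the general technique, not Ollama’s actual implementation): checkpoints are keyed by token prefixes, and a new prompt resumes from the longest cached match.

```python
class PrefixCache:
    """Toy cross-conversation prefix reuse: branching prompts that share
    a system prompt resume from the longest saved checkpoint instead of
    recomputing it. Hypothetical sketch, not Ollama's internals."""

    def __init__(self):
        self._checkpoints = {}  # token-tuple prefix -> opaque KV state

    def save(self, tokens, kv_state):
        self._checkpoints[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_state) for the longest saved prefix."""
        best, state = 0, None
        for prefix, kv in self._checkpoints.items():
            n = len(prefix)
            if n > best and tuple(tokens[:n]) == prefix:
                best, state = n, kv
        return best, state

cache = PrefixCache()
cache.save([1, 2, 3], "kv@3")              # shared system prompt
cache.save([1, 2, 3, 4, 5], "kv@5")        # one conversation branch
print(cache.longest_prefix([1, 2, 3, 4, 9]))  # → (3, 'kv@3')
```

A real implementation would place checkpoints at likely branch points and evict by recency and size; the payoff is that only tokens past the matched prefix need prefill.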
