VibeHunt
Ollama v0.19

Massive local model speedup on Apple Silicon with MLX

Ollama runs large language models locally on macOS through Apple’s MLX framework, which exploits the unified memory architecture and the GPU Neural Accelerators on M‑series chips. The integration improves both time‑to‑first‑token and token‑generation rates: with int4 quantization, benchmarks show roughly 1,850 tokens per second during prefill and 134 tokens per second during decode.
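To make those rates concrete, a small back-of-the-envelope calculator (the helper name and default rates are illustrative; real throughput varies with model size, context length, and hardware):

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=1850.0, decode_tps=134.0):
    """Estimate end-to-end generation time from the quoted int4 rates.

    Returns (time_to_first_token, total_time) in seconds. Hypothetical
    helper for illustration; defaults are the benchmark figures above.
    """
    ttft = prompt_tokens / prefill_tps     # prompt processing (prefill)
    decode = output_tokens / decode_tps    # steady-state generation
    return ttft, ttft + decode

# A 2,000-token prompt with a 500-token reply:
ttft, total = estimate_latency(2000, 500)
print(f"TTFT ≈ {ttft:.2f}s, total ≈ {total:.2f}s")  # ≈ 1.08s / 4.81s
```

At these rates, prefill dominates only for very long prompts; most of the wall-clock time goes to decode.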

It also supports NVIDIA’s NVFP4 model format, which preserves high accuracy while cutting memory bandwidth and storage requirements. Users can run models optimized with NVIDIA’s tooling and expect accuracy comparable to production deployments.
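The storage savings follow from the format itself: NVFP4 stores 4-bit values in small blocks, each sharing an FP8 scale (per NVIDIA’s published description; the sketch below ignores the per-tensor scale and any layers left unquantized):

```python
def nvfp4_bytes(n_params, block=16):
    """Approximate NVFP4 weight footprint: 4 bits per value plus one
    8-bit (FP8) scale per `block` elements. Rough estimate only --
    omits the per-tensor scale and non-quantized layers."""
    bits = n_params * (4 + 8 / block)   # 4.5 bits/weight at block=16
    return bits / 8

fp16_bytes = 7e9 * 2            # a 7B-parameter model at FP16: ~14 GB
fp4_bytes = nvfp4_bytes(7e9)    # ~3.94 GB
print(fp16_bytes / fp4_bytes)   # ≈ 3.6x smaller than FP16
```

The block scales are what let NVFP4 keep accuracy close to higher-precision formats at roughly a quarter of the memory traffic.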

The caching system has been revised to reuse memory across conversations, employ intelligent checkpoint placement, and apply smarter eviction policies. These changes lower overall memory usage and improve responsiveness for coding assistants and other agentic tasks that involve branching prompts or shared system prompts.
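The idea behind reuse across conversations can be sketched with a toy prefix cache (this is an illustration of the general technique, not Ollama’s actual implementation): checkpoints are keyed by token prefixes, and a new prompt resumes from the longest cached match.

```python
class PrefixCache:
    """Toy cross-conversation prefix reuse: branching prompts that share
    a system prompt resume from the longest saved checkpoint instead of
    recomputing it. Hypothetical sketch, not Ollama's internals."""

    def __init__(self):
        self._checkpoints = {}  # token-tuple prefix -> opaque KV state

    def save(self, tokens, kv_state):
        self._checkpoints[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_state) for the longest saved prefix."""
        best, state = 0, None
        for prefix, kv in self._checkpoints.items():
            n = len(prefix)
            if n > best and tuple(tokens[:n]) == prefix:
                best, state = n, kv
        return best, state

cache = PrefixCache()
cache.save([1, 2, 3], "kv@3")              # shared system prompt
cache.save([1, 2, 3, 4, 5], "kv@5")        # one conversation branch
print(cache.longest_prefix([1, 2, 3, 4, 9]))  # → (3, 'kv@3')
```

A real implementation would place checkpoints at likely branch points and evict by recency and size; the payoff is that only tokens past the matched prefix need prefill.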
