Ollama GPU Memory Bug: Unable To Allocate CUDA Buffer Error Explained
Have you ever been deep into a project with Ollama, loaded up a couple of hefty models, and then hit a brick wall when trying to load a smaller one? You might see an error message like "unable to allocate CUDA buffer," and it can be super confusing, especially when your GPUs look like they have plenty of space. Well, it turns out there's a sneaky bug in how Ollama calculates GPU memory, and it's causing this exact headache for some users, particularly when dealing with small models after large ones have already taken up residence.
This article dives deep into this specific Ollama integer underflow bug, explaining why it happens, how it tricks Ollama into thinking there's tons of free memory when there isn't, and what you can do about it. We'll break down the technical details in a way that's understandable, even if you're not a deep CUDA expert, and offer practical workarounds to keep your workflow smooth.
The Mysterious "Unable to Allocate CUDA Buffer" Error
The core of the problem lies in how Ollama manages and allocates memory on your Graphics Processing Units (GPUs). When you load a model, Ollama needs to reserve a certain amount of VRAM (Video Random Access Memory) on your GPU. If it can't find enough contiguous free memory for the model's layers, it throws the dreaded "unable to allocate CUDA buffer" error. This usually signifies a genuine lack of VRAM. However, in this specific scenario, Ollama is being misled by a calculation error.
Imagine you have three powerful NVIDIA L40S GPUs, each with a generous 48GB of VRAM. You load two massive models, say a 70B parameter model (llama3.1:70b) and then a 32B model (qwen2.5:32b). These big boys gobble up a significant chunk of your VRAM, leaving, for example, GPU 0 with about 357 MiB free, GPU 1 with 21 GB free, and GPU 2 with a mere 1.4 GB free. Now, you decide to load a much smaller model, like a 3B parameter one (llama3.2:3b). Intuitively, this tiny model should fit somewhere, maybe even on the GPUs with more space, or at least trigger Ollama's logic to unload an older model to make room, especially since you've configured OLLAMA_MAX_LOADED_MODELS=3, meaning you can load one more.
Instead of a smooth load, Ollama falters. It attempts to load the small model, and here's where the magic (or rather, the bug) happens. Due to an integer underflow, Ollama miscalculates the available memory on GPU 0. Instead of seeing the actual ~357 MiB free, it somehow believes there are approximately 17 exabytes of free memory available on that GPU! This astronomically incorrect figure leads Ollama to assign all layers of the small model to GPU 0, completely ignoring the other GPUs that actually have space. When the CUDA driver tries to allocate the requested memory on GPU 0, it fails because, in reality, there's only a tiny sliver of VRAM left. This mismatch between Ollama's perceived free memory and the actual free memory is the direct cause of the "unable to allocate CUDA buffer" error, even though the error message itself doesn't explicitly point to the calculation bug.
This behavior is particularly frustrating because it's the opposite of what you might expect. Typically, you'd worry about running out of memory when loading large models. Here, loading a small model after large ones exposes the flaw. The bug is triggered precisely because the small model appears to fit within the OLLAMA_MAX_LOADED_MODELS limit, preventing Ollama's auto-unload mechanism from kicking in prematurely. The faulty memory calculation then proceeds, leading to the allocation failure.
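To make that interplay concrete, here is a deliberately simplified Go sketch. None of the names (canLoadWithoutEviction, loadedModels, freeVRAM) come from Ollama's codebase, and the sizes are rough guesses; the only point is to show how a wrapped-around free-memory estimate defeats both the "does it fit?" check and the eviction logic in one stroke:

package main

import (
    "fmt"
    "math"
)

const maxLoadedModels = 3 // mirrors OLLAMA_MAX_LOADED_MODELS=3 from the scenario

// canLoadWithoutEviction is a made-up stand-in for the scheduler's decision:
// load in place when the model count is under the limit and the estimated
// free VRAM covers the new model.
func canLoadWithoutEviction(loadedModels int, freeVRAM, modelSize uint64) bool {
    return loadedModels < maxLoadedModels && freeVRAM >= modelSize
}

func main() {
    var modelSize uint64 = 3 << 30 // pretend the small 3B model needs ~3 GiB (a rough guess)

    actualFree := uint64(357) << 20                   // what GPU 0 really has left: ~357 MiB
    wrappedFree := uint64(math.MaxUint64) - (2 << 30) // what the underflow reports: just shy of 2^64 bytes

    fmt.Println(canLoadWithoutEviction(2, actualFree, modelSize))  // false: consider other GPUs or evict
    fmt.Println(canLoadWithoutEviction(2, wrappedFree, modelSize)) // true: everything "fits" on GPU 0
}

With the real 357 MiB figure the check fails, so Ollama would look at the other GPUs or evict an older model; with the wrapped value, GPU 0 looks practically infinite and ends up with every layer of the small model.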
Diving into the Debug Logs: Where the Bug Hides
When faced with puzzling errors, enabling debug logs is often your best friend. In this case, with OLLAMA_DEBUG=1 enabled, the culprit becomes glaringly obvious in the log output. You'll see lines that look something like this:
time=2025-12-16T07:36:37.183Z level=INFO msg="updated VRAM" gpu=GPU-5320d871... "available=357.9 MiB"
time=2025-12-16T07:36:37.183Z level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5320d871... "available layer vram"="17179869183.3 GiB"
time=2025-12-16T07:36:37.183Z level=INFO msg=load request="GPULayers:33[ID:GPU-5320d871... Layers:33(0..32)]"
Let's break down what's happening here. The first line confirms the reality: GPU-5320d871... (your GPU) has 357.9 MiB of VRAM available. This is a small amount, but it's the actual state. Then comes the shocking part:
"available layer vram"="17179869183.3 GiB"
This value is mind-bogglingly large – roughly 17 exabytes (an exabyte is a billion gigabytes). Tellingly, 2^64 bytes works out to 17,179,869,184 GiB, so the logged figure sits just shy of the maximum value a 64-bit unsigned integer can hold: the classic signature of unsigned wraparound. This is where the integer underflow makes its dramatic appearance. Ollama, in its attempt to calculate how much memory is truly available for model layers after accounting for overheads and other reservations, performs a calculation that goes wrong when the initial free memory is very low.
Following this, the log shows load request="GPULayers:33[ID:GPU-5320d871... Layers:33(0..32)]". This confirms that Ollama decided to attempt loading all 33 layers of the small model onto this specific GPU (GPU 0), because its faulty calculation suggested there was ample space. The actual CUDA allocation then fails, triggering the user-facing error.
This debug output is crucial evidence. It directly links the unable to allocate CUDA buffer error not to a genuine lack of VRAM on all GPUs, but to a flawed internal calculation that misdirects the loading process to an already near-full GPU. The bug essentially blinds Ollama to the actual memory landscape, leading it to make an impossible allocation request.
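If you want to see how unsigned arithmetic produces a number of exactly that magnitude, the following standalone Go snippet reproduces the effect with made-up but representative figures (about 357 MiB genuinely free versus a roughly 2 GiB reservation; the real reservation depends on the backoff, overhead, and graph-memory terms discussed below):

package main

import "fmt"

func main() {
    // Representative numbers, not taken from Ollama's code: a GPU with ~357 MiB
    // genuinely free, and a reservation (overhead + graph memory) of ~2 GiB.
    free := uint64(357) << 20   // ~357 MiB
    reserved := uint64(2) << 30 // ~2 GiB

    // Unsigned subtraction never goes negative; it wraps around modulo 2^64.
    available := free - reserved

    fmt.Printf("available bytes: %d\n", available)
    fmt.Printf("available GiB:   %.1f\n", float64(available)/float64(1<<30))
    // Prints roughly 17179869182.3 GiB, the same absurd order of magnitude
    // as the "available layer vram" value in the debug log.
}

Because a uint64 cannot go negative, the subtraction wraps modulo 2^64 and lands in the same tens-of-billions-of-GiB range reported in the log.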
The Root Cause: An Integer Underflow Explained
The integer underflow bug occurs within Ollama's GPU memory calculation logic, specifically in the buildLayout function found in llm/server.go. This function is responsible for determining how model layers can be distributed across available GPUs, taking into account various memory constraints and overheads.
The problematic code snippet looks something like this:
reserved := uint64(float32(gl[i].FreeMemory)*backoff) + gl[i].MinimumMemory() + envconfig.GpuOverhead() + memory.GPUs[j].Graph
if gl[i].FreeMemory > reserved {
    gl[i].FreeMemory -= reserved
} else {
    gl[i].FreeMemory = 0
}
Here's where the trouble starts: The calculation of reserved involves gl[i].FreeMemory, which represents the memory currently available on a specific GPU. When this function is called multiple times for the same GPU (which can happen in certain internal loops, especially if the memory.GPUs structure isn't perfectly de-duplicated or if the logic iterates over potential assignments for a single device), and the GPU is already nearly full, the calculation can lead to underflow. Let's trace a hypothetical scenario:
- First Iteration for GPU 0: Suppose gl[i].FreeMemory is 357.9 MiB. The reserved amount (calculated from various factors like backoff, MinimumMemory, GpuOverhead, and Graph memory) might be, say, 457 MiB plus some gigabytes for Graph memory. Since 357.9 MiB is less than reserved, the else block is executed, and gl[i].FreeMemory is correctly set to 0.
- Second Iteration for the same GPU 0: Now gl[i].FreeMemory is 0, and the code calculates reserved again. The critical issue arises not from directly subtracting from 0 inside the if-else structure (that guard prevents a negative result there), but from how intermediate or final values are handled elsewhere, especially when combined with the large memory.GPUs[j].Graph value. When gl[i].FreeMemory is 0 (or very small) and is then involved in a calculation that conceptually subtracts a large value without the same guard, the resulting value stored in gl[i].FreeMemory wraps around. This is where the uint64 type plays its part: subtracting a large number from a small number (or zero) using unsigned integers cannot produce a negative result, so it wraps modulo 2^64 to an enormous positive value, which is exactly the ~17-billion-GiB "available layer vram" figure that shows up in the debug log (a wrap-proof alternative is sketched right after this list).
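One common defensive pattern for this class of bug, shown here purely as an illustration and not as the fix the Ollama project ultimately adopted, is to funnel every budget adjustment through a saturating subtraction so the estimate can bottom out at zero but never wrap:

package main

import "fmt"

// saturatingSub returns a - b, clamped at zero instead of wrapping around.
// This is a hypothetical helper for illustration, not code from the Ollama repo.
func saturatingSub(a, b uint64) uint64 {
    if a < b {
        return 0
    }
    return a - b
}

func main() {
    free := uint64(357) << 20   // ~357 MiB genuinely free
    reserved := uint64(2) << 30 // ~2 GiB that needs to be set aside

    fmt.Println(free - reserved)               // wraps to ~1.8e19: the bug in miniature
    fmt.Println(saturatingSub(free, reserved)) // 0: the intent of the original guard
}

If every place that shrinks the per-GPU estimate went through a guard like this (the buildLayout snippet guards one subtraction, but a single unguarded path elsewhere is enough to poison the value), the scheduler would see 0 bytes free on GPU 0 and fall back to the GPUs that genuinely have room, or evict an older model as intended.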