The Elephant in the Room: Running LLMs on Small Hardware
In this video, Christopher Brooks, Associate Professor of Information, discusses how to run large language models (LLMs) on commodity hardware.
Excerpt from Transcript
We all want to get our hands on the code and start using LLMs locally on our desktops, laptops, or even our phones, but before we do that we have to address the elephant in the room: how can we actually run these large models on commodity hardware? While this does depend on the architecture of the model, especially with some of the newer mixture-of-experts models like Mixtral, we can estimate the memory needed by taking the number of parameters in the model and multiplying it by four bytes, assuming each parameter is stored as a 32-bit floating-point value. This means that a 7-billion-parameter model needs 28 GB of memory just to be fully loaded for inference, and that doesn't include any additional overhead needed by your application.

What's even worse is that this computation is really best done on a graphics processing unit, so we're not talking about desktop or laptop RAM, we're talking about video card VRAM. At the time of filming, the top consumer video card for this is the Nvidia GeForce RTX 4090, and it only has 24 GB of memory, which isn't even enough to run the smallest of the Llama 2 models. But we can trim the model size down, and there are a couple of techniques we can use to do this.

The starting point for bringing our models down in size is to quantize the model weights. Instead of storing each weight as a 32-bit float, we could drop to, say, 16-bit floats and cut the VRAM requirement in half. The result is that the smallest Llama 2 model now fits on a consumer-grade video card for inference, and what we lose is some precision in the model. We can push this trick even further: why store each weight in two bytes if we could reduce it to a single byte, or maybe even half a byte, a nibble? The accuracy trade-off is an important one here, but in practice it turns out that the Llama models continue to perform well when we quantize them down to four, five, or six bits per parameter, and that's a huge savings compared to the 32 bits per parameter we started with.
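To make the transcript's arithmetic concrete, here is a short Python sketch of the rule of thumb it describes (number of parameters times bytes per parameter). The figures cover the weights only and exclude activations, the KV cache, and any other runtime overhead.

```python
# Back-of-the-envelope VRAM estimate for loading model weights,
# following the rule of thumb: parameters x bytes per parameter.

def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory needed just for the weights, in gigabytes."""
    bytes_per_parameter = bits_per_parameter / 8
    return num_parameters * bytes_per_parameter / 1e9

llama2_7b = 7e9  # smallest Llama 2 model, roughly 7 billion parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(llama2_7b, bits):.0f} GB")

# Expected output (weights only, no runtime overhead):
# 32-bit weights: ~28 GB
# 16-bit weights: ~14 GB
#  8-bit weights: ~7 GB
#  4-bit weights: ~4 GB
```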
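The transcript doesn't name any particular tooling, but one common way to apply this kind of weight quantization in practice is the Hugging Face transformers library together with bitsandbytes. A minimal sketch, assuming transformers, accelerate, and bitsandbytes are installed and that you have access to the gated meta-llama/Llama-2-7b-hf checkpoint, might look like this:

```python
# Minimal sketch: loading the smallest Llama 2 model with 4-bit quantized
# weights so it fits comfortably in consumer-grade VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit weights bring the ~28 GB full-precision footprint down to roughly
# 4 GB, at the cost of some precision in the weights.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # do the matrix math in 16-bit floats
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The elephant in the room is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Dropping `quantization_config` and passing `torch_dtype=torch.float16` instead would give the 16-bit variant mentioned in the transcript, at roughly twice the memory footprint of the 4-bit version.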