Local LLMs are incredibly powerful tools, but smaller models can be hard to put to good use: with fewer parameters they simply know less, although a search engine exposed over MCP can shore up their knowledge. As it turns out, though, you can host a 120B-parameter model on a GPU with just 24GB of VRAM, paired with 64GB of regular system RAM to hold the weights that don't fit on the GPU, and it's fast enough to be usable for voice assistants, smart home automation, and more. For reference, the largest dense model that's practical on 24GB of VRAM is typically a quantized 27-billion-parameter model, once you account for the memory needed to hold the context window.
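
To get a feel for why roughly 27B is the ceiling for a dense model in 24GB, here is a back-of-the-envelope estimate in Python. The layer count, head counts, context length, and quantization width below are illustrative assumptions (loosely in line with popular 27B-class models), not the specs of any particular checkpoint.

```python
# Rough VRAM estimate for a quantized dense model plus its KV cache.
# All architecture numbers below are illustrative assumptions, not the
# specs of a specific model -- adjust them for the checkpoint you run.

GIB = 1024**3

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights themselves."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: keys + values for every layer, KV head, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed ~27B dense model at ~4.5 bits/weight (a typical 4-bit quant with overhead).
weights = weight_bytes(27e9, 4.5)

# Assumed architecture: 60 layers, 16 KV heads of dimension 128, 8k context, fp16 cache.
kv = kv_cache_bytes(n_layers=60, n_kv_heads=16, head_dim=128, ctx_len=8192)

print(f"weights : {weights / GIB:5.1f} GiB")
print(f"KV cache: {kv / GIB:5.1f} GiB")
print(f"total   : {(weights + kv) / GIB:5.1f} GiB  (vs. 24 GiB of VRAM)")
```

With those assumed numbers, the weights come to roughly 14 GiB and an 8k-token fp16 KV cache adds roughly 4 GiB more, leaving only a few GiB of headroom for compute buffers on a 24GB card. Push the parameter count or context much higher and a dense model no longer fits.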