# Local LLMs: Llamafile
You may have heard of Large Language Models (LLMs) like GPT-4, Claude, and Llama. Some of these models are available for free, but most are not.
An easy way to run LLMs locally is Mozilla’s Llamafile. It’s a single executable file that works on Windows, Mac, and Linux. No installation or configuration needed - just download and run.
Watch this Llamafile Tutorial (6 min):
Here’s how to get started:

- Download `Llama-3.2-1B-Instruct-Q6_K.llamafile` (1.33 GB).
- From the command prompt or terminal, run `Llama-3.2-1B-Instruct-Q6_K.llamafile`. (On macOS and Linux, first make the file executable with `chmod +x Llama-3.2-1B-Instruct-Q6_K.llamafile`, then run `./Llama-3.2-1B-Instruct-Q6_K.llamafile`.)
- Optional: for GPU acceleration, run `Llama-3.2-1B-Instruct-Q6_K.llamafile --n-gpu-layers 35`. Increase or decrease the number of layers based on your GPU VRAM.
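If you prefer to launch the server from a script, here is a minimal Python sketch of the steps above. It assumes the filename from this tutorial; the `llamafile_command` helper is illustrative, not part of any llamafile API.

```python
# Sketch: building the launch command for a llamafile server.
# llamafile_command is a hypothetical helper, not part of llamafile itself.
import subprocess

def llamafile_command(path, gpu_layers=None):
    """Build the argument list for starting a llamafile, optionally with GPU offload."""
    cmd = [path]
    if gpu_layers is not None:
        # --n-gpu-layers controls how many layers are offloaded to the GPU
        cmd += ["--n-gpu-layers", str(gpu_layers)]
    return cmd

cmd = llamafile_command("./Llama-3.2-1B-Instruct-Q6_K.llamafile", gpu_layers=35)
# subprocess.Popen(cmd) would start the server at http://127.0.0.1:8080/
```

Keeping the command construction separate from `subprocess.Popen` makes it easy to tweak flags (like the GPU layer count) without touching the launch logic.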
You might see a message like this:
```
██╗  ██╗ █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║  ██║ ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║  ██║ ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║  ██║ ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║ ╚═╝ ██║██║  ██║██║     ██║██║     ███████╗
╚══════╝╚══════╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝     ╚══════╝

software: llamafile 0.9.2
model:    Llama-3.2-1B-Instruct-Q6_K.gguf
compute:  13th Gen Intel Core i9-13900HX (alderlake)
server:   http://127.0.0.1:8080/

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
```

You can now chat with the model. Type /exit or press Ctrl+C to stop.
You can also visit http://127.0.0.1:8080/ in your browser to chat with the model.
Llamafile exposes an OpenAI-compatible API. Here’s how to use it in Python:
```python
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={"messages": [{"role": "user", "content": "Write a haiku about coding"}]},
)
print(response.json()["choices"][0]["message"]["content"])
```

Tools:
- OpenAI API compatibility: Use existing OpenAI code
- Creating your own llamafiles: Control output format
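Because the endpoint speaks the OpenAI chat-completions format, code that builds OpenAI-style request bodies works unchanged against the local server. Here is a minimal sketch; the `build_chat_payload` helper is hypothetical, not part of llamafile or any OpenAI library.

```python
# Sketch: constructing an OpenAI-style chat payload for the local endpoint.
# build_chat_payload is an illustrative helper, not a real library function.
LLAMAFILE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_payload(user_message, system_message=None, temperature=0.7):
    """Return a chat-completions request body in OpenAI message format."""
    messages = []
    if system_message:
        # A system message steers the assistant's overall behavior
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {"messages": messages, "temperature": temperature}

payload = build_chat_payload(
    "Summarize llamafile in one sentence.",
    system_message="You are a concise assistant.",
)
# Send with requests.post(LLAMAFILE_URL, json=payload) while the server is running.
```

Separating payload construction from the HTTP call means the same helper can later target any OpenAI-compatible endpoint by changing only the URL.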
