id: local-llms
Local LLMs#
Running LLMs locally means no API costs, full privacy (data never leaves your machine), and no rate limits. The trade-off is that you need a decent GPU, or patience when running on CPU.
Hardware Requirements#
| Model Size | RAM Needed | GPU VRAM | Speed (CPU) |
|---|---|---|---|
| 1B–3B | 4 GB | 4 GB | Fast |
| 7B–8B | 8 GB | 6–8 GB | Medium |
| 13B | 16 GB | 10–12 GB | Slow |
| 70B | 64 GB | 40 GB | Very slow |
For most IIT Madras students, the practical options are running a 7B model on CPU (slow, but it works) or using a Google Colab T4 GPU.
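The table can be read as a simple lookup from available RAM to the largest model class worth trying. A minimal sketch, assuming the "RAM Needed" column is a hard minimum; the function name `largest_model_for_ram` is made up for illustration:

```python
from typing import Optional

# (minimum RAM in GB, model size class), copied from the "RAM Needed" column above
RAM_TIERS = [
    (64, "70B"),
    (16, "13B"),
    (8, "7B-8B"),
    (4, "1B-3B"),
]


def largest_model_for_ram(ram_gb: float) -> Optional[str]:
    """Return the largest model class from the table that fits in ram_gb, or None."""
    for min_ram_gb, size_class in RAM_TIERS:  # tiers are ordered largest first
        if ram_gb >= min_ram_gb:
            return size_class
    return None  # less than 4 GB: nothing in the table fits
```

For example, `largest_model_for_ram(8)` returns `"7B-8B"`, and a 32 GB machine still lands in the `"13B"` class because 70B needs 64 GB.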
Ollama (Recommended for Most Use Cases)#
Ollama is the easiest way to run local LLMs. It manages downloads, serves an OpenAI-compatible API, and handles model loading.
Install#
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from https://ollama.com/download
Run Models#
# Download and run Llama 3.2 (3B, fast, fits in 4GB RAM)
ollama run llama3.2
# Gemma 3 (Google's model, great quality)
ollama run gemma3
# Qwen 2.5 Coder (great for code)
ollama run qwen2.5-coder
# Mistral (good all-rounder)
ollama run mistral
# Chat interactively
>>> Tell me about FastAPI
Ollama API (OpenAI-Compatible)#
Ollama serves a REST API on port 11434. It exposes a native API under /api and an OpenAI-compatible API under /v1 that works with the OpenAI SDK:
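The native endpoint can also be called from Python with only the standard library. This is a minimal sketch, assuming Ollama is running on the default port and `llama3.2` is installed. The helper names (`build_generate_request`, `join_stream`, `generate`) are made up for illustration. With `"stream": true`, the server replies with one JSON object per line, each carrying a `response` fragment and a `done` flag.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def build_generate_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for the native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines) -> str:
    """Concatenate the "response" fragments of a streamed, newline-delimited JSON reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(parts)


def generate(model: str, prompt: str) -> str:
    """Send a streaming request to a running Ollama server and return the full text."""
    request = build_generate_request(model, prompt, stream=True)
    with urllib.request.urlopen(request) as response:  # iterates line by line
        return join_stream(response)
```

With the server running, `generate("llama3.2", "What is Docker?")` returns the complete answer as one string. To print tokens as they arrive, loop over the response lines directly instead of joining them.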
# Direct HTTP
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "What is Docker?", "stream": false}'
# List installed models
curl http://localhost:11434/api/tags
# Use with the official OpenAI SDK: just change the base_url
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored by Ollama
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in 3 sentences."},
    ],
)
print(response.choices[0].message.content)
Integrate with FastAPI#
from fastapi import FastAPI
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()

# Point to local Ollama
llm_client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)


class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2"


@app.post("/chat")
async def chat(req: ChatRequest):
    # Forward the user's message to the local model and return its reply
    response = await llm_client.chat.completions.create(
        model=req.model,
        messages=[{"role": "user", "content": req.message}],
    )
    return {"reply": response.choices[0].message.content}
LM Studio (GUI Approach)#
LM Studio provides a desktop app for downloading, managing, and chatting with local models. Ideal if you prefer a GUI over CLI.
- Download from lmstudio.ai
- Search and download a model (e.g., Qwen2.5-7B-Instruct-GGUF)
- Load the model
- Enable local server (Settings → Local Server → Start Server)
- API available at http://localhost:1234/v1
# Same code, different base_url
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)
GGUF Format#
GGUF is the file format used for quantized local models. Quantization stores the weights at lower precision, which shrinks both the file and the memory needed to run it:
| Quantization | Size (7B model) | Quality | RAM |
|---|---|---|---|
| Q2_K | ~2.7 GB | Low | 4 GB |
| Q4_K_M | ~4.4 GB | Good | 6 GB |
| Q5_K_M | ~5.1 GB | Better | 8 GB |
| Q8_0 | ~7.7 GB | Best | 10 GB |
| F16 | ~14 GB | Full | 16 GB |
For most use cases, Q4_K_M is the sweet spot: 4-bit quantization, good quality, and a 7B model fits in about 6 GB of RAM or VRAM.
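The sizes in the table follow roughly from parameter count times bits per weight. This is a minimal sketch, assuming a model with exactly 7 billion parameters. The quantized bits-per-weight values are back-calculated from the table above and are approximations, not official figures. The function names are made up for illustration.

```python
# Approximate effective bits per weight, back-calculated from the table
# (size_gb * 8 / 7). Only F16 is exact; the others are rough assumptions.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.1,
    "Q4_K_M": 5.0,
    "Q5_K_M": 5.8,
    "Q8_0": 8.8,
    "F16": 16.0,
}


def file_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate model file size in GB: parameters * bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Invert the estimate: the effective bits per weight implied by a file size."""
    return size_gb * 8 / params_billion
```

`file_size_gb(7, 16)` gives 14.0, matching the F16 row. The same function gives a quick estimate for other model sizes, such as a 3B model at Q4_K_M. Running the model needs somewhat more memory than the file size, which is why the RAM column is higher.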
Find GGUF models on HuggingFace:
- Search: bartowski/Llama-3.2-3B-Instruct-GGUF
- Download the Q4_K_M version
llama.cpp (Maximum Control)#
llama.cpp is the C++ engine that Ollama uses under the hood. Use it directly when you need maximum performance or custom setups:
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download a model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  --include "Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  --local-dir models/
# Run inference server
./build/bin/llama-server \
  --model models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --ctx-size 4096 \
  --port 8080
Ollama Quick Reference#
ollama list # list installed models
ollama pull gemma3 # download model
ollama rm llama3.2 # remove model
ollama show llama3.2 # show model info
ollama ps # list running models
ollama stop llama3.2 # unload from memory
# Run with options
ollama run llama3.2 --verbose # show token stats
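The information from `ollama list` is also available over HTTP at /api/tags, which is handy in scripts. This is a minimal sketch, assuming the response is a JSON object with a `models` list whose entries carry `name` and `size` (in bytes). The helper names are made up for illustration.

```python
import json
import urllib.request


def summarize_models(tags: dict) -> list:
    """Turn an /api/tags response into (model name, size in GB) pairs, largest first."""
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def list_installed(base_url: str = "http://localhost:11434") -> list:
    """Fetch and summarize the installed models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as response:
        return summarize_models(json.load(response))
```

Calling `list_installed()` requires the Ollama server to be running. `summarize_models` can be tested on its own with a saved response.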
Summary#
| Tool | Best For |
|---|---|
| Ollama | Easy CLI, OpenAI-compatible API, most popular |
| LM Studio | GUI, easy model management, same API |
| llama.cpp | Maximum performance, custom builds |
| GGUF Q4_K_M | Best quality/size/speed balance |
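Because all three servers accept OpenAI-style chat requests, switching between them is mostly a matter of changing the base URL. This is a minimal sketch using only the standard library. The ports are the ones used on this page, and the llama.cpp entry assumes `llama-server` was started with `--port 8080`. The names `BACKENDS`, `build_chat_request`, and `chat` are made up for illustration.

```python
import json
import urllib.request

# Base URLs as used on this page; change them if your servers run elsewhere
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "lm-studio": "http://localhost:1234/v1",
    "llama.cpp": "http://localhost:8080/v1",
}


def build_chat_request(backend: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request for one backend."""
    body = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(backend: str, model: str, user_message: str) -> str:
    """Send the request to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(backend, model, user_message)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With Ollama running, `chat("ollama", "llama3.2", "Explain Docker in 3 sentences.")` returns the reply text. Pointing the same call at `"lm-studio"` only requires a model name that the server has loaded.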
