id: local-llms

Local LLMs#

Running LLMs locally means no API costs, full privacy (data never leaves your machine), and no rate limits. The trade-off is you need a decent GPU (or patience with CPU).

Hardware Requirements#

Model Size	RAM Needed	GPU VRAM	Speed (CPU)
1B–3B	4 GB	4 GB	Fast
7B–8B	8 GB	6–8 GB	Medium
13B	16 GB	10–12 GB	Slow
70B	64 GB	40 GB	Very slow

Most IIT Madras students: run 7B models on CPU (slow but works) or Google Colab T4 GPU.

Ollama (Recommended for Most Use Cases)#

Ollama is the easiest way to run local LLMs. It manages downloads, serves an OpenAI-compatible API, and handles model loading.

Install#

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: Download from https://ollama.com/download

Run Models#

# Download and run Llama 3.2 (3B, fast, fits in 4GB RAM)
ollama run llama3.2

# Gemma 3 (Google's model, great quality)
ollama run gemma3

# Qwen 2.5 (great for code)
ollama run qwen2.5-coder

# Mistral (good all-rounder)
ollama run mistral

# Chat interactively
>>> Tell me about FastAPI

Ollama API (OpenAI-Compatible)#

Ollama serves a REST API on port 11434 that’s compatible with the OpenAI SDK:

# Direct HTTP
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "What is Docker?", "stream": false}'

# List installed models
curl http://localhost:11434/api/tags

# Use with the official OpenAI SDK — just change the base_url
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in 3 sentences."},
    ],
)
print(response.choices[0].message.content)

Integrate with FastAPI#

from fastapi import FastAPI
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()

# Point to local Ollama
llm_client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2"

@app.post("/chat")
async def chat(req: ChatRequest):
    response = await llm_client.chat.completions.create(
        model=req.model,
        messages=[{"role": "user", "content": req.message}],
    )
    return {"reply": response.choices[0].message.content}

LM Studio (GUI Approach)#

LM Studio provides a desktop app for downloading, managing, and chatting with local models. Ideal if you prefer a GUI over CLI.

Download from lmstudio.ai
Search and download a model (e.g., Qwen2.5-7B-Instruct-GGUF)
Load the model
Enable local server (Settings → Local Server → Start Server)
API available at http://localhost:1234/v1

# Same code, different base_url
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

GGUF Format#

GGUF is the file format used for quantized local models. Quantization reduces model size:

Quantization	Size (7B model)	Quality	RAM
Q2_K	~2.7 GB	Low	4 GB
Q4_K_M	~4.4 GB	Good	6 GB
Q5_K_M	~5.1 GB	Better	8 GB
Q8_0	~7.7 GB	Best	10 GB
F16	~14 GB	Full	16 GB

For most use cases, Q4_K_M is the sweet spot: 4-bit quantization, good quality, fits in 6GB VRAM.

Find GGUF models on HuggingFace:

Search: bartowski/Llama-3.2-3B-Instruct-GGUF
Download the Q4_K_M version

llama.cpp (Maximum Control)#

llama.cpp is the C++ engine that Ollama uses under the hood. Use it directly when you need maximum performance or custom setups:

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download a model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  --include "Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
  --local-dir models/

# Run inference server
./build/bin/llama-server \
  --model models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --ctx-size 4096 \
  --port 8080

Ollama Quick Reference#

ollama list              # list installed models
ollama pull gemma3       # download model
ollama rm llama3.2       # remove model
ollama show llama3.2     # show model info
ollama ps                # list running models
ollama stop llama3.2     # unload from memory

# Run with options
ollama run llama3.2 --verbose   # show token stats

Video Reference#

Summary#

Tool	Best For
Ollama	Easy CLI, OpenAI-compatible API, most popular
LM Studio	GUI, easy model management, same API
llama.cpp	Maximum performance, custom builds
GGUF Q4_K_M	Best quality/size/speed balance