Vision Models for Scraping#

Sometimes data is in an image, a chart, or a highly obfuscated UI where DOM parsing fails.

Visual Extraction#

Pass a screenshot to a Vision-Language Model (VLM) and ask it to extract the data into JSON.

Open Weights Models#

MoonDream: Tiny (1.6B parameters), runs on CPU, great for simple OCR and answering basic questions about images.
LLaVA: Excellent open-source vision model.
Gemma 2 2B IT / Gemma4V: Google’s lightweight multimodal models.

Workflow#

Playwright takes a screenshot of the target element.
Send image to VLM with prompt: Extract the pricing tiers from this image as JSON.
Parse the output.