Vision Models for Scraping#
Sometimes data is in an image, a chart, or a highly obfuscated UI where DOM parsing fails.
Visual Extraction#
Pass a screenshot to a Vision-Language Model (VLM) and ask it to extract the data into JSON.
Open Weights Models#
- MoonDream: Tiny (1.6B parameters), runs on CPU, great for simple OCR and answering basic questions about images.
- LLaVA: Excellent open-source vision model.
- Gemma 2 2B IT / Gemma4V: Google’s lightweight multimodal models.
Workflow#
- Playwright takes a screenshot of the target element.
- Send image to VLM with prompt:
Extract the pricing tiers from this image as JSON. - Parse the output.