Document Parsing#
Real-world data is trapped in PDFs, Word docs, and messy HTML.
Unstructured.io#
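A unified API for parsing PDFs, PPTX, HTML, and other formats into clean JSON: a list of typed elements (titles, narrative text, list items, tables) with their text and metadata.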
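As a rough illustration of working with that JSON, the sketch below groups element text by type. It assumes each element is a dict with `"type"` and `"text"` keys, as in Unstructured's JSON output. `partition_to_dicts`, `text_by_type`, and `SAMPLE` are hypothetical names introduced here, and the sample data is made up. The parsing helper assumes the `unstructured` package is installed and is not exercised by the sample run.

```python
import json
from collections import defaultdict


def partition_to_dicts(path):
    """Parse a file with Unstructured (assumes `pip install unstructured`)."""
    # Imported lazily so text_by_type() works without the package installed.
    from unstructured.partition.auto import partition

    return [element.to_dict() for element in partition(filename=path)]


def text_by_type(elements):
    """Group non-empty element text by its "type" field, in document order."""
    grouped = defaultdict(list)
    for element in elements:
        text = element.get("text", "").strip()
        if text:
            grouped[element.get("type", "Unknown")].append(text)
    return dict(grouped)


# Hypothetical sample shaped like Unstructured's JSON output.
SAMPLE = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew in Q3."},
    {"type": "NarrativeText", "text": "  "},  # whitespace-only, dropped
    {"type": "ListItem", "text": "Hire two engineers"},
]

if __name__ == "__main__":
    print(json.dumps(text_by_type(SAMPLE), indent=2))
```

Grouping by type makes it easy to, for example, index only narrative text while keeping titles as section labels.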
LlamaParse#
Specifically designed to parse complex PDFs with tables and charts into Markdown for LLM ingestion.
Surya OCR#
An open-source multilingual OCR toolkit whose published benchmarks report higher accuracy than Tesseract. Results vary by language and scan quality, so test it on your own documents.
HTML to Markdown#
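Convert HTML to Markdown before feeding it to an LLM. Markdown keeps the headings, lists, and links but drops tags, attributes, scripts, and styling, so the same content costs fewer tokens and carries less noise.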
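A minimal sketch of the idea, using only Python's standard-library `html.parser`. It handles headings, paragraphs, links, bold, italics, and list items, and it skips `script`, `style`, `nav`, and `footer` content. `MarkdownExtractor` and `html_to_markdown` are hypothetical names introduced here, and dropping `nav` and `footer` is an assumption about what counts as noise. A dedicated conversion library will cover far more cases (tables, nested lists, code blocks).

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MarkdownExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "footer"}  # content dropped entirely
    BLOCK = {"p", "div", "ul", "ol", "br", "section", "article"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.parts.append("[")
        elif tag in self.BLOCK:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            if self.href:
                self.parts.append(f"]({self.href})")
            self.href = None
        elif tag in HEADINGS or tag in self.BLOCK:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs but keep single spaces between words.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line
    return text.strip()


SAMPLE = """
<html><head><style>p { color: red; }</style></head>
<body>
<nav><a href="/">Home</a></nav>
<h1>Release notes</h1>
<p>Read <a href="https://example.com/docs">the docs</a> for <strong>details</strong>.</p>
<ul><li>Faster parsing</li><li>Fewer tokens</li></ul>
<script>track();</script>
</body></html>
"""

if __name__ == "__main__":
    print(html_to_markdown(SAMPLE))
```

On the sample page the Markdown output is considerably shorter than the HTML input, and the navigation link, stylesheet, and tracking script are gone.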