todo
humbits to-do: tasks, research questions, and follow-ups.
Q: Are there open-source tools for reading PDFs? Maybe OCR isn't the best tool.
Finding: Right — OCR is the WRONG default for fillable/digital PDFs. Use a tiered strategy (most PDFs never need OCR):
- AcroForm-first read. If the PDF has a form layer, read field name/type/rect directly, with no inference: pdf-lib getForm().getFields() (already a humbits dep) or pdf.js page.getAnnotations(). Exact field names → map straight to the data schema.
- Digital text, no form fields → text+layout heuristic (pdf.js getTextContent() / pdfplumber char boxes). This is essentially today's detectFields path.
- Image-only scans → OCR/vision ONLY here: OCRmyPDF/Tesseract, PaddleOCR, or the current Claude vision.
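The tier choice above can be sketched as a small pure function. This is only an illustration: PdfProbe, chooseTier, and the MIN_TEXT_CHARS threshold are hypothetical names and values, not part of @humbits/ocr, and the probe counts are assumed to come from a cheap pre-pass (for example pdf-lib for form fields, pdf.js for the text layer).

```typescript
// Hypothetical probe result; none of these names exist in @humbits/ocr today.
interface PdfProbe {
  formFieldCount: number; // AcroForm fields found in the document
  textCharCount: number;  // characters found in the embedded text layer
}

type Tier = "acroform" | "text-heuristic" | "ocr";

// Assumed cutoff: below this many characters, treat the PDF as an image-only scan.
// The right value would need tuning on real documents.
const MIN_TEXT_CHARS = 20;

// Pick the cheapest tier that can handle the document.
function chooseTier(probe: PdfProbe): Tier {
  if (probe.formFieldCount > 0) return "acroform";                  // tier 1: read fields directly
  if (probe.textCharCount >= MIN_TEXT_CHARS) return "text-heuristic"; // tier 2: text + layout
  return "ocr";                                                     // tier 3: OCR/vision, last resort
}
```

Keeping the decision separate from the readers means the fall-through order can be unit-tested without touching any PDF.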
Today @humbits/ocr defaults to Claude vision for PDFs, which burns cost+latency on machine-readable docs. Proposed next step: add an AcroForm-first reader to @humbits/ocr and only fall through to text-heuristic → vision. Full tool list (by case, with licenses) in the comment.
- pdf-lib PDFForm API (AcroForm read, JS — already a dep)
- pdf.js — getAnnotations / getTextContent (JS)
- pdfplumber — per-char boxes for layout heuristics (Py)
- IBM Docling — ML layout/table extraction (Py)
- OCRmyPDF — Tesseract text layer for scans (last resort)
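A rough sketch of what the proposed AcroForm-first reader might return. The interfaces below are minimal structural stand-ins for the parts of pdf-lib's form API this relies on (getFields(), getName(), acroField.getWidgets(), getRectangle()), as I understand that API; check them against the installed pdf-lib version. FieldInfo and readAcroFormFields are hypothetical names, and using the class name as the field type is an assumption that can break under minification.

```typescript
// Structural stand-ins for the pdf-lib objects used here (assumed shapes).
interface Rect { x: number; y: number; width: number; height: number; }
interface WidgetLike { getRectangle(): Rect; }
interface FieldLike {
  getName(): string;
  acroField: { getWidgets(): WidgetLike[] };
}
interface FormLike { getFields(): FieldLike[]; }

// Hypothetical output shape: one entry per form field.
interface FieldInfo {
  name: string;   // fully qualified field name, used to map onto the data schema
  type: string;   // class name of the field object, e.g. "PDFTextField"
  rects: Rect[];  // one rectangle per widget (a field can appear more than once)
}

// Read name/type/rect for every field directly from the form layer, with no inference.
function readAcroFormFields(form: FormLike): FieldInfo[] {
  return form.getFields().map((field) => ({
    name: field.getName(),
    type: field.constructor.name,
    rects: field.acroField.getWidgets().map((widget) => widget.getRectangle()),
  }));
}
```

With pdf-lib this would presumably be called as readAcroFormFields(doc.getForm()) after loading the document with PDFDocument.load(bytes). An empty result is the signal to fall through to the text heuristic, and only then to vision.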