todo

humbits to-do — tasks, research questions, and follow-ups.

Open-source PDF reading tools — is OCR the right tool?
Q: Are there open-source tools for reading PDFs? Maybe OCR isn't the best tool.

Finding: Correct. OCR is the wrong default for fillable or digitally generated PDFs. Use a tiered strategy in which OCR runs only when the cheaper tiers find nothing (most PDFs never need OCR):
1. AcroForm-first read. If the PDF has a form layer, read each field's name, type, and rect directly, with no inference: pdf-lib getForm().getFields() (already a humbits dep; rects come from each field's widget annotations) or pdf.js page.getAnnotations(). Exact field names map straight to the data schema.
2. Digital text, no form fields → text+layout heuristic (pdf.js getTextContent() / pdfplumber char boxes). This is essentially today's detectFields path.
3. Image-only scans → OCR/vision ONLY here: OCRmyPDF/Tesseract, PaddleOCR, or the current Claude vision.
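
The tier 1 step "exact field names map straight to the data schema" can be sketched as a plain lookup. This is a minimal sketch, not humbits code: the FormField shape, the mapFieldsToSchema helper, and the field names and schema keys in the usage lines are all hypothetical. A real reader would fill FormField from whatever the PDF library returns.

```typescript
// One field as read from the PDF form layer. The shape is hypothetical;
// a real reader would fill it from the PDF library's field objects.
interface FormField {
  name: string; // exact AcroForm field name
  type: "text" | "checkbox" | "choice" | "other";
  rect: [number, number, number, number]; // coordinate order is an assumption
}

interface MappingResult {
  mapped: Record<string, FormField>; // schema key -> source field
  unmapped: string[]; // field names with no schema entry
}

// Looks up each exact field name in the supplied table. No fuzzy matching:
// names that are not in the table are reported, not guessed.
function mapFieldsToSchema(
  fields: FormField[],
  nameToSchemaKey: Map<string, string>,
): MappingResult {
  const result: MappingResult = { mapped: {}, unmapped: [] };
  for (const field of fields) {
    const key = nameToSchemaKey.get(field.name);
    if (key === undefined) {
      result.unmapped.push(field.name);
    } else {
      result.mapped[key] = field;
    }
  }
  return result;
}

// Usage with made-up field names and schema keys.
const demoMapping = new Map([
  ["Applicant_FirstName", "firstName"],
  ["Applicant_DOB", "dateOfBirth"],
]);
const demoResult = mapFieldsToSchema(
  [
    { name: "Applicant_FirstName", type: "text", rect: [72, 700, 200, 18] },
    { name: "Signature_2", type: "other", rect: [72, 100, 200, 40] },
  ],
  demoMapping,
);
console.log(Object.keys(demoResult.mapped), demoResult.unmapped);
```

Reporting unmapped names instead of guessing keeps tier 1 inference-free; anything left over can be reviewed by hand or passed to the text heuristic.
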
Today @humbits/ocr defaults to Claude vision for PDFs, which spends cost and latency on machine-readable documents. Proposed next step: add an AcroForm-first reader to @humbits/ocr, and fall through to the text heuristic and then to vision only when the earlier tier finds nothing. The full tool list (by case, with licenses) is in the comment.
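
The fall-through order could be wired up roughly as follows. This is a sketch under assumptions, not the @humbits/ocr API: Tier, TierReader, and readWithFallthrough are hypothetical names, and the readers are stubs standing in for the AcroForm reader, the detectFields-style heuristic, and the vision call.

```typescript
type Tier = "acroform" | "text-heuristic" | "vision";

interface DetectedField {
  name: string;
  page: number;
}

// A reader resolves to null (or an empty list) when its tier does not apply,
// e.g. the AcroForm reader on a PDF without a form layer.
type TierReader = (pdf: Uint8Array) => Promise<DetectedField[] | null>;

interface ReadResult {
  tier: Tier | null; // null means no tier produced any fields
  fields: DetectedField[];
}

// Tries the readers in the order given and returns the first non-empty result.
async function readWithFallthrough(
  pdf: Uint8Array,
  readers: ReadonlyArray<{ tier: Tier; read: TierReader }>,
): Promise<ReadResult> {
  for (const { tier, read } of readers) {
    const fields = await read(pdf);
    if (fields !== null && fields.length > 0) {
      return { tier, fields }; // stop here; later, costlier tiers never run
    }
  }
  return { tier: null, fields: [] };
}

// Usage with stub readers: the form layer is missing, the text heuristic
// finds one field, so the vision reader is never called.
async function demo(): Promise<void> {
  const result = await readWithFallthrough(new Uint8Array(), [
    { tier: "acroform", read: async () => null },
    { tier: "text-heuristic", read: async () => [{ name: "total", page: 1 }] },
    { tier: "vision", read: async () => { throw new Error("vision should not run"); } },
  ]);
  console.log(result.tier); // logs "text-heuristic"
}
demo();
```

Returning the tier alongside the fields makes it easy to log how often each tier is hit, which would show how much vision traffic the AcroForm-first reader actually removes.
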