Open-source PDF reading tools — is OCR the right tool?

Seen in 1 project by 1 person

About

Q: Are there open-source tools for reading PDFs? Maybe OCR isn't the best tool.
Finding: Yes. OCR is the wrong default for fillable or digitally generated PDFs. Use a tiered strategy (most PDFs never need OCR):
1. AcroForm first. If the PDF has a form layer, read each field's name, type, and rectangle directly, with no inference: pdf-lib getForm().getFields() (already a humbits dependency) or pdf.js page.getAnnotations(). Exact field names map straight to the data schema.
2. Digital text, no form fields: use a text-and-layout heuristic (pdf.js getTextContent() or pdfplumber character boxes). This is essentially today's detectFields path.
3. Image-only scans: use OCR or vision only here, via OCRmyPDF/Tesseract, PaddleOCR, or the current Claude vision.
Today @humbits/ocr defaults to Claude vision for PDFs, which spends cost and latency on documents that are already machine-readable. Proposed next step: add an AcroForm-first reader to @humbits/ocr and fall through to the text heuristic, then to vision, only when the earlier tier finds nothing. The full tool list (by case, with licenses) is in the comment.
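The tier choice can be sketched as a small pure function. This is an illustration only, not the @humbits/ocr implementation: the names Tier, PdfSignals, chooseTier, and MIN_CHARS_PER_PAGE are hypothetical, and the 20-characters-per-page threshold is an assumed placeholder that would need tuning on real documents.

```typescript
// Hypothetical names throughout; nothing here comes from @humbits/ocr.
type Tier = "acroform" | "text-layout" | "ocr-vision";

interface PdfSignals {
  formFieldCount: number; // number of AcroForm fields found in the document
  textCharCount: number;  // total characters of extractable digital text
  pageCount: number;      // number of pages in the document
}

// Assumed threshold: below this much text per page, treat the PDF as a scan.
const MIN_CHARS_PER_PAGE = 20;

// Pick the cheapest tier that applies: form layer, then digital text, then OCR.
function chooseTier(signals: PdfSignals): Tier {
  if (signals.formFieldCount > 0) {
    return "acroform";
  }
  // Guard against a zero page count so the division stays finite.
  const pages = Math.max(1, signals.pageCount);
  if (signals.textCharCount / pages >= MIN_CHARS_PER_PAGE) {
    return "text-layout";
  }
  return "ocr-vision";
}
```

In a real integration, formFieldCount could plausibly come from the length of pdf-lib's getForm().getFields() result, and textCharCount from summing the string lengths returned by pdf.js getTextContent() across pages. Keeping the decision separate from the readers makes the fall-through order easy to test without loading any PDF.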

Listed in

todo — humbits@yancou30

