Open-source PDF reading tools — is OCR the right tool?
Q: Are there open-source tools for reading PDFs? Maybe OCR isn't the best tool.
Finding: Right — OCR is the WRONG default for fillable/digital PDFs. Use a tiered strategy (most PDFs never need OCR):
- AcroForm-first read. If the PDF has a form layer, read each field's name/type/rect directly, with no inference: pdf-lib getForm().getFields() (already a humbits dep) or pdf.js page.getAnnotations(). Exact field names → map straight to the data schema.
- Digital text, no form fields → text+layout heuristic (pdf.js getTextContent() / pdfplumber char boxes). This is essentially today's detectFields path.
- Image-only scans → OCR/vision ONLY here: OCRmyPDF/Tesseract, PaddleOCR, or the current Claude vision.
Today @humbits/ocr defaults to Claude vision for PDFs, which burns cost+latency on machine-readable docs. Proposed next step: add an AcroForm-first reader to @humbits/ocr and only fall through to text-heuristic → vision. Full tool list (by case, with licenses) in the comment.
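The tier order above could be sketched as a small pure function. This is only an illustration under stated assumptions: `PdfProbe`, `chooseTier`, and the `MIN_CHARS_PER_PAGE` threshold are hypothetical names and values, not part of @humbits/ocr. The probe counts would presumably come from something like pdf-lib's getForm().getFields() and pdf.js's getTextContent().

```typescript
// Hypothetical probe result; none of these names come from @humbits/ocr.
interface PdfProbe {
  formFieldCount: number; // e.g. length of pdf-lib getForm().getFields()
  textCharCount: number;  // e.g. total characters across pdf.js getTextContent() items
  pageCount: number;
}

type ReadTier = "acroform" | "text-heuristic" | "ocr-vision";

// Assumed threshold: fewer extractable characters per page than this is
// treated as an image-only scan. Tune it against real documents.
const MIN_CHARS_PER_PAGE = 20;

function chooseTier(probe: PdfProbe): ReadTier {
  // Tier 1: a form layer exists, so field names can be read directly.
  if (probe.formFieldCount > 0) return "acroform";

  // Tier 2: no form fields, but the PDF carries machine-readable text.
  const pages = Math.max(1, probe.pageCount);
  if (probe.textCharCount / pages >= MIN_CHARS_PER_PAGE) return "text-heuristic";

  // Tier 3: nothing machine-readable, so fall through to OCR/vision.
  return "ocr-vision";
}
```

Keeping the decision separate from the readers means the expensive vision call is reached only when both cheaper probes come back empty, and the threshold can be adjusted without touching any reader.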