
AI APIs and Scanned PDFs issue

  • January 22, 2025
  • 2 replies
  • 30 views

Taylor Zwick
Practitioner

I see that the API has been updated with AI capabilities for documents. We tried it on some PDFs, and it worked nicely. However, I have one particular PDF that seems to be a scanned version, and the API is not able to read it. I am trying to convert the file to text, but the vision API only supports JPG and PNG files. Do you know if it is possible to convert PDFs to text somehow using the Cognite libraries (for instance contextualization)? Any ideas?

2 replies

Mithila Jayalath
Seasoned Practitioner

@Que Tran will you be able to help out here?


Tommy Thorsen
Practitioner

Hi @Taylor Zwick!

Maybe I can help, since I work in the team that owns this document processing backend.

Generally speaking, we do OCR on documents and are able to extract text from scanned documents. However, we have identified a problem with extracting text from pages with a mix of digital characters and images. For instance, if you have a scanned document where the scanning tool has added a header or a watermark or a page number, that trips our OCR process up.

Does your document match this description? If so, this might be your problem.

Obviously we are not happy with this situation and we are working to fix it. Our target is to have a fix ready by the end of Q1.

As a temporary workaround, if you want to test AI on a particular document, you could use a third-party tool to add an OCR layer to your PDF. I recommend OCRmyPDF for this. You may want to use the --redo-ocr or --force-ocr flags as explained here.
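
To make the workaround concrete, here is a rough sketch of how it could be scripted from Python by calling the OCRmyPDF command-line tool. It assumes the ocrmypdf executable is installed and on your PATH. The helper names (build_ocrmypdf_command, add_ocr_layer) and the file names are made up for illustration and are not part of any Cognite library.

```python
import shutil
import subprocess
from pathlib import Path

# Map a short mode name to the OCRmyPDF flag mentioned above.
# The two flags are alternatives, so only one is passed per run.
OCR_FLAGS = {"redo": "--redo-ocr", "force": "--force-ocr"}


def build_ocrmypdf_command(src, dst, mode="redo"):
    """Return the argument list for one OCRmyPDF run (hypothetical helper)."""
    if mode not in OCR_FLAGS:
        raise ValueError(f"mode must be one of {sorted(OCR_FLAGS)}, got {mode!r}")
    return ["ocrmypdf", OCR_FLAGS[mode], str(src), str(dst)]


def add_ocr_layer(src, dst, mode="redo"):
    """Run OCRmyPDF on src and write the result to dst (hypothetical helper)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH; install OCRmyPDF first")
    # check=True raises CalledProcessError if OCRmyPDF exits with an error.
    subprocess.run(build_ocrmypdf_command(src, dst, mode), check=True)
    return Path(dst)
```

For example, add_ocr_layer("scan.pdf", "scan_ocr.pdf") would write a copy with a fresh text layer, which you could then upload and test again. If the mixed digital text and scanned images still cause trouble, mode="force" is the heavier option to try, since it rasterizes every page before running OCR.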

