AI APIs and Scanned PDFs issue

January 22, 2025
Taylor Zwick
Practitioner ⭐️⭐️⭐️

I see that the API has been updated with AI capabilities for documents. We tried it on some PDFs, and it worked nicely. However, I have one particular PDF, which appears to be a scanned document, and the API is not able to read it. I am trying to convert the file to text, but the vision API only supports JPG and PNG files. Do you know whether it is possible to convert PDFs to text using the Cognite libraries (for instance, contextualization)? Any ideas?

Mithila Jayalath
Expert ⭐️⭐️⭐️⭐️
January 23, 2025

@Que Tran, would you be able to help out here?

Tommy Thorsen
Practitioner ⭐️⭐️⭐️
January 24, 2025

Hi @Taylor Zwick!

Maybe I can help, since I work in the team that owns this document processing backend.

Generally speaking, we run OCR on documents and are able to extract text from scanned documents. However, we have identified a problem with extracting text from pages that mix digital characters and images. For instance, if the scanning tool has added a digital header, watermark, or page number to an otherwise scanned page, that trips up our OCR process.

Does your document match this description? If so, this might be your problem.

Obviously we are not happy with this situation and we are working to fix it. Our target is to have a fix ready by the end of Q1.

As a temporary workaround, if you want to test the AI features on a particular document, you could use a third-party tool to add an OCR layer to your PDF. I recommend OCRmyPDF for this. You may want to use the --redo-ocr or --force-ocr flag, as explained in the OCRmyPDF documentation.
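
A minimal sketch of that workaround, assuming OCRmyPDF is installed and its `ocrmypdf` command is on your PATH. The function names, the `mode` parameter, and the file names are made up for illustration; only the two flags come from the suggestion above. As far as I understand the flags, --redo-ocr keeps the digital text and re-runs OCR on the scanned parts, while --force-ocr rasterizes every page and runs OCR on all of it, so it is probably worth trying --redo-ocr first.

```python
import shutil
import subprocess
from pathlib import Path

# Maps an illustrative mode name to the OCRmyPDF flag mentioned in this thread.
OCR_FLAGS = {"redo": "--redo-ocr", "force": "--force-ocr"}


def build_ocrmypdf_command(src, dst, mode="redo"):
    """Return the argument list for: ocrmypdf <flag> <src> <dst>."""
    if mode not in OCR_FLAGS:
        raise ValueError(f"mode must be one of {sorted(OCR_FLAGS)}, got {mode!r}")
    return ["ocrmypdf", OCR_FLAGS[mode], str(src), str(dst)]


def add_ocr_layer(src, dst, mode="redo"):
    """Run OCRmyPDF on src and write a PDF with an OCR text layer to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    # check=True raises CalledProcessError if OCRmyPDF exits with an error.
    subprocess.run(build_ocrmypdf_command(src, dst, mode), check=True)
    return Path(dst)


if __name__ == "__main__":
    # Hypothetical file names; replace with your own scanned PDF.
    print(build_ocrmypdf_command("scanned.pdf", "scanned_ocr.pdf"))
```

Calling `add_ocr_layer("scanned.pdf", "scanned_ocr.pdf")` should leave you with a new PDF that you can upload and test instead of the original; if the text still cannot be read, retrying with `mode="force"` is the next thing to check.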
