Planned for development → Implemented
The hash is now available from the Document Search API. See documentation here.
Hi Raj!

My name is Tommy and I’m the tech lead for the team that owns the Documents API. I’ll try to answer your questions as best I can:

"Can we go beyond the 1MB limit to allow OCR, indexing and search of any free text within the documents? Is this configurable?"

The 1MB limit is not on the source file; it is on the extracted plain text. The source files can be as large as 2GB, which I think should cover most things we normally call documents.

The 1MB limit on the extracted plain text is mainly intended to protect our systems against files that aren’t really documents. A regular document will almost never have more than 1MB of plain text in it.

Let’s try some really rough math: a common estimate for a page that contains nothing but text is 3000 characters. Looking a bit closer at the implementation, it seems our limit is not actually 1MB of text, but 1 million characters. (This can matter if you have many multi-byte characters in your documents.) Anyway, 1M c
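For readers who want to play with the numbers in the post above, here is a minimal Python sketch. It uses only the two figures mentioned there (about 3000 characters per text-only page, and a limit of 1 million characters) and shows why a character limit and a byte limit can differ. The function names are made up for illustration, and UTF-8 is an assumption about the encoding, not something the post states.

```python
from typing import Tuple

# Figures taken from the post above; everything else is illustrative.
CHARS_PER_PAGE = 3000      # rough estimate for a page with nothing but text
CHAR_LIMIT = 1_000_000     # limit on extracted plain text, in characters


def pages_within_limit(char_limit: int = CHAR_LIMIT,
                       chars_per_page: int = CHARS_PER_PAGE) -> int:
    """Whole text-only pages that fit under the character limit."""
    return char_limit // chars_per_page


def char_and_byte_counts(text: str) -> Tuple[int, int]:
    """Return (characters, UTF-8 bytes) for a piece of extracted text.

    UTF-8 is an assumption here; the two numbers differ whenever the
    text contains multi-byte characters.
    """
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    print(pages_within_limit())          # 333
    print(char_and_byte_counts("abc"))   # (3, 3)
    print(char_and_byte_counts("æøå"))   # (3, 6)
```

With a limit counted in characters, text full of multi-byte characters can exceed 1MB in bytes and still fit under the limit.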
Hi Robert!

I strongly agree with you, and I have been making the same argument myself recently. I have already put an item on the backlog to make this change.

I don’t have a specific timeline for this yet, but your comment will certainly help get this prioritized higher.

Thanks 😊
@Bartosz Czernia the hash is computed on the original binary content of the file, not on the extracted plain text. Also, I don’t know if it matters to you, but the hash is not an MD5 hash; it’s SHA-256.
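If you want to compare a local file against that hash, a minimal sketch using Python's standard hashlib could look like the one below. It hashes the raw bytes of the file, as described above. The helper name is hypothetical, and the assumption that the API returns the hash as a lowercase hex string is mine, so check the documentation for the actual format before comparing.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file's raw binary content.

    The file is read in chunks so that large source files do not have
    to fit in memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be something like `sha256_of_file("report.pdf") == hash_from_api`, where `hash_from_api` is whatever value you read from the search response (assumed hex-encoded here).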
I guess it depends on what kind of UI you are using, but at least Fusion has switched from using the Files API to using the Documents API in order to get the extra functionality that the Documents API provides. Look at the screenshot below for proof 😄

I’m with you on the SDK support, though. It’s a real shame we have not been able to add Python SDK support for the Documents API yet.
Hi Gayatri,

It is not so trivial to add this info to the Files API, but we have all of this information inside the document processing system that exposes the Documents API. The page count and the file size are already available in the Documents API. The hash is not there, but it would probably not be that hard to expose it as well. I can look into this.

Would it work for you to get this information from the Documents API? Bear in mind that the Documents API is eventually consistent with the Files API, and in some cases it can take some time from when you upload something to the Files API until it is available in the Documents API.
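Because of that eventual consistency, a client that uploads a file and then immediately asks the Documents API about it may need to retry. Below is a minimal polling sketch in Python. It is not part of any SDK: `wait_until_available` and the `lookup` callable are hypothetical, and you would plug in your own function that queries the Documents API and returns `None` while the document is not there yet. The timeout and interval values are arbitrary placeholders.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until_available(lookup: Callable[[], Optional[T]],
                         timeout_s: float = 60.0,
                         interval_s: float = 2.0) -> Optional[T]:
    """Call `lookup` repeatedly until it returns something other than None.

    `lookup` is a caller-supplied (hypothetical) function that queries the
    Documents API for the uploaded file. Returns the first non-None result,
    or None if the timeout passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = lookup()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)
```

A fixed interval keeps the sketch short; for long waits, an interval that grows between attempts would put less load on the API.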