Hi Raj!
My name is Tommy and I’m the tech lead for the team that owns the Documents API. I’ll try to answer your questions as best I can:
Can we go beyond the 1MB limit to allow OCR, indexing, and search of any free text within the documents? Is this configurable?
The 1MB limit is not on the source file; it is on the extracted plain text. The source files can be as large as 2GB, which I think should cover most things we normally call documents.
The 1MB limit on the extracted plain text is also mainly intended to protect our systems against files that aren’t really documents. A regular document will almost never have more than 1MB of plain text in it.
Let’s try some really rough math: a common estimate for a page that contains nothing but text is about 3000 characters per page. Looking a bit closer at the implementation, it seems our limit is actually not 1MB of text but 1 million characters. (This can matter if your documents contain many multi-byte characters.) Anyway, 1M characters amounts to over 300 pages of nothing but text. If you consider that most documents have images and illustrations that take up a good amount of space, the real maximum page count is much higher.
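To make both the page-count math and the character-vs-byte distinction concrete, here is a tiny Python sketch (the 3000 characters/page figure is just the rough estimate above, not a measured value):

```python
# Rough estimate of how many text-only pages fit under the limit,
# assuming ~3000 characters per page as in the rough math above.
CHAR_LIMIT = 1_000_000
CHARS_PER_PAGE = 3000

print(CHAR_LIMIT / CHARS_PER_PAGE)  # ~333 pages of nothing but text

# The limit counts characters, not bytes, which matters for
# multi-byte scripts: this Japanese string is 5 characters
# but 15 bytes in UTF-8.
text = "こんにちは"
print(len(text))                  # 5 characters (what the limit counts)
print(len(text.encode("utf-8")))  # 15 bytes
```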
In practice, we very rarely see documents that actually hit this limit. If you do have a lot of documents that end up getting truncated, let me know, and we will probably be able to do something about it.
LLM/NLP Search: What level of fuzzy or NLP search is supported on the indexed content of a document? For example, could one search for a phrase that isn’t verbatim in the file, but close enough in meaning, or spread out across a couple of places within the document?
We do some NLP steps to enable some degree of fuzziness on the indexed content:
- We auto-detect the language of the document. Based on the language, we may do some additional processing:
  - Case normalization (allowing for case-insensitive search)
  - Stop word removal (removing little filler words from the indexed content and from the queries)
  - Stemming (replacing word forms with their stems, to be able to match, for instance, “pumps” with “pumping”)
  - Breaking up compound words for Germanic languages
These NLP features are currently available for English, German, Norwegian and Japanese.
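To give a feel for what these steps do, here is a minimal sketch of the same kind of normalization using off-the-shelf NLTK components for English. This is illustrative only; it is not the pipeline the Documents API actually runs:

```python
# A toy version of the normalization steps listed above, using NLTK.
# Not our actual implementation; just the same idea for English.
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def normalize(text: str) -> list[str]:
    # Case normalization: everything to lower case.
    tokens = text.lower().split()
    # Stop word removal: drop little filler words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming: reduce word forms to a common stem,
    # so "pumps" and "pumping" both become "pump".
    return [STEMMER.stem(t) for t in tokens]

# Both the indexed content and the query get the same treatment,
# so a query for "pumping" matches a document mentioning "pumps".
print(normalize("The pumps were pumping"))  # ['pump', 'pump']
```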
Automation: I understand we can auto-extract file/folder names and even text from the initial portion of a document to create metadata. Can this metadata be used in an automated entity matching service to automatically relate (contextualize) a new document to the parent field, well, facility, etc.?
- Can the same automation work by relying purely on the text within the OCRed document, to avoid a human having to categorize files into folders and to ensure naming standards in the first place? What are the capabilities/limits we should be aware of?
The document processing pipeline does run a fully automated contextualization step after extracting the plain text from a document: in the extracted plain text, we search for any asset names (from the Assets API; no data modelling support quite yet) and automatically connect documents to assets based on these detected mentions. These connections are returned as a list of asset ids from the Documents API.
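In spirit, that step works roughly like the sketch below. The real pipeline is more involved, and the asset names and ids here are made up for illustration; in reality the name-to-id mapping comes from the Assets API:

```python
# A toy version of the contextualization step: scan the extracted
# plain text for known asset names and collect the matching asset ids.
import re

# name -> asset id (hypothetical values; really fetched from the Assets API)
ASSETS = {
    "23-HA-9114": 1001,
    "23-PA-9110": 1002,
}

def detect_asset_ids(plain_text: str) -> list[int]:
    found = []
    for name, asset_id in ASSETS.items():
        # Word-boundary match, so an asset name doesn't match
        # inside a longer identifier.
        if re.search(rf"\b{re.escape(name)}\b", plain_text):
            found.append(asset_id)
    return found

text = "Inspection report for pump 23-PA-9110 at the main facility."
print(detect_asset_ids(text))  # [1002]
```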