CDF Files: Adding File Size & Content Attributes (number of pages, file md5, file size, etc.)

Related products: API and SDKs

Request: When a file is uploaded, CDF should automatically generate key attributes related to the file's content and size, including but not limited to:

  • file size
  • file md5
  • number of pages in a file

Use Cases: 

  • file md5 - to compare files and eliminate duplicates
  • file size - to identify broken files and auto-delete them
  • number of pages in a file - contextualization of diagrams with 50 or more pages requires a different method (uses parse_diagrams) than contextualization of diagrams with fewer than 50 pages. The page count is needed to determine which contextualization method to use, and there is currently no way to get it (manually or automatically).

Currently these attributes are implemented manually, but it would be incredibly useful and more efficient to have this information automatically available in technical workflows. Furthermore, in the case of contextualization of 50+ pages, it will be necessary. 
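
To make the three use cases concrete, here is a minimal sketch of how the attributes could be consumed once they are available. The `FileInfo` record and its field names are hypothetical and do not reflect CDF's actual schema, and treating a zero-byte file as "broken" is an assumption made only for illustration. The 50-page threshold comes from the request above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Threshold taken from the request: diagrams with 50+ pages need a different method.
PAGE_THRESHOLD = 50


@dataclass
class FileInfo:
    """Hypothetical metadata record; field names are illustrative only."""
    name: str
    size: int                  # size in bytes
    content_hash: str          # hash of the file contents
    page_count: Optional[int]  # None when the page count is unknown


def find_duplicates(files: List[FileInfo]) -> List[Tuple[str, str]]:
    """Return (duplicate_name, first_seen_name) pairs for files sharing a hash."""
    first_seen: Dict[str, str] = {}
    duplicates: List[Tuple[str, str]] = []
    for f in files:
        if f.content_hash in first_seen:
            duplicates.append((f.name, first_seen[f.content_hash]))
        else:
            first_seen[f.content_hash] = f.name
    return duplicates


def looks_broken(f: FileInfo) -> bool:
    """Assumption: a zero-byte file is considered broken."""
    return f.size == 0


def needs_large_diagram_method(f: FileInfo) -> bool:
    """True when the file has 50 or more pages; unknown counts return False."""
    return f.page_count is not None and f.page_count >= PAGE_THRESHOLD
```

A caller would have to decide separately what to do when `page_count` is unknown; this sketch simply treats such files as not needing the large-diagram method.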

New → Gathering Interest

Hi Gayatri,

It is not trivial to add this info to the Files API, but we have all of this information inside the document processing system that exposes the Documents API. The page count and the file size are already available in the Documents API. The hash is not there, but it would probably not be that hard to expose it as well. I can look into this.

Would it work for you to get this information from the Documents API? Bear in mind that the Documents API is eventually consistent with the Files API, and in some cases it can take some time from when you upload something to the Files API until it is available in the Documents API.
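
Because of that delay, client code that reads a freshly uploaded file's attributes may need to retry. Below is a minimal polling sketch with exponential backoff. The `fetch` callable is hypothetical: it stands in for whatever lookup you use and is assumed to return `None` until the document is available. No real SDK call is shown here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until_available(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 60.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[T]:
    """Call `fetch` until it returns a non-None value or the timeout expires.

    `fetch` is a hypothetical lookup supplied by the caller. Returns the
    fetched value, or None if nothing showed up before the deadline.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            return None
        sleep(delay)
        # Double the wait between attempts, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.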


Meh, it would work to have them in the Documents API, but with files being in the UI and in the SDKs, I would prefer to have them in the Files API. Nonetheless, if adding the hash to the Documents API is quicker, it would be great to have that in the interim.


I guess it depends on what kind of UI you are using, but at least Fusion has switched from using the Files API to using the Documents API in order to get those extra bits of functionality that the Documents API provides. Look at the screenshot below for proof 😄

I’m with you on the SDK support, though. It’s a real shame we have not been able to add Python SDK support for the Documents API yet.


ooohh I did not notice that! Thanks for showing! @Tommy Thorsen 


Gathering Interest → Planned for development

@Tommy Thorsen apologies if this is a basic question, but for the md5 of the file, this is generated based on the full contents of the file and not just the 1st MB of contents that is available to preview. 

Is this correct?


Hi @Bartosz Czernia, don't apologize for your question! It's great that you're reaching out to our community for help. Rest assured that we embrace all types of questions, no matter how basic they may seem. In fact, many others may have the same question but haven't asked yet. So by asking, you're not only helping yourself but also contributing to the collective knowledge of our community. We appreciate your engagement and encourage you to continue asking any questions you may have. We're here to support each other 🚀


@Bartosz Czernia the hash is computed on the original binary content of the file, not on the extracted plain text. Also, I don’t know if it matters to you, but the hash is not an MD5 hash; it’s a SHA-256.
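
For comparing a local copy against the reported hash, here is a minimal sketch that computes a SHA-256 digest over the full binary content of a file using Python's standard `hashlib`. It reads in chunks so large files do not need to fit in memory. The function name is hypothetical, and whether the service reports the digest as a lowercase hex string is an assumption that this thread does not confirm.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of the file's full binary content."""
    digest = hashlib.sha256()
    # Open in binary mode so the raw bytes are hashed, not decoded text.
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest can then be treated as duplicates, which covers the deduplication use case from the original request.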


The hash is now available from the Document Search API. See documentation here.


Planned for development → Implemented