Implemented

CDF Files: Adding File Size & Content Attributes (number of pages, file md5, file size, etc.)

Related products: API and SDKs
  • Noah Karsky
  • Diego Garzon
  • Bartosz Czernia
  • Aditya Kotiyal
  • Gayatri Babel
  • Sarah Behrens
  • Abram Ziegelaar
  • Asan Arifov
  • Mary Spanjers
  • Rajeev z ranjan

Request: When a file is uploaded, CDF should automatically generate key attributes related to file content and size, including but not limited to:

  • file size
  • file md5
  • number of pages in a file

Use Cases: 

  • file md5 - to compare files and eliminate duplicates
  • file size - to identify broken files and auto-delete them
  • number of pages in a file - contextualizing diagrams with 50 or more pages requires a different method (one that uses parse_diagrams) than contextualizing diagrams with fewer than 50 pages. The page count is needed to decide which method to use, and there is currently no way to get it (manually or automatically).

Currently these attributes are computed manually, but it would be far more useful and efficient to have this information automatically available in technical workflows. Furthermore, for contextualization of diagrams with 50+ pages, the page count will be necessary.
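As an illustration of the manual approach described above, here is a minimal Python sketch. It assumes the file bytes are already available locally; the names `FileAttributes`, `compute_attributes`, `find_duplicates`, `looks_broken` and `needs_large_diagram_method` are hypothetical and not part of any CDF API. The page count is taken as a plain input, since extracting it depends on the file format.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass

# Assumption: 50 pages is the cut-over point mentioned in the request.
LARGE_DIAGRAM_PAGES = 50


@dataclass(frozen=True)
class FileAttributes:
    name: str
    size_bytes: int
    md5: str


def compute_attributes(name: str, content: bytes) -> FileAttributes:
    """Derive file size and MD5 from the raw bytes of a file."""
    return FileAttributes(name, len(content), hashlib.md5(content).hexdigest())


def find_duplicates(files: list[FileAttributes]) -> list[list[str]]:
    """Group file names that share an MD5; only groups of two or more are returned."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for f in files:
        by_hash[f.md5].append(f.name)
    return [names for names in by_hash.values() if len(names) > 1]


def looks_broken(attrs: FileAttributes, min_size_bytes: int = 1) -> bool:
    """Flag files below a chosen size threshold (by default: empty files)."""
    return attrs.size_bytes < min_size_bytes


def needs_large_diagram_method(page_count: int) -> bool:
    """True when the diagram has 50 or more pages."""
    return page_count >= LARGE_DIAGRAM_PAGES
```

The size threshold for a "broken" file is a judgment call; an empty file is the only case this sketch treats as broken by default.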

11 replies

Anita Hæhre
Seasoned Practitioner
  • Head of Academy and Community
  • 590 replies
  • June 1, 2023
New → Gathering Interest

Tommy Thorsen
Practitioner

Hi Gayatri,

It is not so trivial to add this info to the Files API, but we have all of this information inside the document processing system that exposes the Documents API. The page count and the file size are already available in the Documents API. The hash is not there, but it would probably not be that hard to expose that as well. I can look into this.

Would it work for you to get this information from the Documents API? Bear in mind that the Documents API is eventually consistent with the Files API, and in some cases it can take some time from when you upload something to the Files API until it is available in the Documents API.
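Because of that delay, a caller that uploads a file and then immediately asks for its page count or size may need to retry. A minimal Python sketch of polling with exponential backoff follows. The `fetch` callable is a hypothetical stand-in for whatever Documents API lookup you use; it is assumed to return None until the document is available. No real SDK call is shown here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until_available(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 60.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch` until it returns a non-None value or the sleep budget is spent.

    The budget counts time spent sleeping, not time spent inside `fetch`.
    """
    waited = 0.0
    delay = initial_delay_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if waited >= timeout_s:
            return None  # gave up; the caller decides how to handle it
        pause = min(delay, timeout_s - waited)
        sleep(pause)
        waited += pause
        delay = min(delay * 2, max_delay_s)  # back off, up to a ceiling
```

The timeout and delay values are placeholders; suitable numbers depend on how long processing takes in your project.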


Noah Karsky
Practitioner
  • Data Engineer
  • 6 replies
  • June 2, 2023

Meh, it would work to have them in the Documents API, but with files being in the UI and in the SDKs I would prefer to have it in the Files API. Nonetheless, if adding the hash to the Documents API can be done more quickly, it would be great to have in the interim.


Tommy Thorsen
Practitioner

I guess it depends on what kind of UI you are using, but at least Fusion has switched from using the Files API to using the Documents API in order to get those extra bits of functionality that the Documents API provides. Look at the screenshot below for proof 😄

I’m with you on the SDK support, though. It’s a real shame we have not been able to add Python SDK support for the Documents API yet.


Noah Karsky
Practitioner
  • Data Engineer
  • 6 replies
  • June 4, 2023

ooohh I did not notice that! Thanks for showing! @Tommy Thorsen 


Anita Hæhre
Seasoned Practitioner
  • Head of Academy and Community
  • 590 replies
  • June 5, 2023
Gathering Interest → Planned for development

Bartosz Czernia

@Tommy Thorsen apologies if this is a basic question, but regarding the md5 of the file: it is generated from the full contents of the file, and not just the first MB of content that is available to preview.

Is this correct?


Anita Hæhre
Seasoned Practitioner
  • Head of Academy and Community
  • 590 replies
  • June 5, 2023

Hi @Bartosz Czernia, don't apologize for your question! It's great that you're reaching out to our community for help. Rest assured that we embrace all types of questions, no matter how basic they may seem. In fact, many others may have the same question but haven't asked yet. So by asking, you're not only helping yourself but also contributing to the collective knowledge of our community. We appreciate your engagement and encourage you to continue asking any questions you may have. We're here to support each other 🚀


Tommy Thorsen
Practitioner

@Bartosz Czernia the hash is made on the original binary content of the file, and not on the extracted plain text. Also, I don’t know if it matters to you, but the hash is not an md5 hash, it’s a sha256.
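To make the distinction concrete, here is a small Python sketch that computes a sha256 over a whole binary stream, with an optional byte limit to show what hashing only a prefix would look like. It uses only the standard library; `sha256_of_stream` and `CHUNK_SIZE` are hypothetical names, and this is not a description of how CDF computes the hash internally.

```python
import hashlib
import io
from typing import BinaryIO, Optional

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time to keep memory use flat


def sha256_of_stream(stream: BinaryIO, limit: Optional[int] = None) -> str:
    """Hash the stream's bytes; if `limit` is set, hash only the first `limit` bytes."""
    digest = hashlib.sha256()
    remaining = limit
    while remaining is None or remaining > 0:
        to_read = CHUNK_SIZE if remaining is None else min(CHUNK_SIZE, remaining)
        chunk = stream.read(to_read)
        if not chunk:
            break  # end of stream
        digest.update(chunk)
        if remaining is not None:
            remaining -= len(chunk)
    return digest.hexdigest()


# Two files that agree on the first MiB but differ afterwards.
shared_prefix = b"\x00" * CHUNK_SIZE
file_a = shared_prefix + b"A"
file_b = shared_prefix + b"B"

same_prefix_hash = (
    sha256_of_stream(io.BytesIO(file_a), limit=CHUNK_SIZE)
    == sha256_of_stream(io.BytesIO(file_b), limit=CHUNK_SIZE)
)
same_full_hash = (
    sha256_of_stream(io.BytesIO(file_a)) == sha256_of_stream(io.BytesIO(file_b))
)
```

A hash over only the first MiB would treat these two files as identical, while a hash over the full binary content tells them apart, which is what matters for duplicate detection.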


Tommy Thorsen
Practitioner

The hash is now available from the Document Search API. See documentation here.


Tommy Thorsen
Practitioner
Planned for development → Implemented
