Solved

Streaming upload to CDF Files



Hi there!

I have a use case where a file is uploaded by a user to an API, and the API then uploads the file to CDF Files. We want to avoid holding the full file in memory at any one time, and therefore must stream the file contents from the request handler directly into CDF Files.

There are two ways of achieving this:

  • Stream the request body from the request handler directly into CDF Files’ upload URL
  • Chunk the request body and upload each chunk as a separate request.

The first option may be achievable, but I don’t believe the second option is possible.
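For reference, the first option would look roughly like this. This is just a minimal sketch: aiohttp is used purely for illustration, and the upload URL is a placeholder for whatever the Files API returns when the file metadata is created:

```python
import aiohttp
from aiohttp import web

async def handle_upload(request: web.Request) -> web.Response:
    # Placeholder: in practice this would be the uploadUrl returned by
    # the Files API when the file metadata is created.
    upload_url = "https://example.com/upload-url"

    # request.content is a StreamReader, so aiohttp forwards the body
    # chunk by chunk instead of buffering the whole file in memory.
    async with aiohttp.ClientSession() as session:
        async with session.put(upload_url, data=request.content) as resp:
            return web.Response(status=resp.status)

app = web.Application()
app.add_routes([web.post("/upload", handle_upload)])

if __name__ == "__main__":
    web.run_app(app)
```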

Do you have any insight into whether it is possible to chunk a file upload like this with CDF Files?


Best answer by Dilini Fernando 3 July 2023, 13:22


10 replies


Hey @thomafred,

 

According to the documentation:

If the uploadUrl contains the string '/v1/files/gcs_proxy/', you can make a Google Cloud Storage (GCS) resumable upload request as documented in https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload.

 

Following that link, there are instructions for uploading files in chunks.
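For illustration, a chunked upload against such a URL could look roughly like this. This is an untested sketch: it assumes the uploadUrl behaves as a GCS resumable-upload session URI, that the total size is known up front, and that each chunk (except the last) is a multiple of 256 KiB, as the GCS docs require:

```python
import requests

# Chunk size must be a multiple of 256 KiB per the GCS resumable-upload docs.
CHUNK_SIZE = 8 * 256 * 1024  # 2 MiB

def upload_in_chunks(upload_url: str, stream, total_size: int) -> None:
    offset = 0
    while offset < total_size:
        chunk = stream.read(CHUNK_SIZE)
        end = offset + len(chunk) - 1
        resp = requests.put(
            upload_url,
            data=chunk,
            headers={"Content-Range": f"bytes {offset}-{end}/{total_size}"},
        )
        # GCS answers 308 (Resume Incomplete) for intermediate chunks
        # and 200/201 once the final chunk has been committed.
        if resp.status_code not in (200, 201, 308):
            resp.raise_for_status()
        offset += len(chunk)
```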

I hope it helps you. Feel free to let me know if you have any additional questions.


Will this work if our CDF tenant is located on Azure?


@thomafred could you double-check that? Using Azure as your IdP doesn't necessarily mean that your CDF resources are also hosted on Azure. But if they are, then I suppose you can leverage the documentation for Azure Blob Storage.


We are running CDF exclusively on Azure, and we are also using Azure as our IdP.

In other words: yes to both :p


@thomafred then, most probably, you need to use Put Block to upload the chunks and Put Block List to commit them after they are all uploaded, as described in the documentation. But this should be proven experimentally; I haven't dealt with that particular case myself.
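Sketched out, the flow might look like this. This is untested, and it assumes the uploadUrl is a SAS-style URL that already carries a query string, so parameters can be appended with `&`:

```python
import base64
import requests
from urllib.parse import quote

def upload_blocks(upload_url: str, stream, chunk_size: int = 4 * 1024 * 1024) -> None:
    block_ids = []
    index = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Block IDs must be base64-encoded and equally long within a blob.
        block_id = base64.b64encode(f"{index:08d}".encode()).decode()
        # Put Block: stage one chunk.
        requests.put(
            f"{upload_url}&comp=block&blockid={quote(block_id, safe='')}",
            data=chunk,
        ).raise_for_status()
        block_ids.append(block_id)
        index += 1

    # Put Block List: commit the staged blocks in order.
    block_list = (
        '<?xml version="1.0" encoding="utf-8"?><BlockList>'
        + "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
        + "</BlockList>"
    )
    requests.put(
        f"{upload_url}&comp=blocklist",
        data=block_list,
        headers={"Content-Type": "application/xml"},
    ).raise_for_status()
```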


Hi @thomafred,

I hope Roma’s reply has helped you. Let us know if you have more questions.  

Best regards,
Dilini

 


Hi @thomafred,

I will close this thread for now. If you have any further questions, please feel free to reply here.

Best regards,
Dilini  


Sorry about the delay; I have only now been able to follow up on this.

I did a quick proof-of-concept using Postman, and uploading with `Put Block` and `Put Block List` seems to work; however, there are some limitations.

First and foremost, it appears that I am not authorized to perform the `Get Block List` operation (https://learn.microsoft.com/en-us/rest/api/storageservices/get-block-list?tabs=azure-ad), and instead get the following error:

```xml
<?xml version="1.0" encoding="utf-8"?>
<Error>
  <Code>AuthorizationPermissionMismatch</Code>
  <Message>This request is not authorized to perform this operation using this permission.
RequestId:44177ea5-601e-0030-78d2-ada960000000
Time:2023-07-03T17:21:35.0950174Z</Message>
</Error>
```

I was able to do an async Python POC though :)
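The async variant looked roughly like this (heavily simplified, not the exact code; aiohttp, with `upload_url` being the SAS-style uploadUrl returned by CDF):

```python
import asyncio
import aiohttp
from urllib.parse import quote

async def put_block(session: aiohttp.ClientSession, upload_url: str,
                    block_id: str, chunk: bytes) -> None:
    url = f"{upload_url}&comp=block&blockid={quote(block_id, safe='')}"
    async with session.put(url, data=chunk) as resp:
        resp.raise_for_status()

async def upload(upload_url: str, blocks: list[tuple[str, bytes]]) -> None:
    async with aiohttp.ClientSession() as session:
        # Blocks may be staged concurrently and in any order; the order
        # passed to Put Block List determines the final blob layout.
        await asyncio.gather(*(
            put_block(session, upload_url, block_id, chunk)
            for block_id, chunk in blocks
        ))
        # Put Block List commit omitted here -- same as the snippet above.
```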

 


An update on this.

Block uploads work quite well. However, after a few minutes (typically around four, though the timing appears somewhat random), the upload may fail due to a read error on a block upload. There doesn't appear to be any response from the server.
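Retrying a failed Put Block at least appears safe, since re-staging a block with the same block ID simply overwrites the uncommitted block. A minimal retry wrapper (a sketch only, not a fix for the root cause) around the `put_block` coroutine above:

```python
import asyncio
import aiohttp

async def put_block_with_retry(session, upload_url, block_id, chunk,
                               attempts: int = 3) -> None:
    for attempt in range(attempts):
        try:
            # put_block is the coroutine from the earlier snippet.
            await put_block(session, upload_url, block_id, chunk)
            return
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise
            # Simple exponential backoff before re-staging the block.
            await asyncio.sleep(2 ** attempt)
```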


Just for documentation, in case others are interested, there is a follow-up post on the issue above:

 
