I have implemented block-updates to work with CDF Files. This by itself works quite well, as documented here.
However, for very large files there are some problems. After some time, typically around 4 minutes, one of the block updates fails with a simple `ReadError`. There is no response from the API.
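For context, the upload is essentially Azure block-blob staging against the pre-signed upload URL we get from CDF. Below is a simplified sketch of the flow; the function names are illustrative and not our actual `async_spec` code. The staging call corresponds to the `comp=block` PUTs in the logs further down, while the commit step is the standard Azure block-list call and is only sketched for completeness:

```python
import httpx

# upload_url is the pre-signed CDF upload URL (the "...?sig=REDACTED&..." URL in the logs).


async def stage_block(client: httpx.AsyncClient, upload_url: str, block_id: str, data: bytes) -> None:
    # Each buffered chunk is staged as one block; the logs below show "201 Created" per block.
    response = await client.put(f"{upload_url}&comp=block&blockid={block_id}", content=data)
    response.raise_for_status()


async def commit_blocks(client: httpx.AsyncClient, upload_url: str, block_ids: list[str]) -> None:
    # After the final chunk is flushed, the staged blocks are committed as a single blob.
    body = (
        "<?xml version='1.0' encoding='utf-8'?><BlockList>"
        + "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
        + "</BlockList>"
    )
    response = await client.put(f"{upload_url}&comp=blocklist", content=body)
    response.raise_for_status()
```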
Some logs from our application:
DEBUG:odp.odcat.storage.cdffs.async_spec:Uploaded 133120 bytes to https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b
DEBUG:odp.odcat.storage.cdffs.async_spec:Total bytes uploaded: 2660989807
DEBUG:odp.odcat.storage.cdffs.async_spec:Data: len=5376000, MD5=6a0c0db1b66dac18ee28e8b082830667
DEBUG:odp.odcat.storage.cdffs.async_spec:Buffer cursors: start=0, end=5242880, final=False
DEBUG:httpx._client:HTTP Request: PUT https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA94= "HTTP/1.1 201 Created"
DEBUG:odp.odcat.storage.cdffs.async_http_client:status code returned: 201
DEBUG:odp.odcat.storage.cdffs.async_spec:Uploaded 5242880 bytes to https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b
DEBUG:odp.odcat.storage.cdffs.async_spec:Total bytes uploaded: 2666232687
DEBUG:odp.odcat.storage.cdffs.async_spec:Data: len=5376000, MD5=6a0c0db1b66dac18ee28e8b082830667
DEBUG:odp.odcat.storage.cdffs.async_spec:Buffer cursors: start=5242880, end=10485760, final=False
DEBUG:httpx._client:HTTP Request: PUT https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA98= "HTTP/1.1 201 Created"
DEBUG:odp.odcat.storage.cdffs.async_http_client:status code returned: 201
DEBUG:odp.odcat.storage.cdffs.async_spec:Uploaded 133120 bytes to https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b
DEBUG:odp.odcat.storage.cdffs.async_spec:Total bytes uploaded: 2666365807
DEBUG:odp.odcat.storage.cdffs.async_spec:Data: len=4772470, MD5=474504b7e60e0163b07bdd85bb47bfe2
DEBUG:odp.odcat.storage.cdffs.async_spec:Buffer cursors: start=0, end=4772470, final=True
ERROR:odp.odcat.storage.cdffs.async_files:A transport error occurred while requesting https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA+A=.
WARNING:odp.odcat.storage.cdffs.async_spec:, retrying in 1.0...
ERROR:odp.odcat.storage.cdffs.async_files:A transport error occurred while requesting https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA+A=.
WARNING:odp.odcat.storage.cdffs.async_spec:, retrying in 2.0...
ERROR:odp.odcat.storage.cdffs.async_files:A transport error occurred while requesting https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA+A=.
WARNING:odp.odcat.storage.cdffs.async_spec:, retrying in 4.0...
ERROR:odp.odcat.storage.cdffs.async_files:A transport error occurred while requesting https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA+A=.
WARNING:odp.odcat.storage.cdffs.async_spec:, retrying in 8.0...
ERROR:odp.odcat.storage.cdffs.async_files:A transport error occurred while requesting https://westeurope-1.cognitedata.com/api/v1/files/storage/cognite/6723545981847516/7406155183547467/wod_gld_2013.nc?sig=REDACTED&rsct=application%2Foctet-stream&sp=cw&sr=b&comp=block&blockid=gAAAAAAAA+A=.
ERROR:blacksheep.server:Unhandled exception - "PATCH /data/catalog.hubocean.io/dataset/test-dataset/wod_gld_2013.nc"
Traceback (most recent call last):
  File "blacksheep/baseapp.pyx", line 84, in blacksheep.baseapp.BaseApplication.handle
  File "/home/tom/project/odf/BlackSheep/blacksheep/middlewares.py", line 6, in middleware_wrapper
    return await handler(request, next_handler)
  File "/home/tom/project/odf/app-odcat/sdk/odp/odcat/api/middlewares/prometheus_middleware.py", line 116, in __call__
    raise e from None
  File "/home/tom/project/odf/app-odcat/sdk/odp/odcat/api/middlewares/prometheus_middleware.py", line 113, in __call__
    response = await handler(request)
  File "/home/tom/project/odf/BlackSheep/blacksheep/server/normalization.py", line 459, in handler
    response = await method(*values)
  File "/home/tom/project/odf/app-odcat/controllers/storage-raw/odcat/storage_raw/handlers/raw_storage_handler.py", line 380, in upload_file_by_dataset_qname
    return await self._upload_file(
  File "/home/tom/project/odf/app-odcat/sdk/odp/odcat/utils/http_exc.py", line 31, in handler
    return ensure_response(await next_handler(*args, **kwargs))
  File "/home/tom/project/odf/app-odcat/controllers/storage-raw/odcat/storage_raw/handlers/raw_storage_handler.py", line 172, in _upload_file
    async with file_store.open(file_info, "wb") as fd:
  File "/usr/lib64/python3.10/contextlib.py", line 206, in __aexit__
    await anext(self.gen)
  File "/home/tom/project/odf/app-odcat/sdk/odp/odcat/storage/cdf/cdf_file_store.py", line 48, in open
    async with super().open(
  File "/usr/lib64/python3.10/contextlib.py", line 206, in __aexit__
    await anext(self.gen)
  File "/home/tom/project/odf/app-odcat/sdk/odp/odcat/storage/file_store.py", line 66, in open
    async with fs.open_async(str(fname), mode, **kwargs) as fd:
  File "/home/tom/.cache/pypoetry/virtualenvs/app-odcat-Dj0wURcA-py3.10/lib/python3.10/site-packages/fsspec/asyn.py", line 1020, in __aexit__
    await self.close()
  File "/home/tom/.cache/pypoetry/virtualenvs/app-odcat-Dj0wURcA-py3.10/lib/python3.10/site-packages/fsspec/asyn.py", line 979, in close
    await self.flush(force=True)
  File "/home/tom/.cache/pypoetry/virtualenvs/app-odcat-Dj0wURcA-py3.10/lib/python3.10/site-packages/fsspec/asyn.py", line 1012, in flush
    if await self._upload_chunk(final=force) is not False:
  File "/home/tom/project/odf/app-odcat/sdk/odp/odcat/storage/cdffs/async_spec.py", line 969, in _upload_chunk
    raise RuntimeError from cognite_exp
RuntimeError
INFO: 127.0.0.1:58422 - "PATCH /data/catalog.hubocean.io/dataset/test-dataset/wod_gld_2013.nc HTTP/1.1" 500 Internal Server Error
My first thought was that the upload URL had expired, causing this issue. However, after conferring with a colleague, I learned the upload URL should be valid for a week, so that shouldn't be the problem.
Any idea what may be happening here?
Hello Thomas, my suggestion for resolving this issue would be to implement a retry strategy with back-off for uploading blocks. I have noticed that there is no request ID in the error logs, which makes it difficult to investigate further on our end.
There is an exponential backoff retry strategy already implemented (double the wait for each try, max 10 seconds). This should also be visible in the top half of the log.
Not sure if it is possible to provide a request ID, as I am not getting any response from the server.
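For reference, the retry loop behaves roughly like this (simplified sketch; the function name and exception handling are illustrative, not our exact code):

```python
import asyncio

import httpx


async def put_with_backoff(client: httpx.AsyncClient, url: str, data: bytes,
                           attempts: int = 5, max_wait: float = 10.0) -> httpx.Response:
    # Transport errors are retried with exponentially growing waits: 1, 2, 4, 8, ... capped at max_wait.
    wait = 1.0
    for attempt in range(attempts):
        try:
            response = await client.put(url, content=data)
            response.raise_for_status()
            return response
        except httpx.TransportError:
            if attempt == attempts - 1:
                raise  # out of retries; the caller turns this into the RuntimeError in the traceback above
            await asyncio.sleep(wait)
            wait = min(wait * 2, max_wait)
```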
That’s true, I see it now. I will try to reach out to someone from the storage team on our side. What is the size of the file to be uploaded, by the way?
@thomafred could you please create a support ticket about that problem?
@thomafred Can you provide some more information about the response you receive when your application starts retrying?
The response is logged as "A transport error occurred while requesting ...". A request ID for one of these requests would be helpful for us to investigate the matter.
The last line of your log:
500 Internal Server Error
Is this something coming from Azure or your application?
My apologies for the confusion. The logs are from our own internal service; the 500 error is generated in response to the user request. We are not receiving any response from the CDF API.
@thomafred I did find references to your file in our logs, but I would need some more information to fully investigate the issue. Can you retry the upload and record the request ids of the successful requests until you fail to get responses?
Here are the response headers of the last few PUT requests:
| date | x-ms-request-id | x-ms-content-crc64 |
| --- | --- | --- |
| Thu, 20 Jul 2023 07:20:56 GMT | 7fe59a18-d01e-0067-34da-ba1b5d000000 | KCEynvNwPkg= |
| Thu, 20 Jul 2023 07:20:56 GMT | 1c6e3a95-f01e-0065-63da-ba4d59000000 | ULboHdXz6Oo= |
| Thu, 20 Jul 2023 07:20:57 GMT | 43b3866c-701e-007d-03da-baa56a000000 | AwfypHp6avE= |
| Thu, 20 Jul 2023 07:20:57 GMT | 7f69797a-d01e-0077-7cda-baab7f000000 | 1BtDPMJLLWo= |
| Thu, 20 Jul 2023 07:20:58 GMT | f4de6d2c-101e-003f-14da-ba207f000000 | OOiCTQl/tsg= |
@thomafred Thanks for sharing the details and apologies for the delay in getting back. We’re looking into this and will share an update as soon as we have any more information.
@thomafred would you mind sharing some more request ids? I haven’t found anything obvious yet.
And does the upload consistently fail after 4 minutes?
How are you performing the upload? Do you mind sharing your script?
Based on the available information, this is just as likely to be an issue with the calling service and/or its network connectivity. Actionable data for us would be request IDs from failure responses from the CDF API services, or, if that is not available or no response is received at all, the timestamps and other details (URL, request method and headers) of the requests to CDF that fail.
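Since the traceback shows the upload going through httpx, one relatively painless way to capture those details is with httpx event hooks; something along these lines (an illustrative sketch, not a requirement):

```python
import logging

import httpx

logger = logging.getLogger("cdf.upload.trace")


async def log_request(request: httpx.Request) -> None:
    # The logging timestamp plus method, URL and headers covers the request side
    # (redact the sig/token from the URL before sharing).
    logger.info("request: %s %s headers=%s", request.method, request.url, dict(request.headers))


async def log_response(response: httpx.Response) -> None:
    # x-ms-request-id is what lets us find the request in the storage logs.
    logger.info(
        "response: %s %s -> %s x-ms-request-id=%s",
        response.request.method,
        response.request.url,
        response.status_code,
        response.headers.get("x-ms-request-id"),
    )


client = httpx.AsyncClient(event_hooks={"request": [log_request], "response": [log_response]})
```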
Apologies for the late response. I have been on holiday and have not had a chance to follow up until now.
Here's an excerpt from our logs containing the information you need. Note that this is from a separate test run from the one that produced the request IDs posted earlier.
I did some more digging, and it appears to be the last block that is failing. This block is significantly larger than the other blocks: 4772525 bytes compared to 133120 bytes (4.55 MiB and 0.13 MiB, respectively).
A quick look at the Azure docs shows that the maximum block size for service versions prior to 2016-05-31 is 4 MiB.
Can you confirm that the maximum block size is 4 MiB?
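For reference, the arithmetic behind those numbers, as a quick sanity check against the 4 MiB figure from the pre-2016-05-31 docs:

```python
MIB = 1024 * 1024

last_block = 4_772_525      # bytes, the block that fails
regular_block = 133_120     # bytes, the blocks that succeed

print(f"{last_block / MIB:.2f} MiB")     # 4.55 MiB
print(f"{regular_block / MIB:.2f} MiB")  # 0.13 MiB
print(last_block > 4 * MIB)              # True -> would exceed a 4 MiB block limit
```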
Hi @thomafred, the response header `x-ms-version` indicates the service version used to serve the request. From the logs it appears that service version 2020-06-12 was used, for which the maximum block size should be 4000 MiB.
I have also done some tests with much bigger block sizes (40 MB, 100 MB, 400 MB) and could not observe any issues.
Hi Alex and thanks for the input.
I can confirm that increasing the block size did not negatively impact the implementation.
Furthermore, it seems that there may have been a problem on our end related to how the block IDs were encoded. How this resulted in a transport error baffles me a bit, but overall everything appears to run smoothly now.
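For anyone hitting something similar: the failing request in the original log had `blockid=gAAAAAAAA+A=`, i.e. a base64 block ID with a literal `+` in the query string, whereas the successful block IDs contained no `+`. I can't say for certain that this was the cause, but the safer encoding we ended up with looks roughly like this (names illustrative, not our exact code):

```python
import base64
from urllib.parse import quote


def make_block_id(index: int) -> str:
    # All block IDs for a blob must be the same length; an 8-byte counter is one option.
    return base64.b64encode(index.to_bytes(8, "big")).decode("ascii")


def block_url(upload_url: str, index: int) -> str:
    # quote(..., safe="") percent-encodes '+', '/' and '=' (e.g. '+' -> '%2B'),
    # so the base64 block ID survives the query string unchanged.
    return f"{upload_url}&comp=block&blockid={quote(make_block_id(index), safe='')}"
```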