Background
We have an ongoing project where we need to deploy Cognite Functions and other resources (TimeSeries and Sequences) regularly, both during development/debugging and when publishing new versions of our product.
Our current solution works like this:
- We maintain a “deployment template” YAML file that defines all the “stuff” that needs to be deleted/created during a redeployment. This includes Cognite Functions, TimeSeries, and Sequences. Each entry in the YAML contains all necessary data for creating the relevant resource.
- We have written some Python classes that perform resource-specific deployment (CogniteFunctionDeployer, TimeSeriesDeployer, SequenceDeployer). These behave quite similarly, with methods that back up data, delete, and recreate the resource.
- These classes are instantiated in our “deploy.py” script, which just iterates over all the resources defined in the template (first bullet above) and calls the “.deploy” method on each resource deployer class.
- The deploy script is called from within a GitHub workflow, “deployment.yaml”. Here we define a matrix strategy that lets us trigger the deployment of all functions and resources for all assets at the same time. (Note that we have logic that guarantees that no resource is deployed by two independent calls of the deploy script, so this should not be the cause of the errors we observe.)
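To make the pattern above concrete, here is a minimal sketch of it. It is an illustration only, not our actual code: the `ResourceDeployer` base class, the in-memory `FakeStore` that stands in for CDF, the `DEPLOYERS` mapping, and the template layout are all assumptions made for the example. The real deployers call the Cognite SDK instead of touching a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """In-memory stand-in for CDF (hypothetical, used only in this sketch)."""
    resources: dict = field(default_factory=dict)
    backups: dict = field(default_factory=dict)


class ResourceDeployer:
    """Backs up, deletes, and recreates one resource (hypothetical base class)."""

    def __init__(self, store: FakeStore, spec: dict):
        self.store = store
        self.spec = spec
        self.external_id = spec["external_id"]

    def backup(self) -> None:
        # Keep a copy of the existing resource, if there is one.
        if self.external_id in self.store.resources:
            self.store.backups[self.external_id] = self.store.resources[self.external_id]

    def delete(self) -> None:
        # Deleting a resource that does not exist is not an error here.
        self.store.resources.pop(self.external_id, None)

    def create(self) -> None:
        self.store.resources[self.external_id] = dict(self.spec)

    def deploy(self) -> None:
        self.backup()
        self.delete()
        self.create()


class CogniteFunctionDeployer(ResourceDeployer):
    """Placeholder; function-specific logic is omitted in this sketch."""


class TimeSeriesDeployer(ResourceDeployer):
    """Placeholder; time-series-specific logic is omitted in this sketch."""


class SequenceDeployer(ResourceDeployer):
    """Placeholder; sequence-specific logic is omitted in this sketch."""


# Maps the (assumed) "type" key of a template entry to its deployer class.
DEPLOYERS = {
    "function": CogniteFunctionDeployer,
    "time_series": TimeSeriesDeployer,
    "sequence": SequenceDeployer,
}


def deploy_all(store: FakeStore, template: list) -> list:
    """Deploy every template entry in order; return the deployed external IDs."""
    deployed = []
    for entry in template:
        deployer = DEPLOYERS[entry["type"]](store, entry)
        deployer.deploy()
        deployed.append(deployer.external_id)
    return deployed


# Template as it might look after loading the YAML file (layout is assumed).
TEMPLATE = [
    {"type": "time_series", "external_id": "asset1:func-1:input", "name": "Input"},
    {"type": "sequence", "external_id": "asset1:func-1:curve", "name": "Curve"},
    {"type": "function", "external_id": "asset1:func-1", "name": "func-1"},
]
```

Calling `deploy_all(FakeStore(), TEMPLATE)` mirrors what one invocation of the deploy script does for a single asset/function pair.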
This is basically our GitHub Actions workflow file:
name: Deployment
on:
  workflow_dispatch:
    inputs:
      log_level:
        type: choice
        description: 'Log level for deployment.'
        default: 'DEBUG'
        options:
          - 'DEBUG'
          - 'INFO'
          - 'WARNING'
jobs:
  sandbox_deployment:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        asset:
          - "asset1"
          - "asset2"
        function:
          - "func-1"
          - "func-2"
          - "func-3"
          - "func-4"
    environment: sandbox
    env:
      CDF_CLIENT_ID: ${{ secrets.CDF_CLIENT_ID }}
      CDF_CLIENT_SECRET: ${{ secrets.CDF_CLIENT_SECRET }}
      CDF_TENANT_ID: ${{ secrets.CDF_TENANT_ID }}
      TRIGGER_BRANCH_NAME: ${{ github.ref }}
      GITHUB_ACTOR: ${{ github.actor }}
      GITHUB_SHA: ${{ github.sha }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Poetry
        run: pip3 install poetry
      - name: Install dependencies
        run: |
          poetry config virtualenvs.create false
          poetry install
      - name: Generate requirements.txt
        run: |
          poetry export -f requirements.txt --output requirements.txt --without-hashes
      - name: Deploy ${{ matrix.asset }} to sandbox
        run: |
          python -u scripts/deploy.py sandbox ${{ matrix.asset }} ${{ matrix.function }} \
            --log_level=${{ inputs.log_level }}
Doing it this way also allows us to easily “deactivate” certain assets/functions during development and debugging.
The Problem
Lately, however, we have noticed that some of the Cognite Function deployments fail with the following error in CDF:
Function deployment failed unexpectedly. Please try again and contact Cognite Support if the problem persists.
The problem has persisted, and so here I am :)
Testing / Debugging
I want to avoid posting a “code dump” and expecting someone to find my problem, but the issue is a bit difficult to reproduce. I have, however, gone through the following test sequence by commenting out various parts of the matrix strategy:
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset1"
    function:
      - "func-1"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset1"
    function:
      - "func-1"
      - "func-2"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset2"
    function:
      - "func-1"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset2"
    function:
      - "func-1"
      - "func-2"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset1"
    function:
      - "func-1"
      - "func-2"
      - "func-3"
      - "func-4"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset2"
    function:
      - "func-1"
      - "func-2"
      - "func-3"
      - "func-4"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset1"
      - "asset2"
    function:
      - "func-1"
      - "func-2"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset1"
      - "asset2"
    function:
      - "func-1"
      - "func-2"
      - "func-3"
# THIS WORKS
strategy:
  matrix:
    asset:
      - "asset1"
      - "asset2"
    function:
      - "func-1"
      - "func-2"
      - "func-4"
# THIS FAILS
strategy:
  matrix:
    asset:
      - "asset1"
      - "asset2"
    function:
      - "func-1"
      - "func-2"
      - "func-3"
      - "func-4"
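It is not always the same function that fails. I have tried a lot of combinations, but the errors only arise when I deploy both assets and all four functions, i.e. when the matrix expands to 8 jobs. Every combination I tried that expands to 6 or fewer jobs worked.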
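Since a GitHub Actions matrix creates one job per combination of its axes, the test sequence can be summarized by the number of jobs each configuration produces. The sketch below does that arithmetic; the `matrix_jobs` helper and the `RUNS` table are just illustrative names, and the outcomes are copied from the tests listed above.

```python
from itertools import product

ASSETS = ["asset1", "asset2"]
FUNCTIONS = ["func-1", "func-2", "func-3", "func-4"]


def matrix_jobs(assets, functions):
    """Expand a two-axis matrix into (asset, function) jobs, one per combination."""
    return list(product(assets, functions))


# (assets, functions, observed outcome), in the order the tests are listed above.
RUNS = [
    (["asset1"], ["func-1"], "works"),
    (["asset1"], ["func-1", "func-2"], "works"),
    (["asset2"], ["func-1"], "works"),
    (["asset2"], ["func-1", "func-2"], "works"),
    (["asset1"], FUNCTIONS, "works"),
    (["asset2"], FUNCTIONS, "works"),
    (ASSETS, ["func-1", "func-2"], "works"),
    (ASSETS, ["func-1", "func-2", "func-3"], "works"),
    (ASSETS, ["func-1", "func-2", "func-4"], "works"),
    (ASSETS, FUNCTIONS, "fails"),
]

job_counts = [(len(matrix_jobs(a, f)), outcome) for a, f, outcome in RUNS]
largest_working = max(n for n, outcome in job_counts if outcome == "works")
smallest_failing = min(n for n, outcome in job_counts if outcome == "fails")

print(f"largest working matrix:  {largest_working} jobs")
print(f"smallest failing matrix: {smallest_failing} jobs")
```

So the only thing that distinguishes the failing run from the working ones, as far as I can tell, is the number of deployments that may be in flight at once (up to 8 instead of at most 6), not any particular asset or function.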
I should perhaps also say that all calls to the deploy script are completely independent. There is no overlap in the resources they create / delete during the course of deployment. All deployed resources end up in the same dataset.
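The "no overlap" claim can be checked mechanically. A minimal sketch of such a check, assuming each job can list the external IDs of the resources it touches; the `find_overlaps` helper, the job names, and the IDs below are made up for illustration and are not taken from our code base.

```python
from itertools import combinations


def find_overlaps(resources_by_job):
    """Return {(job_a, job_b): shared_ids} for every pair of jobs sharing a resource."""
    overlaps = {}
    for (job_a, ids_a), (job_b, ids_b) in combinations(resources_by_job.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(job_a, job_b)] = shared
    return overlaps


# Hypothetical external IDs touched by three of the matrix jobs.
RESOURCES_BY_JOB = {
    "asset1/func-1": ["asset1:func-1", "asset1:func-1:input"],
    "asset1/func-2": ["asset1:func-2", "asset1:func-2:input"],
    "asset2/func-1": ["asset2:func-1", "asset2:func-1:input"],
}

# An empty result means no two jobs create or delete the same resource.
print(find_overlaps(RESOURCES_BY_JOB))
```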
Is there anything in the way the deployments take place under the hood that could explain my observed behavior? Is this kind of matrix strategy deployment discouraged?