Hi! Welcome back to the Transformations group on Cognite Hub. By now you have tried out the Transformations SDK and API (I hope you liked them, and that you didn't miss the chance to share your experience or feedback with us). In this post, I will introduce you to the Transformations CLI. Before that, a bit about me and why I am writing this article: I'm Emel Varol, a Software Engineer on the Data Integrations team. I contributed to the design and development of the new Transformations CLI, and I can't wait to introduce it to you. Let's get started!
Please note: The new Transformations CLI replaces the old Jetfire CLI.
This article has two parts. The first part focuses on how to use the Transformations CLI on your local system or virtual machine for developing transformations. The second part focuses on using the CLI in version control systems like GitHub for CI/CD.
The following parts aim to summarise the documentation for the CLI and the GitHub Action. Please refer to the full documentation here for more details!
Transformations CLI for development
Prerequisites
- Python 3 and pip installed on your local machine or VM
- OIDC client credentials or an API key to use for authentication
Installation
pip install cognite-transformations-cli
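Once installed, you can check that the CLI is available and see the list of subcommands:
transformations-cli --help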
Authentication
Declare/set the environment variables below:
- API key: TRANSFORMATIONS_API_KEY and (optionally) TRANSFORMATIONS_PROJECT
- OIDC client credentials: TRANSFORMATIONS_CLIENT_ID, TRANSFORMATIONS_CLIENT_SECRET, TRANSFORMATIONS_TOKEN_URL, TRANSFORMATIONS_PROJECT and TRANSFORMATIONS_SCOPES (use TRANSFORMATIONS_AUDIENCE instead of TRANSFORMATIONS_SCOPES for Auth0); see the example below
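For example, with OIDC client credentials you could export something like this before running the CLI (all values here are placeholders; substitute your own identity provider and project details):
# Placeholder values - replace with your own IdP and CDF project details
export TRANSFORMATIONS_CLIENT_ID="my-client-id"
export TRANSFORMATIONS_CLIENT_SECRET="my-client-secret"
export TRANSFORMATIONS_TOKEN_URL="https://login.microsoftonline.com/<my-tenant-id>/oauth2/v2.0/token"
export TRANSFORMATIONS_PROJECT="my-cdf-project"
export TRANSFORMATIONS_SCOPES="https://<my-cluster>.cognitedata.com/.default"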
Usage
By default, transformations-cli runs against the main CDF cluster (europe-west1-1). To use a different cluster, specify the --cluster parameter or set the environment variable TRANSFORMATIONS_CLUSTER. Note that this is a global parameter, which must be specified before the subcommand. For example:
transformations-cli --cluster=greenfield <subcommand> [...args]
Commands
| Command | Args | Options | Description |
|---|---|---|---|
| list | | --limit, --interactive | List transformations |
| show | | --external-id, --id, --job-id | Show a transformation/job |
| jobs | | --external-id, --id, --limit, --interactive | List jobs |
| delete | | --external-id, --id, --delete-schedule | Delete a transformation |
| query | query | --source-limit, --infer-schema-limit, --limit | Run a query |
| run | | | Run a transformation |
| deploy | path | --debug | Deploy transformations |
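For illustration, a few typical invocations could look like the ones below (the external ID is a made-up example and the query is just a sample; check the documentation for the full set of arguments of each subcommand):
# List transformations, showing at most 10
transformations-cli list --limit 10

# Show a single transformation by its external ID
transformations-cli show --external-id=my_transformation_001

# List the jobs of that transformation
transformations-cli jobs --external-id=my_transformation_001

# Run an ad-hoc query
transformations-cli query "select id, name from _cdf.assets limit 100"

# Deploy every manifest found under the transformations folder
transformations-cli deploy transformations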
Check out the detailed documentation for more information on each command.
Transformations CLI for CI/CD
The Transformations CLI provides a GitHub Action to deploy transformations. You can find the documentation here. We've also created a CI/CD template that uses GitHub Workflows to run the transformations-cli and to deploy transformations to CDF on merges to master. If you want to use it with multiple CDF projects, e.g. customer-dev and customer-prod, you can clone the deploy-push-master.yml file and modify it for merges to a specific branch of your choice.
Transformations CLI replaces the Jetfire CLI. If you've already used the Jetfire CLI in a GitHub Action, we recommend migrating to the Transformations CLI GitHub Action. You'll find the migration guide here.
Repository Layout
In the top-level transformations folder, you add each transformation as a new directory, for example:
.
├── .github
│   └── workflows
│       ├── deploy-push-master.yml
│       └── check-pr.yml
├── README.md
└── transformations
    ├── my_transformation_001
    │   ├── manifest.yml
    │   └── transformation.sql
    └── my_transformation_002
        ├── manifest.yml
        └── transformation.sql
However, you can change this layout however you see fit, as long as you follow one rule: keep one transformation (manifest + SQL) per folder.
Let’s get started
1. Use the CI/CD template from here and create a repository.
2. In the newly created repository, create a branch from “main”.
3. There are two different authentication flows:
   - Using API-keys (old)
   - Using OIDC (new)

   Thus, this repository contains two similar (but different) workflow files, one for each flow:
   - API-key flow: .github/workflows/deploy-push-api-key-master.yml
   - OIDC flow: .github/workflows/deploy-push-oidc-master.yml

   You should choose one of these. A combination is possible, either by using different auth flows for different branches (please don't) or by using an API-key for transformations at runtime but the OIDC flow for deploying them, or the other way around (also, please don't).
4. Declare environment variables as secrets under the settings of the repository.
API-key flow
In order to connect to CDF, the transformations-cli needs an API-key to be able to deploy the transformations (this is separate from the key used at runtime). In this template, the workflow reads it automatically from your GitHub secrets. Thus, surprise surprise, you need to store the API-key as a GitHub secret in your own repo. However, there is one catch! To distinguish between the API-keys meant for e.g. testing and production environments, we append the name of the branch responsible for deployment to the end of the secret name, as follows: TRANSFORMATIONS_API_KEY_{BRANCH}.
Let's check out an example. On merges to 'master', you want to deploy to customer-dev, so you use the API-key for this project and store it as a GitHub secret with the name:
# Assuming you have one 'master' branch you use for deployments,
# the secrets you need to store are:
TRANSFORMATIONS_API_KEY_${BRANCH} -> TRANSFORMATIONS_API_KEY_MASTER
COGNITE_API_KEY_${BRANCH} -> COGNITE_API_KEY_MASTER
Similarly, if you have a customer-prod project, and you have created a workflow that only runs on your branch named prod, you would need to store the API-key to this project under the GitHub secret: TRANSFORMATIONS_API_KEY_PROD (and similarly for the runtime key). You can of course repeat this for as many projects as you want!
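In the workflow file, the branch-specific secrets are then passed to the deploy step. A rough sketch of what that looks like is shown below; the actual deploy-push-api-key-master.yml in the template is the authoritative version:
# Sketch only - see .github/workflows/deploy-push-api-key-master.yml in the template
- name: Deploy transformations
  uses: cognitedata/transformations-cli@main
  env:
    # Runtime key, referenced in your manifests:
    COGNITE_API_KEY: ${{ secrets.COGNITE_API_KEY_MASTER }}
  with:
    path: transformations
    # Deployment key:
    api-key: ${{ secrets.TRANSFORMATIONS_API_KEY_MASTER }}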
OIDC flow
In essence, the OIDC flow is very similar, except we use a pre-shared client secret instead of an API-key. We expect you to store the client secrets as secrets in your GitHub repository, so that the workflows can read them automatically, sweeeeet. The expected setup is as follows:
# Assuming you have one 'master' branch you use for deployments,
# the secrets you need to store are:
TRANSFORMATIONS_CLIENT_SECRET_${BRANCH} -> TRANSFORMATIONS_CLIENT_SECRET_MASTER
COGNITE_CLIENT_SECRET_${BRANCH} -> COGNITE_CLIENT_SECRET_MASTER
# If you need separation of e.g. dev/test/prod, you can then use
# another branch named 'prod' (or test or dev...):
TRANSFORMATIONS_CLIENT_SECRET_${BRANCH} -> TRANSFORMATIONS_CLIENT_SECRET_PROD
COGNITE_CLIENT_SECRET_${BRANCH} -> COGNITE_CLIENT_SECRET_PROD
However, the OIDC flow needs a few other bits of information to work: the client ID, token URL, project name and scopes. In the workflow file, you must change these in accordance with your customer's setup. You can also store the values of these variables as secrets. For example, you can store the name of the project you want to deploy the transformations to as a secret and use it in the workflow file as shown below.
CDF_PROJECT_NAME_${BRANCH} -> CDF_PROJECT_NAME_MASTER
The same applies to the credentials used at runtime (as opposed to the ones used for deploying transformations to CDF), except that these are specified in each manifest file, pointing to specific environment variables. You must specify these environment variables in the workflow file. Let's check out a full example:
###########################
# From the workflow file: #
###########################
- name: Deploy transformations
  uses: cognitedata/transformations-cli@main
  env:
    # Credentials to be used when running your transformations,
    # as referenced in your manifests:
    COGNITE_CLIENT_ID: my-cognite-client-id
    COGNITE_CLIENT_SECRET: ${{ secrets[steps.extract_secrets.outputs.transformation_client_secret] }}
  with:
    # Credentials used for deployment
    path: transformations # Transformation manifest folder, relative to GITHUB_WORKSPACE
    client-id: my-jetfire-client-id
    client-secret: ${{ secrets.transformation_client_secret_master }}
    token-url: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
    cdf-project-name: ${{ secrets.cdf_project_name_master }}
    # If you need to provide multiple scopes, the format is: "scope1 scope2 scope3"
    scopes: https://<my-cluster>.cognitedata.com/.default
    # audience: "" # Optional
#####################
# In all manifests: #
#####################
authentication:
  # The following are explicit values, not environment variables:
  tokenUrl: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
  scopes:
    - https://<my-cluster>.cognitedata.com/.default
  cdfProjectName: <my-project-name>
  # The following are given as the name of an environment variable:
  clientId: ${COGNITE_CLIENT_ID}
  clientSecret: ${COGNITE_CLIENT_SECRET}
The one thing to take away from this example is that the manifest variables, like clientId, point to the corresponding environment variables specified under env in the workflow file. Feel free to name these whatever you want.
6. Changes to manifest files
The manifest file is a yaml-file that describes the transformation: its name, schedule, external ID, which CDF resource type it modifies, and in what way (e.g. create vs. upsert)!
The required fields are:
- externalId
- name
- query # Relative path to the file containing the SQL query
- destination # One of: assets, asset_hierarchy, events, timeseries, datapoints, string_datapoints
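Taken together, the required fields of a manifest could look like this (the external ID and name are placeholders, and note that an authentication block is also required, as described below):
externalId: my_transformation_001
name: My first transformation
query: transformation.sql # Relative path to the file containing the SQL query
destination: assets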
Note on writing to raw
When writing to RAW tables, you also need to specify type, rawDatabase and rawTable like this in the yaml-file:
destination:
  type: raw
  rawDatabase: someDatabase
  rawTable: someTable
Required field for auth
An authentication block must always be provided.
- To use API-keys, the API-key to be used in the transformation must be provided with the following syntax:
authentication:
  apiKey: ${API_KEY} # Env var as referenced in deploy step
If you want to use separate API-keys for read/write, the following syntax is needed:
authentication:
  read:
    apiKey: ${READ_API_KEY} # Env var as referenced in deploy step
  write:
    apiKey: ${WRITE_API_KEY} # Env var as referenced in deploy step
- To use the OIDC auth flow for read/write, the client credentials to be used in the transformation must be provided with the following syntax (replace scopes with audience if you are using Auth0):
authentication:
  tokenUrl: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
  scopes:
    - https://<my-cluster>.cognitedata.com/.default
  cdfProjectName: <my-project-name>
  clientId: ${COGNITE_CLIENT_ID} # Env var as referenced in deploy step
  clientSecret: ${COGNITE_CLIENT_SECRET} # Env var as referenced in deploy step
If you want to use separate OIDC credentials for read/write, the following syntax is needed:
authentication:
  read:
    tokenUrl: https://login.microsoftonline.com/<read-azure-tenant-id>/oauth2/v2.0/token
    scopes:
      - https://<my-cluster>.cognitedata.com/.default
    cdfProjectName: <read-project-name>
    clientId: ${COGNITE_CLIENT_ID} # Env var as referenced in deploy step
    clientSecret: ${COGNITE_CLIENT_SECRET} # Env var as referenced in deploy step
  write:
    tokenUrl: https://login.microsoftonline.com/<write-azure-tenant-id>/oauth2/v2.0/token
    scopes:
      - https://<my-cluster>.cognitedata.com/.default
    cdfProjectName: <write-project-name>
    clientId: ${COGNITE_CLIENT_ID_WRITE} # Env var as referenced in deploy step
    clientSecret: ${COGNITE_CLIENT_SECRET_WRITE} # Env var as referenced in deploy step
The optional fields are:
- shared # Default: true
- schedule # Default: null (no scheduling!)
- action # Default: upsert
- notifications # Default: null
- ignoreNullFields # Default: true
Valid values for action:
- upsert: Create new items, or update existing items if their id or externalId already exists.
- create: Create new items. The transformation will fail if there are id or externalId conflicts.
- update: Update existing items. The transformation will fail if an id or externalId does not exist.
- delete: Delete items by internal id.
Schedule
Gotcha: Since CRON expressions typically use a lot of asterisks (this fellow: *), you have to wrap your schedule in single quotes (') if it starts with one!
schedule: */2 * * * * # Not valid
schedule: 0/2 * * * * # Valid, but non-standard CRON
schedule: '*/2 * * * *' # Valid
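Putting it all together, a complete manifest for a scheduled transformation writing to assets could look like the sketch below (all IDs, URLs and names are placeholders):
externalId: my_transformation_001
name: My first transformation
query: transformation.sql # Relative path to the file containing the SQL query
destination: assets
shared: true
schedule: '*/30 * * * *' # Quoted because it starts with an asterisk
action: upsert
ignoreNullFields: true
authentication:
  tokenUrl: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
  scopes:
    - https://<my-cluster>.cognitedata.com/.default
  cdfProjectName: <my-project-name>
  clientId: ${COGNITE_CLIENT_ID} # Env var as referenced in deploy step
  clientSecret: ${COGNITE_CLIENT_SECRET} # Env var as referenced in deploy step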
Pre-commit
Although this repository has nothing to do with Python, it contains a configuration file for a very neat pip-installable package, pre-commit. It installs git hooks that run automatically when you try to commit changes. The repository also has a workflow that runs on all PRs and uses yamllint to verify the .yaml files. The exact command is yamllint transformations/, to avoid checking other yaml files that might be present in the repo.
What these hooks do is run a yaml formatter and then yamllint to verify the result. These tools are only meant for your convenience and can be used by running the following from the root of the repository:
# Assuming you have python available
pip install pre-commit
pre-commit install
# You can also run all files:
pre-commit run --all-files
This is only needed the first time; for all future commits in this repo, the hooks will run automatically!
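For reference, a pre-commit configuration of this kind typically looks something like the sketch below; the .pre-commit-config.yaml in the template repository is the authoritative version, and the specific hook repositories and revisions here are only examples:
# Sketch of a typical .pre-commit-config.yaml - the file in the template repo is authoritative
repos:
  - repo: https://github.com/jumanjihouse/pre-commit-hook-yamlfmt
    rev: 0.2.3 # example revision of a yaml formatter hook
    hooks:
      - id: yamlfmt
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.26.3 # example revision
    hooks:
      - id: yamllint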