Transformations CLI

12 January 2022

Emel Varol

Hi! Welcome back to the Transformations group on Cognite Hub. By now you have hopefully tried out the Transformations SDK and API (and I hope you liked them and did not miss the chance to share your experience or feedback with us). In this post, I will introduce you to the Transformations CLI. Before that, a bit about me and why I am writing this article: I’m Emel Varol, Software Engineer in the Data Integrations team. I contributed to the design and development of the new Transformations CLI, and I can’t wait to introduce it to you. Let’s get started!

Please note: The new Transformations CLI replaces the old Jetfire CLI.

This article has two parts. The first part focuses on how to use the Transformations CLI on your local system or virtual machine for developing transformations. The second part focuses on using the CLI in version control systems like GitHub for CI/CD.

The following parts aim to summarise the documentation for the CLI and the GitHub Action. Please refer to the full documentation here for more details!

Transformations CLI for development

Prerequisites

Python 3 and pip installed on your local machine or VM
OIDC client credentials or API Key to use for authentication

Installation

pip install cognite-transformations-cli

Authentication

Declare/set the environment variables below, depending on the authentication method you use:

API Key - TRANSFORMATIONS_API_KEY and (optional) TRANSFORMATIONS_PROJECT
OIDC client credentials - TRANSFORMATIONS_CLIENT_ID, TRANSFORMATIONS_CLIENT_SECRET, TRANSFORMATIONS_TOKEN_URL, TRANSFORMATIONS_PROJECT and TRANSFORMATIONS_SCOPES. Use TRANSFORMATIONS_AUDIENCE instead of TRANSFORMATIONS_SCOPES for Auth0.
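If you want a quick pre-flight check that the variables listed above are actually set before you call the CLI, something like the following could work. This is only a sketch and not part of transformations-cli: the helper names has_api_key and missing_oidc_vars are made up for this example, and only the variable names come from the list above.

```python
import os

# Variable names are taken from the list above; the helpers are hypothetical.
OIDC_VARS = [
    "TRANSFORMATIONS_CLIENT_ID",
    "TRANSFORMATIONS_CLIENT_SECRET",
    "TRANSFORMATIONS_TOKEN_URL",
    "TRANSFORMATIONS_PROJECT",
]


def has_api_key(env):
    """True if the API-key variable is set and non-empty."""
    return bool(env.get("TRANSFORMATIONS_API_KEY"))


def missing_oidc_vars(env):
    """Return the OIDC variables that are unset or empty in `env`."""
    missing = [name for name in OIDC_VARS if not env.get(name)]
    # Either scopes or audience (for Auth0) has to be present.
    if not (env.get("TRANSFORMATIONS_SCOPES") or env.get("TRANSFORMATIONS_AUDIENCE")):
        missing.append("TRANSFORMATIONS_SCOPES or TRANSFORMATIONS_AUDIENCE")
    return missing


if __name__ == "__main__":
    if has_api_key(os.environ):
        print("API-key variable is set")
    elif not missing_oidc_vars(os.environ):
        print("OIDC variables are set")
    else:
        print("Missing for OIDC:", ", ".join(missing_oidc_vars(os.environ)))
```

Both helpers take a plain mapping, so you can pass os.environ or a test dictionary.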

Usage

By default, transformations-cli runs against the main CDF cluster (europe-west1-1). To use a different cluster, specify the --cluster parameter or set the environment variable TRANSFORMATIONS_CLUSTER. Note that this is a global parameter, which must be specified before the subcommand. For example:

transformations-cli --cluster=greenfield <subcommand> [...args]
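If you call the CLI from a script, the ordering rule above (global option first, then the subcommand) is easy to get wrong. Here is a small sketch of a wrapper that builds the argument list in the right order. The function build_command is a hypothetical helper, and the --limit value of 10 is just an arbitrary example.

```python
def build_command(subcommand, args=(), cluster=None):
    """Build the argv list, placing the global --cluster option before the subcommand."""
    argv = ["transformations-cli"]
    if cluster is not None:
        # Global parameters must come before the subcommand.
        argv.append(f"--cluster={cluster}")
    argv.append(subcommand)
    argv.extend(args)
    return argv


if __name__ == "__main__":
    cmd = build_command("list", ["--limit", "10"], cluster="greenfield")
    print(" ".join(cmd))
    # To actually run it (requires the CLI and credentials to be set up):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Leaving cluster as None produces a command without the flag, so the default cluster (or TRANSFORMATIONS_CLUSTER, if set) applies.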

Commands

Command	Args	Options	Description
list		--limit, --interactive	List transformations
show		--external-id, --id, --job-id	Show a transformation/job
jobs		--external-id, --id, --limit, --interactive	List jobs
delete		--external-id, --id, --delete-schedule	Delete a transformation
query	query	--source-limit, --infer-schema-limit, --limit	Run a query
run		--external-id, --id, --watch, --watch-only, --time-out	Run a Transformation
deploy	path	--debug	Deploy transformations

Do check our detailed documentation for more details.

Transformations CLI for CI/CD

The Transformations CLI provides a GitHub Action to deploy transformations. You can find the documentation here. We've also created a CI/CD template that uses GitHub Workflows to run the transformations-cli and to deploy transformations to CDF on merges to master. If you want to use it with multiple CDF projects, e.g. customer-dev and customer-prod, you can clone the deploy-push-master.yml file and modify it for merges to a specific branch of your choice.

Transformations CLI replaces the Jetfire CLI. If you've already used the Jetfire CLI in a GitHub Action, we recommend migrating to the Transformations CLI GitHub Action. You'll find the migration guide here.

Repository Layout

In the top-level folder transformations you may add each transformation job as a new directory, for example:

.
├── .github
│   └── workflows
│       ├── deploy-push-master.yml
│       └── check-pr.yml
├── README.md
└── transformations
    ├── my_transformation_001
    │   ├── manifest.yml
    │   └── transformation.sql
    └── my_transformation_002
        ├── manifest.yml
        └── transformation.sql

That said, you can change this layout however you see fit, as long as you obey the following rule: keep one transformation (manifest + SQL) per folder.
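The one-transformation-per-folder rule can be checked automatically, for instance in a PR workflow. The sketch below is not part of the template: check_layout is a hypothetical helper, and it assumes the manifest is always called manifest.yml and that there is exactly one .sql file per folder, as in the layout shown above.

```python
from pathlib import Path


def check_layout(root):
    """Return a list of problems; an empty list means every folder looks fine."""
    problems = []
    folders = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for folder in folders:
        manifests = list(folder.glob("manifest.yml"))
        sql_files = list(folder.glob("*.sql"))
        if len(manifests) != 1:
            problems.append(
                f"{folder.name}: expected 1 manifest.yml, found {len(manifests)}"
            )
        if len(sql_files) != 1:
            problems.append(
                f"{folder.name}: expected 1 .sql file, found {len(sql_files)}"
            )
    return problems


if __name__ == "__main__":
    if Path("transformations").is_dir():
        for problem in check_layout("transformations"):
            print(problem)
```

If you rename the manifest or keep several SQL files per transformation, adjust the two glob patterns accordingly.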

Let’s get started

1. Use the CI/CD template from here and create a repository.
2. In the newly created repository, create a branch from “main”.
3. There are two different authentication flows:
   - Using API-keys (old)
   - Using OIDC (new)
   Thus, this repository contains two similar (but different) workflow files, one for each flow:
   - API-key flow: .github/workflows/deploy-push-api-key-master.yml
   - OIDC flow: .github/workflows/deploy-push-oidc-master.yml

You should choose one of these. A combination is possible, either by having a different auth flow for different branches (please don't), or by using an API-key for transformations at runtime but the OIDC flow for the deployment of transformations (or the other way around; also, please don't).

4. Declare environment variables in secrets under the settings of the repository

API-key flow

In order to connect to CDF, we need the API-key for the transformations-cli to be able to deploy the transformations (this is separate from the key used at runtime). In this template, it will be read automatically by the workflow, by reading it from your GitHub secrets. Thus, surprise surprise, you need to store the API-key in GitHub secrets in your own repo. However, there is one catch! To distinguish between the API-key meant for e.g. testing and production environments, we control this by appending the branch name responsible for deployment to the end of the secret name as follows: TRANSFORMATIONS_API_KEY_{BRANCH}.

Let's check out an example. On merges to 'master', you want to deploy to customer-dev, so you use the API-key for this project and store it as a GitHub secret with the name:

# Assuming you have one 'master' branch you use for deployments,
# the secrets you need to store are:
TRANSFORMATIONS_API_KEY_${BRANCH} -> TRANSFORMATIONS_API_KEY_MASTER
COGNITE_API_KEY_${BRANCH} -> COGNITE_API_KEY_MASTER

Similarly, if you have a customer-prod project, and you have created a workflow that only runs on your branch named prod, you would need to store the API-key to this project under the GitHub secret: TRANSFORMATIONS_API_KEY_PROD (and similarly for the runtime key). You can of course repeat this for as many projects as you want!
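To make the naming convention concrete, here is a tiny sketch of how a secret name is derived from a branch name. The function secret_name is hypothetical (the template's workflow does this in its own way); it only illustrates the pattern described above, and the same pattern applies to the runtime key.

```python
def secret_name(prefix, branch):
    """Append the upper-cased branch name to a secret prefix."""
    return f"{prefix}_{branch.upper()}"


if __name__ == "__main__":
    for branch in ("master", "prod"):
        print(secret_name("TRANSFORMATIONS_API_KEY", branch))
        print(secret_name("COGNITE_API_KEY", branch))
```

Note that this sketch does not handle branch names containing characters such as "-" or "/", which GitHub does not allow in secret names.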

OIDC flow

In essence, the OIDC flow is very similar, except we use a pre-shared client secret instead of an API-key. We expect you to store the client secrets as secrets in your GitHub repository. This way we can automatically read them in the workflows, sweeeeet. The expected setup is as follows:

# Assuming you have one 'master' branch you use for deployments,
# the secrets you need to store are:
TRANSFORMATIONS_CLIENT_SECRET_${BRANCH} -> TRANSFORMATIONS_CLIENT_SECRET_MASTER
COGNITE_CLIENT_SECRET_${BRANCH} -> COGNITE_CLIENT_SECRET_MASTER

# If you need separation of e.g. dev/test/prod, you can then use
# another branch named 'prod' (or test or dev...):
TRANSFORMATIONS_CLIENT_SECRET_${BRANCH} -> TRANSFORMATIONS_CLIENT_SECRET_PROD
COGNITE_CLIENT_SECRET_${BRANCH} -> COGNITE_CLIENT_SECRET_PROD

However, the OIDC flow needs a few other bits of information to work: the client ID, token URL, project name and scopes. In the workflow file, you must change these in accordance with your customer's setup. You can also store the values of these variables in secrets. For example, you can store the name of the project you want to deploy the transformations to as a secret and use it in the workflow file, as shown below.

CDF_PROJECT_NAME_${BRANCH} -> CDF_PROJECT_NAME_MASTER

The same goes for the credentials that are going to be used at runtime (as opposed to deployment of transformations to CDF), except that these are specified in each manifest file, pointing to specific environment variables. You must specify these environment variables in the workflow file. Let's check out a full example:

###########################
# From the workflow file: #
###########################
- name: Deploy transformations
  uses: cognitedata/transformations-cli@main
  env:
      # Credentials to be used when running your transformations,
      # as referenced in your manifests:
      COGNITE_CLIENT_ID: my-cognite-client-id
      COGNITE_CLIENT_SECRET: ${{ secrets[steps.extract_secrets.outputs.transformation_client_secret] }}
  with:
      # Credentials used for deployment
      path: transformations  # Transformation manifest folder, relative to GITHUB_WORKSPACE
      client-id: my-jetfire-client-id
      client-secret: ${{ secrets.transformations_client_secret_master }}
      token-url: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
      cdf-project-name: ${{ secrets.cdf_project_name_master }}  # Stored as CDF_PROJECT_NAME_MASTER
      # If you need to provide multiple scopes, the format: "scope1 scope2 scope3"
      scopes: https://<my-cluster>.cognitedata.com/.default
      # audience: "" # Optional


#####################
# In all manifests: #
#####################
authentication:
  # The following are explicit values, not environment variables
  tokenUrl: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
  scopes:
      - https://<my-cluster>.cognitedata.com/.default
  cdfProjectName: <my-project-name>
  # The following are given as the name of an environment variable:
  clientId: ${COGNITE_CLIENT_ID}
  clientSecret: ${COGNITE_CLIENT_SECRET}

The one thing to take away from this example is that the manifest variables, like clientId, point to the corresponding environment variables specified under env in the workflow file. Feel free to name these whatever you want.
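To illustrate the idea of manifest values pointing to environment variables, here is a rough sketch that finds the ${VAR} placeholders in a manifest and fills them in from a mapping. This is not how transformations-cli is implemented internally (I am only showing the concept), and the names referenced_variables and resolve are made up for this example.

```python
import re

# Matches placeholders of the form ${SOME_NAME}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def referenced_variables(manifest_text):
    """Names of all ${VAR} placeholders, in order of first appearance."""
    seen = []
    for name in PLACEHOLDER.findall(manifest_text):
        if name not in seen:
            seen.append(name)
    return seen


def resolve(manifest_text, env):
    """Replace ${VAR} placeholders with values from env; raise if any are missing."""
    missing = [n for n in referenced_variables(manifest_text) if n not in env]
    if missing:
        raise KeyError("environment variables not set: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda match: env[match.group(1)], manifest_text)


if __name__ == "__main__":
    text = "clientId: ${COGNITE_CLIENT_ID}\nclientSecret: ${COGNITE_CLIENT_SECRET}\n"
    print(referenced_variables(text))
```

Running referenced_variables over your manifests before deploying is a cheap way to spot a variable you forgot to list under env in the workflow file.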

6. Changes to manifest files

The manifest file is a YAML file that describes the transformation: its name, schedule, external ID, which CDF resource type it modifies, and in what way (for example, create vs. upsert).

The required fields are:

- externalId
- name
- query        # Relative path to the file containing the SQL query
- destination  # One of: assets, asset_hierarchy, events, timeseries, datapoints, string_datapoints

Note on writing to raw

When writing to RAW tables, you also need to specify type, rawDatabase and rawTable like this in the yaml-file:

destination:
  type: raw
  rawDatabase: someDatabase
  rawTable: someTable
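The required fields and the extra RAW fields described above can be checked with a few lines of code once the manifest has been parsed. The sketch below works on a plain dictionary, so parsing is left out (a YAML library such as PyYAML could produce the dictionary). The helper manifest_problems is hypothetical and only covers the fields mentioned so far.

```python
REQUIRED_FIELDS = ("externalId", "name", "query", "destination")
RAW_FIELDS = ("type", "rawDatabase", "rawTable")


def manifest_problems(manifest):
    """Return human-readable problems found in an already-parsed manifest dict."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if field not in manifest
    ]
    destination = manifest.get("destination")
    # A RAW destination is written as a mapping rather than a plain string.
    if isinstance(destination, dict) and destination.get("type") == "raw":
        problems += [
            f"raw destination is missing: {field}"
            for field in RAW_FIELDS
            if field not in destination
        ]
    return problems


if __name__ == "__main__":
    example = {"externalId": "t1", "name": "T1", "destination": {"type": "raw"}}
    for problem in manifest_problems(example):
        print(problem)
```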

Required field for auth

authentication must be provided.

To use API-keys, the API-key to be used in the transformation must be provided with the following syntax:

authentication:
  apiKey: ${API_KEY} # Env var as referenced in deploy step

If you want to use separate API-keys for read/write, the following syntax is needed:

authentication:
  read:
    apiKey: ${READ_API_KEY} # Env var as referenced in deploy step
  write:
    apiKey: ${WRITE_API_KEY} # Env var as referenced in deploy step

To use the OIDC auth flow for read/write, the client credentials to be used in the transformation must be provided with the following syntax (replace scopes with audience if you are using Auth0):

authentication:
  tokenUrl: https://login.microsoftonline.com/<my-azure-tenant-id>/oauth2/v2.0/token
  scopes:
      - https://<my-cluster>.cognitedata.com/.default
  cdfProjectName: <my-project-name>
  clientId: ${COGNITE_CLIENT_ID}          # Env var as referenced in deploy step
  clientSecret: ${COGNITE_CLIENT_SECRET}  # Env var as referenced in deploy step

If you want to use separate OIDC credentials for read/write, the following syntax is needed:

authentication:
  read:
    tokenUrl: https://login.microsoftonline.com/<read-azure-tenant-id>/oauth2/v2.0/token
    scopes:
        - https://<my-cluster>.cognitedata.com/.default
    cdfProjectName: <read-project-name>
    clientId: ${COGNITE_CLIENT_ID}          # Env var as referenced in deploy step
    clientSecret: ${COGNITE_CLIENT_SECRET}  # Env var as referenced in deploy step
  write:
    tokenUrl: https://login.microsoftonline.com/<write-azure-tenant-id>/oauth2/v2.0/token
    scopes:
        - https://<my-cluster>.cognitedata.com/.default
    cdfProjectName: <write-project-name>
    clientId: ${COGNITE_CLIENT_ID_WRITE}          # Env var as referenced in deploy step
    clientSecret: ${COGNITE_CLIENT_SECRET_WRITE}  # Env var as referenced in deploy step

The optional fields are:

- shared           # Default: true
- schedule         # Default: null (no scheduling!)
- action           # Default: upsert
- notifications    # Default: null
- ignoreNullFields # Default: true
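In other words, a manifest that leaves out the optional fields behaves as if the defaults above had been written out. A minimal sketch of that idea, working on an already-parsed manifest dictionary (with_defaults is a hypothetical helper, and null is represented as None):

```python
# Defaults as listed above; null in YAML corresponds to None in Python.
OPTIONAL_DEFAULTS = {
    "shared": True,
    "schedule": None,
    "action": "upsert",
    "notifications": None,
    "ignoreNullFields": True,
}


def with_defaults(manifest):
    """Return a copy of the manifest dict with the listed defaults filled in."""
    # Values given in the manifest win over the defaults.
    return {**OPTIONAL_DEFAULTS, **manifest}


if __name__ == "__main__":
    print(with_defaults({"externalId": "t1", "action": "create"}))
```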

Valid values for action:

upsert: Create new items, or update existing items if their id or externalId already exists.
create: Create new items. The transformation will fail if there are id or externalId conflicts.
update: Update existing items. The transformation will fail if an id or externalId does not exist.
delete: Delete items by internal id.
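If the difference between upsert, create and update is not obvious, the toy model below may help. It is purely illustrative and has nothing to do with how CDF stores data: the store is an in-memory dictionary keyed by externalId, apply_action is a made-up function, and delete is left out because it works on internal ids.

```python
def apply_action(store, items, action):
    """Apply items (dicts with an 'externalId') to store (a dict keyed by externalId)."""
    if action not in ("upsert", "create", "update"):
        raise ValueError(f"unsupported action in this toy model: {action}")
    for item in items:
        key = item["externalId"]
        exists = key in store
        if action == "create" and exists:
            raise ValueError(f"create: {key} already exists")
        if action == "update" and not exists:
            raise ValueError(f"update: {key} does not exist")
        # Merge new values over whatever is already stored for this key.
        store[key] = {**store.get(key, {}), **item}
    return store


if __name__ == "__main__":
    store = {"a": {"externalId": "a", "name": "old"}}
    apply_action(store, [{"externalId": "a", "name": "new"}], "upsert")
    print(store)
```

With this model, create fails on the existing item "a", update fails on an unknown item, and upsert accepts both.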

Schedule

Gotcha: CRON expressions typically use a lot of asterisks (this fellow: *), and YAML treats a leading asterisk as special syntax (an alias), so you have to wrap your schedule in single quotes (') if it starts with one!

schedule: */2 * * * *    # Not valid
schedule: 0/2 * * * *    # Valid, but non-standard CRON
schedule: '*/2 * * * *'  # Valid
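If you generate manifests from a script, the quoting rule above is easy to automate. A small sketch (schedule_line is a hypothetical helper, not something the CLI provides):

```python
def schedule_line(cron):
    """Format a manifest schedule line, single-quoting expressions that start with '*'."""
    if cron.lstrip().startswith("*"):
        return f"schedule: '{cron}'"
    return f"schedule: {cron}"


if __name__ == "__main__":
    print(schedule_line("*/2 * * * *"))
    print(schedule_line("0/2 * * * *"))
```

Always quoting the expression would also be fine; the sketch only quotes when it is needed, to mirror the examples above.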

Pre-commit

Although this repository has nothing to do with Python, it contains a configuration file for a very neat pip-installable package, pre-commit. What it does is install some git-hooks that run automatically when you try to commit changes. The repository also has a workflow that runs on all PRs that uses yamllint to verify the .yaml-files. The exact command is: yamllint transformations/, to avoid checking other yaml-files that might be present in the repo.

What these hooks do is run a YAML formatter, then yamllint to verify. These tools are only meant for your convenience and can be used by running the following from the root of the repository:

# Assuming you have python available
pip install pre-commit
pre-commit install

# You can also run all files:
pre-commit run --all-files

Installing is only needed the first time; for all future commits from this repo, the hooks will run automatically!
