Solved

Need dump yaml for instances deployment by CDF Toolkit



I want to upload instances using CDF Toolkit. For a data model, for example, I can get a dump YAML from an existing data model using the toolkit, like the following:

cdf dump datamodel

I want to get the dump YAML of instances from my existing space and data model, which I can later deploy with the toolkit to a different space or environment.

 

I checked the doc here: https://docs.cognite.com/cdf/deploy/cdf_toolkit/references/resource_library#nodes

But creating the YAML files manually would be too hard. I have also tried to get a dump YAML using the Python SDK, but that YAML looks incomplete for deployment.

Best answer by Anders Albert

Currently, Toolkit does not support dumping of instances, but I will note it as a feature request.

The workaround would be to use the PySDK to create the YAML files. Then you can do something like this:

from pathlib import Path

retrieved = client.data_modeling.instances.retrieve(nodes=my_node_ids, edges=my_edge_ids)
Path("my_nodes.Node.yaml").write_text(retrieved.nodes.as_write().dump_yaml(), encoding="utf-8")
Path("my_edges.Edge.yaml").write_text(retrieved.edges.as_write().dump_yaml(), encoding="utf-8")

Notice the `.as_write()` method: it converts the nodes from the response/read format to the request/write format that Toolkit needs.


11 replies

  • Author
  • Committed
  • 17 replies
  • January 30, 2025

@Anders Albert - We were able to deploy the data model using the approach you suggested last time. Would you be able to help in this case as well? Also, please let me know the ideal approach if I want to deploy time series using the toolkit as well.


Anders Albert
  • Seasoned Practitioner
  • 108 replies
  • Answer
  • January 31, 2025

Currently, Toolkit does not support dumping of instances, but I will note it as a feature request.

The workaround would be to use the PySDK to create the YAML files. Then you can do something like this:

from pathlib import Path

retrieved = client.data_modeling.instances.retrieve(nodes=my_node_ids, edges=my_edge_ids)
Path("my_nodes.Node.yaml").write_text(retrieved.nodes.as_write().dump_yaml(), encoding="utf-8")
Path("my_edges.Edge.yaml").write_text(retrieved.edges.as_write().dump_yaml(), encoding="utf-8")

Notice the `.as_write()` method: it converts the nodes from the response/read format to the request/write format that Toolkit needs.
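For deployment, such `.Node.yaml`/`.Edge.yaml` files would typically be placed in a Toolkit module's `data_models` folder, as covered by the resource library documentation linked above.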


  • Author
  • Committed
  • 17 replies
  • February 3, 2025
Anders Albert wrote:

Currently, Toolkit does not support dumping of instances, but I will note it as a feature request.

The workaround would be to use the PySDK to create the YAML files. Then you can do something like this:

from pathlib import Path

retrieved = client.data_modeling.instances.retrieve(nodes=my_node_ids, edges=my_edge_ids)
Path("my_nodes.Node.yaml").write_text(retrieved.nodes.as_write().dump_yaml(), encoding="utf-8")
Path("my_edges.Edge.yaml").write_text(retrieved.edges.as_write().dump_yaml(), encoding="utf-8")

Notice the `.as_write()` method: it converts the nodes from the response/read format to the request/write format that Toolkit needs.

Thanks Anders, this really helps. Also, we have lots of instances, so what would be the ideal approach to deploy them if we want to use the toolkit in our CI/CD pipeline, in terms of effectiveness and performance?


Anders Albert
  • Seasoned Practitioner
  • 108 replies
  • February 3, 2025

Typically, you would populate a data model either

  1. Through an extractor, or
  2. From RAW, using transformations.

It is possible to use Toolkit, but it will not necessarily be performant. Are you thinking of checking the YAML with the nodes into version control and using Toolkit to match the version-controlled YAML with your CDF project? Note that Toolkit is intended for governing resources, not ingesting data, although it is possible.
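For illustration, a minimal sketch of the RAW route using the Python SDK; the database, table, and column names here are hypothetical, and the client is assumed to be an already configured CogniteClient:

import pandas as pd
from cognite.client import CogniteClient

client = CogniteClient()  # assumes authentication is configured, e.g. via environment variables

# Hypothetical staging table; a transformation would then map these rows onto the view's properties.
df = pd.DataFrame([
    {"key": "pump-001", "name": "Pump 001", "capacity": 42.0},
    {"key": "pump-002", "name": "Pump 002", "capacity": 17.5},
]).set_index("key")

# The dataframe index is used as the RAW row keys; ensure_parent=True creates the db/table if missing.
client.raw.rows.insert_dataframe("my_staging_db", "my_pumps", df, ensure_parent=True)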


  • Author
  • Committed
  • 17 replies
  • February 3, 2025
Anders Albert wrote:

Typically, you would populate a data model either

  1. Through an extractor, or
  2. From RAW, using transformations.

It is possible to use Toolkit, but it will not necessarily be performant. Are you thinking of checking the YAML with the nodes into version control and using Toolkit to match the version-controlled YAML with your CDF project? Note that Toolkit is intended for governing resources, not ingesting data, although it is possible.

@Anders Albert

Yes, we are trying to use version control, where we need all these YAML files.

We have views that contain more than 1000 instances, but using the SDK's client.data_modeling.instances.search I can get at most 1000 instances, and I don't get a continuation token to fetch further instances. I am using the search method because it lets me provide a ViewId for a specific view and get all its instances.

We have more than 138K instances. We deploy once, and our users further populate data based on these instances. Since the recommended approach to deploy a data model is to use Toolkit, we are exploring possibilities to deploy our instances using the toolkit as well.

So please help us get more than 1000 instances using a ViewId.

Also, we would love to know your recommendation for our use case.


  • Author
  • Committed
  • 17 replies
  • February 11, 2025

@Anders Albert - Could you please check this thread and give your inputs? Also, please confirm whether deploying instances via the toolkit is a right and suggested approach. We would like more clarity on whether there are any limitations with instance deployment, or any restrictions on attributes or properties.


Anders Albert
  • Seasoned Practitioner
  • 108 replies
  • February 11, 2025

We have views that contain more than 1000 instances, but using the SDK's client.data_modeling.instances.search I can get at most 1000 instances, and I don't get a continuation token to fetch further instances. I am using the search method because it lets me provide a ViewId for a specific view and get all its instances.

Search is for fast lookup and is limited to 1000 instances. client.data_modeling.instances.list will do the pagination for you, and thus, if you pass limit=-1, it will find all instances for a given view.
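For illustration, a minimal sketch of listing all nodes for a view and dumping them in the write format; the space, external ID, and version in the ViewId are hypothetical:

from pathlib import Path

from cognite.client import CogniteClient
from cognite.client.data_classes.data_modeling import ViewId

client = CogniteClient()  # assumes authentication is configured

view_id = ViewId(space="my_space", external_id="MyView", version="v1")  # hypothetical identifier

# limit=-1 makes the SDK paginate until every instance in the view is retrieved.
nodes = client.data_modeling.instances.list(instance_type="node", sources=view_id, limit=-1)
Path("my_nodes.Node.yaml").write_text(nodes.as_write().dump_yaml(), encoding="utf-8")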

I would not use Toolkit for the population, as you will be forced to store the instances as YAML, which will be very verbose. If you are comfortable with Python, I would consider a custom script: store the data in CSV or Parquet, have the script load the CSV/Parquet, and call client.data_modeling.instances.apply(). An alternative could be to use Pygen. That will give you client-side validation in addition, but requires that you generate an SDK for it. A final option would be to upload the data into RAW and write a transformation for each type of view. That I would avoid if you can.
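For illustration, a minimal sketch of such a custom script, assuming a CSV with an externalId column plus one column per view property; the file name, space, and view identifier are hypothetical:

import pandas as pd
from cognite.client import CogniteClient
from cognite.client.data_classes.data_modeling import NodeApply, NodeOrEdgeData, ViewId

client = CogniteClient()  # assumes authentication is configured

view_id = ViewId(space="my_space", external_id="MyView", version="v1")  # hypothetical identifier
df = pd.read_csv("instances.csv")  # hypothetical file with one row per node

# Build one NodeApply per row, mapping the remaining CSV columns to view properties.
nodes = [
    NodeApply(
        space="my_space",
        external_id=str(row["externalId"]),
        sources=[NodeOrEdgeData(source=view_id, properties=row.drop("externalId").to_dict())],
    )
    for _, row in df.iterrows()
]

# apply() upserts the nodes (create or update).
client.data_modeling.instances.apply(nodes=nodes)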


Anders Albert
  • Seasoned Practitioner
  • 108 replies
  • February 11, 2025

Note: I am logging this as a feature request for Toolkit.


Ayush Daruka
  • Seasoned
  • 25 replies
  • February 11, 2025
Anders Albert wrote:

We have views that contain more than 1000 instances, but using the SDK's client.data_modeling.instances.search I can get at most 1000 instances, and I don't get a continuation token to fetch further instances. I am using the search method because it lets me provide a ViewId for a specific view and get all its instances.

Search is for fast lookup and is limited to 1000 instances. client.data_modeling.instances.list will do the pagination for you, and thus, if you pass limit=-1, it will find all instances for a given view.

I would not use Toolkit for the population, as you will be forced to store the instances as YAML, which will be very verbose. If you are comfortable with Python, I would consider a custom script: store the data in CSV or Parquet, have the script load the CSV/Parquet, and call client.data_modeling.instances.apply(). An alternative could be to use Pygen. That will give you client-side validation in addition, but requires that you generate an SDK for it. A final option would be to upload the data into RAW and write a transformation for each type of view. That I would avoid if you can.

 

@Anders Albert We currently populate instances through a CSV file using a custom script, as you mentioned.
We were just looking for a better approach, if any, to populate instances through the toolkit, since we are moving to the toolkit for deployment going forward.
I agree that using YAML would not make sense, since it would be very verbose and large for huge numbers of instances. Regarding the final option, is there any specific reason to avoid transformations?


Anders Albert
  • Seasoned Practitioner
  • 108 replies
  • February 13, 2025

@Khilesh Sahu We have released Toolkit `v0.4.7`, which has alpha support for populating nodes through a view from a CSV or Parquet file.

You enable it in `cdf.toml` with the following:

[alpha_flags]
populate=true

The command is `cdf populate view`. 


  • Author
  • Committed
  • 17 replies
  • February 13, 2025
Anders Albert wrote:

@Khilesh Sahu We have released Toolkit `v0.4.7`, which has alpha support for populating nodes through a view from a CSV or Parquet file.

You enable it in `cdf.toml` with the following:

[alpha_flags]
populate=true

The command is `cdf populate view`.

Is there any documentation available for it where I can see details like the required steps before `cdf populate view` and where to place the CSV files?

We need details like the expected population rate and the maximum number of instances we can populate.

Please share any available documentation and the expected release date, so that we can analyze it before deploying to prod.



