How to use contextualization with the SDK

  • 14 December 2023


Contextualization is one of CDF’s core strengths. It lets you link resources from different data sources, which is powerful in an industrial context: for example, linking an asset hierarchy based on SAP to time series extracted from sensors. You can contextualize CDF resources in the UI. If you need more filtering capabilities, or want to automate the process, you can also do it programmatically.

 

In this article, we’ll take the example of contextualizing files and assets with the SDK. The setup looks like this:

  • 20 assets 
  • 20 files, each with something similar to the asset’s external ID in a metadata field called “asset”

 

Files and assets often share no obvious key to link on, because they may be ingested from different sources. That is where contextualization comes in: it uses machine learning to find matches between resources based on the fields you specify.
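
To build some intuition for field-based matching, here is a toy sketch using Python's standard difflib module. It only illustrates the idea of scoring candidates by string similarity and keeping the best one. It is not how CDF's entity matching model works internally, and best_match is a hypothetical helper introduced for this illustration.

```python
from difflib import SequenceMatcher


def best_match(source_value: str, candidates: list[str]) -> str:
    """Return the candidate most similar to source_value (toy illustration only)."""
    return max(
        candidates,
        key=lambda candidate: SequenceMatcher(None, source_value, candidate).ratio(),
    )


# Values mirroring the setup in this article:
# a prefixed metadata value on the file vs. the asset external IDs
asset_external_ids = [f"my_asset_{i}" for i in range(100, 120)]
print(best_match("PREFIX_my_asset_100", asset_external_ids))
```

A real model learns which parts of the fields matter, which is why it can cope with prefixes, suffixes, and formatting differences better than a fixed similarity measure.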

 

To build the setup, we first create the assets and files that we will contextualize later. For this example, we simply upload empty files.

 

import io

from cognite.client import CogniteClient
from cognite.client.data_classes import Asset, DataSet

# Assumes the client configuration (project, credentials) is already set up
client = CogniteClient()

# Create a dedicated data set
DATASET = "contextualization"
dataset = client.data_sets.create(DataSet(name=DATASET, external_id=DATASET))

# Create 20 assets and 20 empty files
for i in range(100, 120):
    client.assets.create(
        Asset(external_id=f"my_asset_{i}", name=f"Asset nb {i}", data_set_id=dataset.id)
    )
    client.files.upload_bytes(
        io.BytesIO(),
        name="A file",
        metadata={"asset": f"PREFIX_my_asset_{i}"},
        data_set_id=dataset.id,
        external_id=f"ABC{1000-i}",
    )

 

The contextualization happens in three steps: fitting the model, making match predictions, and using those matches to link files to assets.

 

Fitting the model takes one client call, as in the code snippet below. Once the model is fitted, we apply it to our data with the predict method, which returns all the matches the model found for you to process as you want. In our case, we use the match_fields parameter to contextualize based on the assets’ external IDs and the files’ metadata “asset” value. You could use other fields, such as labels or names, depending on your use case and data. Note that predict returns a contextualization job: to get the actual results, use its .result property, which waits for the job to finish and returns the result. Read more about it here: https://cognite-sdk-python.readthedocs-hosted.com/en/latest/contextualization.html#predict-using-an-entity-matching-model

 

# Files are the sources, assets are the targets
sources = client.files.list(limit=None, data_set_external_ids=DATASET)
targets = client.assets.list(limit=None, data_set_external_ids=DATASET)


contextualization_model_ext_id = "my_contextualization_model"
client.entity_matching.fit(
    external_id=contextualization_model_ext_id,
    sources=sources,
    targets=targets,
    match_fields=[{"source": "metadata.asset", "target": "externalId"}],
).wait_for_completion()
# .result waits for the prediction job to finish and returns its output
result = client.entity_matching.predict(external_id=contextualization_model_ext_id).result
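
The result is a plain dictionary. As a rough sketch, assuming it has the shape {"items": [{"source": ..., "matches": [{"score": ..., "target": ...}]}]}, the helper below flattens it into rows that are easier to inspect. The name summarize_matches is hypothetical and the sample data is made up for illustration.

```python
def summarize_matches(result: dict) -> list[tuple]:
    """Flatten a prediction result into (source external ID, target external ID, score) rows."""
    rows = []
    for item in result["items"]:
        for match in item["matches"]:
            rows.append(
                (
                    item["source"].get("externalId"),
                    match["target"].get("externalId"),
                    match["score"],
                )
            )
    return rows


# Made-up sample in the assumed shape; real data comes from the prediction job's result
sample_result = {
    "items": [
        {
            "source": {"externalId": "ABC900"},
            "matches": [{"score": 0.97, "target": {"id": 1, "externalId": "my_asset_100"}}],
        },
        # A source for which the model found no match
        {"source": {"externalId": "ABC899"}, "matches": []},
    ]
}
for row in summarize_matches(sample_result):
    print(row)
```

Sources without any match simply produce no rows, so comparing the number of rows with the number of sources gives a quick idea of coverage.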

 

You can also pass the true_matches parameter when fitting: a list of source-target pairs that you know to be correct matches.

 

true_matches = [
    {"sourceExternalId": "ABC900", "targetExternalId": "my_asset_100"},
    {"sourceExternalId": "ABC899", "targetExternalId": "my_asset_101"},
    {"sourceExternalId": "ABC898", "targetExternalId": "my_asset_102"},
    {"sourceExternalId": "ABC897", "targetExternalId": "my_asset_103"},
    {"sourceExternalId": "ABC890", "targetExternalId": "my_asset_110"},
    {"sourceExternalId": "ABC881", "targetExternalId": "my_asset_119"},
]


# No filter here: this lists every file and asset the client can read
sources = client.files.list(limit=None)
targets = client.assets.list(limit=None)


contextualization_model_ext_id = "my_contextualization_model"
client.entity_matching.fit(
    external_id=contextualization_model_ext_id,
    sources=sources,
    targets=targets,
    match_fields=[{"source": "metadata.asset", "target": "externalId"}],
    true_matches=true_matches,
).wait_for_completion()


# num_matches=1 keeps a single match per source
result = client.entity_matching.predict(
    external_id=contextualization_model_ext_id, num_matches=1
).result
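
Because the example setup derives the file external IDs from the loop index (ABC{1000-i}) and the asset external IDs from the same index (my_asset_{i}), the known pairs can also be generated rather than typed by hand. A small sketch, where known_indices is a name introduced here:

```python
# Indices for which we assume the correct file-asset pair is known
known_indices = [100, 101, 102, 103, 110, 119]

# Files were created with external_id f"ABC{1000-i}" and assets with f"my_asset_{i}"
true_matches = [
    {"sourceExternalId": f"ABC{1000 - i}", "targetExternalId": f"my_asset_{i}"}
    for i in known_indices
]
print(true_matches[0])
```

In a real project the known pairs would more likely come from a manually validated mapping than from a naming rule.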

 

As you can see, the sources and targets are just lists retrieved with the SDK. We can therefore use any resources we want, with any filter we want: names, labels, data sets, and so on.

 

You can also change the number of matches returned per source and set a score threshold for returned matches. Moreover, you can experiment with different feature types and classifiers (see the docs: https://api-docs.cognite.com/20230101/tag/Entity-matching/operation/entityMatchingCreate). As general advice, I recommend reading the docs and testing different configurations to pick the one that produces the best results for your needs.
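
As a sketch of how those options map onto the Python SDK: num_matches and score_threshold are passed to predict, while feature_type and classifier are passed to fit. The wrapper name fit_and_predict, the model external ID, and the particular values ("bigram", "randomforest", 3, 0.7) are illustrative choices made for this example, not recommendations. Check the API docs for the values your version accepts.

```python
def fit_and_predict(client, sources, targets, true_matches=None):
    """Fit an entity matching model with explicit options and return the prediction result."""
    model_external_id = "my_tuned_contextualization_model"  # hypothetical external ID
    model = client.entity_matching.fit(
        external_id=model_external_id,
        sources=sources,
        targets=targets,
        match_fields=[{"source": "metadata.asset", "target": "externalId"}],
        true_matches=true_matches,
        feature_type="bigram",  # how field values are turned into features
        classifier="randomforest",  # assumption: mainly relevant when true_matches are given
    )
    model.wait_for_completion()
    job = client.entity_matching.predict(
        external_id=model_external_id,
        num_matches=3,  # return up to three candidate targets per source
        score_threshold=0.7,  # only return candidates scoring at least 0.7
    )
    return job.result
```

Calling fit_and_predict(client, sources, targets) with different values and comparing the results on a set of known pairs is one simple way to choose a configuration.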

Once you have the model’s predictions, you can do what you want with them. Here, we link each file to its matched asset as follows:

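from cognite.client.data_classes import FileMetadataUpdate

# Keep only sources with at least one match, then link each file to its first returned match
matches = [item for item in result["items"] if item["matches"]]
file_updates = [FileMetadataUpdate(external_id=item["source"]["externalId"]).asset_ids.add([item["matches"][0]["target"]["id"]]) for item in matches]
client.files.update(file_updates)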
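
After the update, it can be worth checking that the links were actually written. A minimal sketch, assuming the client is the same CogniteClient used above: find_unlinked_files is a hypothetical helper that retrieves the files again and reports those that still have no asset attached.

```python
def find_unlinked_files(client, file_external_ids):
    """Return the external IDs of files that still have no linked assets."""
    files = client.files.retrieve_multiple(external_ids=file_external_ids)
    # asset_ids is empty or None when a file is not linked to any asset
    return [f.external_id for f in files if not f.asset_ids]
```

For example, find_unlinked_files(client, [item["source"]["externalId"] for item in matches]) should return an empty list once every matched file has been linked.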

 

That’s it, our files and assets are contextualized!

 

The UI and the SDK each have their pros and cons, and which one fits best depends on your use case. As we have seen in this article, the SDK allows more custom filtering when selecting source and target resources.

If you have any questions about contextualizing your data, please contact us!


3 replies


great to have you back in Cognite sharing best practices with the Community @Pierre Pernot 


Great article!


Great article!
The Data Engineer Basics - Transform and Contextualize learning path on Cognite Academy has demonstrations and hands-on exercises with example notebooks for those who want to take their learning one step further.
