Gathering Interest

Enhanced Access Management for PI Extractor

Related products: Extractors

Forum|Forum|1 year ago
June 26, 2024
8 replies
131 views

Zachary Burke
Committed

Though many companies are pretty liberal with read access on tags within a PI Data Archive, some out there need to control access at a more granular level. (Often driven by the need to protect intellectual property.) Only certain groups of users should be able to see data for certain tags.

In our case, we would like our access model with respect to PI data in CDF to mimic what’s in PI. In a PI Data Archive, this is governed by the datasecurity metadata field. Unfortunately, the PI extractor seems to strip off this metadata on ingestion. Without that, we’re looking at a workaround solution in which we maintain this mapping in CDF and use something like a Cognite function to apply a security category that mimics the PI tag’s data access. However, this does come with some headaches:

We’ll have to maintain a separate mapping of PI tags to the right level of access (that would otherwise be present in that datasecurity metadata field).
The PI tag data can’t be secured before that security category is applied.

I provide all this background in the hope that it helps in developing features that address this need. But if I had to make a concrete ask:

Provide an option for the PI extractor to not strip off “internal bookkeeping attributes in PI” from incoming PI tags.
Provide a means to apply a security category during the extraction process itself. (Similar to how the file extractor does.)

Thank you for your attention and looking forward to discussing more!

Everton Colling
Seasoned Practitioner
July 1, 2024

Hi @Zachary Burke!

Thank you for suggesting this product idea. We will follow the traction this idea gets in the community. You can expect updates on this post if we decide to include this in our future roadmap, or if we require more information.

Thomas Sjolshagen
Lead Product Manager
July 1, 2024

@Zachary Burke - Is it fair to state that the underlying problem here is that ingesting the data into CDF strips configured access controls that are business critical/nice-to-haves (?) for the data?

If so, what is the “source of truth” when it comes to access management, and why is it _not_ reasonable to presume that the source and targets are configured correctly with respect to access controls (not leaving it up to the extractor)?

Zachary Burke
Author
Committed
July 2, 2024

That’s correct, @Thomas Sjølshagen! We’re worried about CDF becoming a way to circumvent access control. The “source of truth” here is the local PI System at the site. Determining which PI tags can be viewed by whom is already part of their workflow.

Please correct me if I’m wrong, but I’m not sure how the PI System can act as the source of truth (with respect to access control) without the extractor being involved. In a perfect world, these PI tags would be time series segregated by datasets with the appropriate access assigned to each one. However, getting the extractor to determine which PI tag should go to which dataset is where we’d have a tough time. One could argue that it should be based on the pointsource property on each PI tag, but point sources and access control don’t line up very well in this case, I’m afraid. It appears that the ultimate source of truth we need to chase is the datasecurity property that the extractor strips out. We also need to be sensitive to the situation where a PI tag’s security settings change. That shouldn’t occur very frequently at all, but if it does happen we need to be prepared. Moving a time series between datasets in response feels awkward, hence our use of security categories.

The workaround we’re designing is a transformation that applies the right security categories to new tags that appear in CDF (I’m estimating there could be up to 40 of these needed to match access patterns in PI). Because there’s no connection to that datasecurity property in PI, this will take some effort to maintain (and we’ll have to be wary of situations where someone alters the security settings on a PI tag that has already been brought over into CDF). It’s a hacky solution, so we’re looking for something more long-term.

(Of course, if there’s something we’re missing and there’s a more elegant solution we could pursue with existing extractor / CDF functionality, we would love to hear about it!)

Thomas Sjolshagen
Lead Product Manager
July 3, 2024

Having access control (sort of) set by the extractor based on a source system comes with a lot of possible complications. I suspect it would be good to learn a lot more about whether, and how, we can effectively solve the underlying problem across multiple source systems: not accidentally exposing, in CDF, data that is restricted for a given user class/type in a source system. (FYI @sunil)

My current hypothesis is that solving this in the extractors themselves has the hallmarks of making the extractors really complicated, with minimal value in terms of easily and securely transporting source system data to CDF so we enable and simplify value generation across data sources. As a result, I’m currently disinclined to pursue this from an extractor specific perspective.

BUT, I recognize the problem you’ve described (which is why I tagged Sunil in this reply), will track its follow-up in the product management group, and keep the idea in “gathering interest” status until Sunil and/or the rest of us PMs have figured out how to address the challenge you’re highlighting (since I agree we need to figure out how to resolve this).

Zachary Burke
Author
Committed
September 15, 2024

@Thomas Sjølshagen, I hope you don't mind if I check in on this? Has anything that might help answer this need made it on to the product roadmap? We’ve proceeded with a clumsy solution that’s a headache to maintain. I’m hoping we can move away from this soon.

Data from all PI tags for this particular site are being stored in a single data set. We’re implementing access control primarily using security categories that are applied with a data transformation. This data transformation decides which security category should be applied to which time series using a lookup table we’re keeping in RAW. The headache is how we’re populating this lookup table. Someone from the site is periodically exporting all metadata for all PI tags in their PI System. We’re running some scripts that use what’s in the datasecurity and pointsource fields on each PI tag to infer which security category should be applied. Then we're manually uploading a new version of the lookup table to RAW accordingly.
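As a rough illustration of the inference step described above, here is a minimal Python sketch. It assumes the datasecurity value is an ACL string shaped like "identity: A(r,w) | identity: A(r)"; the identity names, the category names, and the `CATEGORY_BY_READERS` table are hypothetical stand-ins for the lookup table kept in RAW, not the actual scripts used at the site.

```python
import re

# Assumed shape of one ACL entry: "identity: A(r,w)". This is an assumption
# about the exported datasecurity string, not a confirmed specification.
ACL_ENTRY = re.compile(r"^\s*(?P<identity>[^:|]+?)\s*:\s*A\((?P<rights>[^)]*)\)\s*$")


def readers(datasecurity: str) -> frozenset[str]:
    """Return the identities that hold read access in an ACL string."""
    found = set()
    for entry in datasecurity.split("|"):
        match = ACL_ENTRY.match(entry)
        if match is None:
            continue  # skip entries that do not fit the assumed shape
        rights = {r.strip().lower() for r in match.group("rights").split(",")}
        if "r" in rights:
            found.add(match.group("identity"))
    return frozenset(found)


# Hypothetical lookup table: set of reader identities -> security category name.
CATEGORY_BY_READERS = {
    frozenset({"piadmins", "UnitA_Readers"}): "unit-a",
    frozenset({"piadmins", "UnitA_Readers", "UnitB_Readers"}): "unit-a-or-b",
}


def infer_category(datasecurity: str, default: str = "restricted") -> str:
    """Pick a category for a tag; unknown reader sets get the fallback category."""
    return CATEGORY_BY_READERS.get(readers(datasecurity), default)
```

Falling back to a restrictive default for unrecognized reader sets is a design choice in this sketch: a tag whose ACL has not been mapped yet stays hidden rather than becoming visible by accident.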

I do believe we could eliminate this manual work if we could at least be given a way to not strip out the datasecurity tag attribute from the metadata the extractor pulls. Hopefully this wouldn't be too hard to implement at least?

Longer term, we're still struggling with how to handle individual PI tags that should be visible to people from a few different process units. In PI, you can add multiple groups to a single PI tag's datasecurity attribute and anyone from any of those groups can see data for that tag. However, security categories in CDF seem to operate the opposite way - someone would have to be a member of all groups in order to see the data. To get around this, we're having to select which process unit it's more important for people to see a tag's data for or we're creating security categories that are combinations of units. However, I can't think of a good alternative for mimicking this that would make sense with the current access structure in CDF. But from what I've heard from @Sunil Krishnamoorthy , it does sound like there are some long-term changes in CDF access coming that might help us with a more elegant solution for this!
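The mismatch described above can be sketched with plain set operations. This is only an illustration of the two semantics as described in this post ("any listed group" in PI versus "all categories" in CDF); the function names, group names, and the naming scheme for combined categories are hypothetical.

```python
def pi_can_read(user_groups: set[str], tag_readers: set[str]) -> bool:
    """PI-style check: membership in any one listed group is enough."""
    return bool(user_groups & tag_readers)


def category_can_read(user_categories: set[str], series_categories: set[str]) -> bool:
    """Category-style check: the user must hold every category on the series."""
    return series_categories <= user_categories


def combined_category(units: set[str]) -> str:
    """Name a single category for a combination of units (hypothetical scheme)."""
    return "+".join(sorted(units))


# A tag shared by two process units, and a user who belongs to only one of them.
shared_units = {"unit-a", "unit-b"}
user = {"unit-a"}

print(pi_can_read(user, shared_units))        # any-of semantics
print(category_can_read(user, shared_units))  # all-of semantics

# Workaround: one combined category, granted to members of either unit.
combo = combined_category(shared_units)
print(category_can_read(user | {combo}, {combo}))
```

The combined-category workaround restores the any-of behavior, at the cost of one extra category per distinct combination of units that shares a tag.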

Thomas Sjolshagen
Lead Product Manager
September 16, 2024

I don’t mind @Zachary Burke, but we’ll bring @Sunil Krishnamoorthy’s attention to this thread since this problem has a much bigger landing space than the extractors (it has more of a “how to reflect data-access limitations from source to consuming application” ring to it, from my perspective).

I think you’re essentially bringing up the following problem/job to be done:

The (union of?) access control configured for a connected source system must be respected and reflected by CDF for access to data integrated from the source system(s), throughout the data journey from source to consumer.

Does that - broadly - cover it?

Zachary Burke
Author
Committed
September 19, 2024

I think you’re essentially bringing up the following problem/job to be done:

Does that - broadly - cover it?

Yes, that covers it! The entire problem can be summed up as: we don’t want CDF to be a potential way to bypass access control in source systems (like PI).

I realize that’s a very large problem to tackle that will take some careful thought. In the meantime, it would be extremely helpful if we could have a way to not strip out the datasecurity metadata field on time series from PI tags. If we just had that, we could at least eliminate the manual steps we currently have to do. Would it help if I made a separate product idea for that?

Thomas Sjolshagen
Lead Product Manager
September 19, 2024

My primary concern, and I _believe_ the primary reason why we’re omitting it and other security related metadata fields when ingesting data to the Time Series service/API, has to do with the available data size for time series metadata.

In the classic Time Series API, this is very much a fixed size space, so depending on the amount of information held in the datasecurity metadata field, plus any other metadata you need/want stored, knowing where to “cut things off” is going to be a highly dynamic problem for the extractor. And it could mean that for different time series, the metadata stored could differ (which I suspect is not going to be a great experience). So this isn’t something I think makes sense to do, from a cohesive customer/product experience perspective.
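To make the "where to cut things off" concern concrete, here is a small Python sketch of fitting metadata into a fixed byte budget. The `budget_bytes` value and the field contents are hypothetical and are not the actual limits of the Time Series API; the point is only that two tags with the same fields can end up with different metadata stored.

```python
def fit_metadata(metadata: dict[str, str], budget_bytes: int) -> dict[str, str]:
    """Keep each pair, in order, that still fits in the remaining byte budget."""
    kept = {}
    used = 0
    for key, value in metadata.items():
        size = len(key.encode("utf-8")) + len(value.encode("utf-8"))
        if used + size > budget_bytes:
            continue  # this pair does not fit; a later, smaller pair still might
        kept[key] = value
        used += size
    return kept


# Two hypothetical tags with the same fields but ACLs of different length.
short_acl = {"pointsource": "OPC", "datasecurity": "piadmins: A(r,w)"}
long_acl = {
    "pointsource": "OPC",
    "datasecurity": "piadmins: A(r,w) | UnitA_Readers: A(r) | UnitB_Readers: A(r)",
}

# With a hypothetical 64-byte budget, only the first tag keeps its datasecurity field.
print(sorted(fit_metadata(short_acl, 64)))
print(sorted(fit_metadata(long_acl, 64)))
```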

The new Data Modeling TimeSeries type - part of the new CDM launched in the recent Q3 release - would let you be more explicit about the Time Series metadata, and for that it may make more sense to support extracting the security metadata from the source. But migrating to CDM is a bit of a tall order to ask…
