Wednesday Wisdom 1: What is alerting, and what are some key flaws with current solutions?


Userlevel 4

Hey community, we are trying to be more active on this Cognite Hub page and add content consistently on a month-to-month basis. This will include a series of articles on alerting and monitoring, as well as plans for where AIR is heading in the future.

Why Alerting?

Automated alerts are an essential part of monitoring. They allow you to spot issues with equipment groups, time series, data quality, pipelines, etc.

But alerts aren’t always as effective as they could be. In particular, real problems are often lost in a sea of noisy alarms. In short:

  1. Alert liberally, meaning it's OK to over-alert users rather than not alert at all
  2. Make sure that the user has complete control over how and when they want to be alerted

Inherent challenges in Alerting

  1. Sensitivity: Overly sensitive systems cause excessive false positive alerts, while less sensitive systems can miss issues and have false negatives. Determining the correct alerting threshold requires ongoing tuning and refinement.
  2. Fatigue: The common approach to sensitivity is for teams to be more conservative when they set up alerts, but this results in a more sensitive and noisy alerting system. If teams encounter too many false positives, they will begin to ignore alerts and miss real issues, defeating the purpose of an alerting system.
  3. Maintenance: Systems grow and evolve quickly, but teams are often slow to update alerting policies. This leads to an alerting strategy that is simultaneously filled with outdated policy deadwood and gaps where teams aren’t providing coverage for newer changes in their systems.
  4. Fragmented information: Many teams use multiple different systems to manage alerts across increasingly complex technology stacks, which means that the information needed to diagnose and troubleshoot a problem may be spread across multiple tools.
  5. Scaling: One of the key challenges in alerting is doing this not only over a single time series but over many groups of time series or other data points, so that value is derived from a system-level perspective.
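
To make the sensitivity trade-off above concrete, here is a minimal Python sketch. It is not AIR code: the readings, the labels, and the helper name `count_alert_errors` are all made up for illustration. It only shows that the same data produces false positives with a low threshold and false negatives with a high one.

```python
def count_alert_errors(values, is_real_issue, threshold):
    """Return (false_positives, false_negatives) for a 'value > threshold' alert.

    `values` and `is_real_issue` are hypothetical inputs: one reading per
    time step and a hand-made label saying whether a real issue was present.
    """
    false_positives = 0
    false_negatives = 0
    for value, real in zip(values, is_real_issue):
        fired = value > threshold
        if fired and not real:
            false_positives += 1  # alert fired, but nothing was wrong
        elif real and not fired:
            false_negatives += 1  # something was wrong, but no alert fired
    return false_positives, false_negatives


# Made-up readings and labels, for illustration only.
readings = [70, 72, 85, 90, 74, 95, 78, 88]
real_issue = [False, False, False, True, False, True, False, True]

# A low threshold is sensitive: it catches every issue but is noisy.
sensitive = count_alert_errors(readings, real_issue, threshold=73)

# A high threshold is quiet: no noise, but real issues are missed.
conservative = count_alert_errors(readings, real_issue, threshold=92)

print("sensitive (FP, FN):", sensitive)
print("conservative (FP, FN):", conservative)
```

Neither threshold is right on its own, which is why the threshold usually needs ongoing tuning against real feedback.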

How do we deal with these challenges at Cognite through AIR?

  1. We minimize the number of fragmented subsystems by making interaction between the SME and the data scientist seamless. AIR works with all the data that has been contextualised and stored within CDF, and its convenient front end eliminates the need to connect various systems to get a working alerting system.
  2. AIR allows users to tweak various parameters in the alert creation itself, and also lets users customise their own model using Python code. This enables scaling over many use cases and helps address various sensitivity and fatigue issues with alerting.
  3. Scaling alerting and monitoring is very much in the AIR team's backlog, and it is something we are discussing as we move into CDF in the months to come. With FDM being a powerful tool available to everybody within CDF, it is just a matter of time.

Do you agree that these are some of the challenges we have in the alerting and monitoring space? Is there something we may have missed? Is there anything missing in AIR today that would help towards this goal of making alerting accessible and easy to use? Feel free to leave comments in this thread; we would love to hear what you have to say.


12 replies

Userlevel 3

I hope we are still keeping this functionality, as Python is spreading like wildfire in our organization.

 

AIR allows users to tweak various parameters in the alert creation itself, and also lets users customise their own model using Python code. This enables scaling over many use cases and helps address various sensitivity and fatigue issues with alerting.

Userlevel 3

Hi, Ibrahim.

Great to hear how this is valuable for your organization. We are maturing these services, making them more robust and a native part of the Cognite Data Fusion experience. 

What would you say are the categories where you see the most value from monitoring and alerting?

Knut

Userlevel 3

@Arun Arunachalam and @Knut Vidvei - The one thing we feel is missing from this list that would make alerting far more valuable to end users (and differentiate it from other alerts we already have on our facilities) is the ability to generate an alert workflow, given that we rarely rely on a single alert to take action.

Example: if we’re monitoring the flow through a pump and we want to be alerted when the pump is encountering issues performing, we expect the flow to go to zero when the entire facility is down. The only time this alert would be of value is if the facility is online and the pump rate drops below a specified threshold. 
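
The pump example above amounts to a rule with two conditions. Here is a minimal Python sketch of that idea. It is not how AIR implements alerts: the `Sample` type, the field names, and the 10.0 flow limit are hypothetical and only illustrate gating a threshold alert on a second signal.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical reading; field names are made up for illustration."""
    facility_online: bool  # assumed status signal (could be an event or a time series)
    pump_flow: float       # assumed pump flow reading, arbitrary units


def pump_alert(sample: Sample, min_flow: float) -> bool:
    """Fire only when the facility is online AND the pump flow is below the limit."""
    return sample.facility_online and sample.pump_flow < min_flow


samples = [
    Sample(facility_online=True, pump_flow=50.0),   # healthy
    Sample(facility_online=True, pump_flow=4.0),    # online but low flow: alert
    Sample(facility_online=False, pump_flow=0.0),   # facility down: zero flow is expected
    Sample(facility_online=True, pump_flow=12.0),   # healthy
]

alerts = [pump_alert(s, min_flow=10.0) for s in samples]
print(alerts)
```

A flow-only rule would also fire on the third sample, which is exactly the alert that carries no value while the facility is down.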

Example 2 - most of our RCAs involve some series of events that is currently captured as a static RCA rather than being built into a monitoring system. By allowing this workflow to be “automated” within AIR, we can ensure that these events are monitored so they can be prevented in the future. It would also be beneficial to pull up, as part of the alert, the outcome of the previous RCA so that end users know what action to take or to avoid.

Userlevel 4

@ibrahim.alsyed yes, Python will continue to be a part of our infrastructure, but we are looking into decoupling this functionality from AIR as it is today, because it adds a lot of unnecessary complexity for the 80% of use cases that require only simple mathematical rules. We are looking to improve the function deployment experience so that users can generate their own KPIs much more easily and then monitor them. Finally, in the meantime, while we are moving towards alerting from charts on regular and calculated time series, you will still be able to generate KPIs using functions. It just won't be as tightly coupled an experience as it is today.

Userlevel 4

@Arun Arunachalam and @Knut Vidvei - The one thing we feel is missing from this list that would make alerting far more valuable to end users (and differentiate it from other alerts we already have on our facilities) is the ability to generate an alert workflow, given that we rarely rely on a single alert to take action.

Example: if we’re monitoring the flow through a pump and we want to be alerted when the pump is encountering issues performing, we expect the flow to go to zero when the entire facility is down. The only time this alert would be of value is if the facility is online and the pump rate drops below a specified threshold. 

Example 2 - most of our RCAs involve some series of events that is currently captured as a static RCA rather than being built into a monitoring system. By allowing this workflow to be “automated” within AIR, we can ensure that these events are monitored so they can be prevented in the future. It would also be beneficial to pull up, as part of the alert, the outcome of the previous RCA so that end users know what action to take or to avoid.

This is really great feedback for us. How would the fact that a facility is online be represented in CDF? Would it be an event or a time series? If so, one of the things we are looking into is the ability to add multiple rules to our alerts, but we still need to do some user interviews to get feedback on what that truly means.

Userlevel 3

Is alerting only possible using time series data? What if I want alerting on transactions such as Work Orders or Product Recipes that change, etc.?

Userlevel 3

Can we do unsupervised Machine Learning Anomaly Detection on Time Series Trends in AIR?

Userlevel 3

@Arun Arunachalam  can you please take a look at my questions?

Userlevel 4

Is alerting only possible using time series data? What if I want alerting on transactions such as Work Orders or Product Recipes that change, etc.?

Alerting as we are working on it today is focused on time series, and on providing a stable and scalable experience not only across numbers but across use cases. Although alerting on events is not coming soon, we are working on the infrastructure that would make it possible in a scalable manner, while we learn about the use cases and workflows that go into such an event-driven system. We are hoping to reach out to our users later in the year, as we move towards a more stable product, to understand and solve these use cases better, so stay tuned. I do wonder: when you say work orders and product recipes, what would a representation of those look like in CDF?

Userlevel 3

It would be events.

Userlevel 4

Can we do unsupervised Machine Learning Anomaly Detection on Time Series Trends in AIR?

This is a bit hard to answer without knowing the details of the models and the inputs that go into them. With the alerting approach that we are going with today, the way you could achieve this is by writing a Cognite function that runs a pretrained model and writes some kind of KPI. You could then potentially do alerting on these KPIs as we expand our thresholding capabilities. It would be a cool exercise for us to explore these use cases together and figure out what the missing pieces are. I see strong potential in refining that Cognite function experience as we get more user input, and in reducing the friction between deploying these models and setting up alerting.
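
As a rough illustration of the "model writes a KPI, then alert on a threshold" pattern described above, here is a plain Python sketch. It does not use the Cognite SDK or any AIR API. A rolling z-score stands in for the pretrained model, and the function names, window size, and limit are all assumptions made for this example.

```python
from statistics import mean, pstdev


def rolling_zscore_kpi(values, window):
    """Hypothetical KPI: distance of each point from the mean of the
    preceding `window` points, measured in standard deviations.

    This simple statistic is a stand-in for a real pretrained model.
    """
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            scores.append(0.0)  # not enough history yet, so no score
            continue
        spread = pstdev(history)
        if spread == 0:
            scores.append(0.0)  # flat history; avoid dividing by zero
        else:
            scores.append(abs(value - mean(history)) / spread)
    return scores


def threshold_alerts(kpi, limit):
    """Return the indices where the KPI exceeds the alerting limit."""
    return [i for i, score in enumerate(kpi) if score > limit]


# Made-up signal with one obvious spike at index 6.
signal = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 15.0, 10.0, 9.8]

kpi = rolling_zscore_kpi(signal, window=5)       # the "model writes a KPI" step
alert_indices = threshold_alerts(kpi, limit=3.0)  # the "alert on the KPI" step
print(alert_indices)
```

In a real setup the KPI would presumably be written back as a time series and the thresholding handled by the alerting product; here both steps are kept in memory so the sketch stays self-contained.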

Userlevel 4

It would be events.

Yeah, if this is the case then it's something we will start looking into once we reach a more stable experience for time series alerting in CDF, both feature-wise and performance-wise, in the months to come. While we are working on this, we will reach out to better understand these use cases and the flows that go into such an event-driven alerting system.
