Hey community, we are trying to be more active on this Cognite Hub page and to add content consistently on a month-to-month basis. This will include a series of articles on alerting and monitoring, as well as our plans for where AIR is heading in the future.
Why Alerting?
Automated alerts are an essential part of monitoring. They allow you to spot issues with equipment groups, time series, data quality, pipelines, etc.
But alerts aren’t always as effective as they could be. In particular, real problems are often lost in a sea of noisy alarms. In short:
- Alert liberally, meaning it's OK to spam users rather than not alert at all
- Make sure that the user has complete control over how and when they are alerted
Inherent challenges in Alerting
- Sensitivity: Overly sensitive systems cause excessive false positive alerts, while less sensitive systems can miss issues and have false negatives. Determining the correct alerting threshold requires ongoing tuning and refinement.
- Fatigue: The common approach to sensitivity is for teams to be more conservative when they set up alerts, but this results in a more sensitive and noisy alerting system. If teams encounter too many false positives, they will begin to ignore alerts and miss real issues, defeating the purpose of an alerting system.
- Maintenance: Systems grow and evolve quickly, but teams are often slow to update alerting policies. This leads to an alerting strategy that is simultaneously filled with outdated policy deadwood and gaps where teams aren’t providing coverage to newer changes in their systems.
- Fragmented information: Many teams use multiple different systems to manage alerts across increasingly complex technology stacks, which means that the information needed to diagnose and troubleshoot a problem may be spread across multiple tools.
- Scaling: One of the key challenges in alerting is doing this not just for a single time series but across many groups of time series or other data points, so that we derive actual value from a system-level perspective.
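To make the sensitivity and fatigue trade-off concrete, here is a minimal Python sketch. It is not AIR code: the function `threshold_alerts`, its `min_consecutive` parameter, and the sample readings are all hypothetical and made up for illustration. It assumes a simple rule where an alert fires only after the value has stayed above a threshold for a given number of samples in a row.

```python
from typing import List, Sequence


def threshold_alerts(
    values: Sequence[float], threshold: float, min_consecutive: int = 1
) -> List[int]:
    """Return the indices at which an alert fires (hypothetical helper).

    An alert fires once per excursion, at the sample where the value has
    been above `threshold` for `min_consecutive` samples in a row.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    alerts = []
    run = 0  # length of the current run of samples above the threshold
    for i, value in enumerate(values):
        if value > threshold:
            run += 1
            if run == min_consecutive:
                alerts.append(i)
        else:
            run = 0
    return alerts


# Made-up readings: two single-sample spikes and one sustained excursion.
readings = [70, 71, 95, 72, 70, 96, 97, 98, 99, 71, 94, 70]

# Sensitive setting: every spike triggers an alert.
sensitive = threshold_alerts(readings, threshold=90, min_consecutive=1)

# Less sensitive setting: only the sustained excursion triggers an alert.
tolerant = threshold_alerts(readings, threshold=90, min_consecutive=3)

print(sensitive)  # [2, 5, 10]
print(tolerant)   # [7]
```

With the sensitive setting the user gets three alerts, two of them for one-sample spikes. With the less sensitive setting only the sustained excursion alerts, but it does so two samples later, and a short real fault would be missed. Which setting is right depends on the equipment and the user, which is why this kind of parameter needs to stay tunable.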
How do we deal with these challenges at Cognite through AIR?
- We minimize the number of fragmented subsystems by making interaction between the SME and the data scientist seamless. AIR works with all the data that has been contextualised and stored within CDF, and provides a convenient front end, eliminating the need to connect various systems to get a working alerting system.
- AIR allows users to tweak various parameters in the alert creation itself, and also lets users customise their own model using Python code. This enables scaling over many use cases and helps solve various sensitivity and fatigue issues with respect to alerting.
- Scaling alerting and monitoring is very much in the backlog of the AIR team, and it is something we are discussing as we move into CDF in the months to come. With FDM being a powerful tool available to everybody within CDF, it is just a matter of time.
Do you agree that these are some of the challenges we have in the alerting and monitoring space? Is there something we may have missed? Is there anything missing in AIR today that would help towards this goal of making alerting accessible and easy to use? Feel free to leave comments in this thread; we would love to hear what you have to say.