
Is it possible to specify whether time series data is aggregated at "start", "midpoint", or "end"?

  • 21 February 2024
  • 7 replies
  • 79 views

Anders Brakestad
Seasoned

I am making comparisons between time series data in CDF and PI. The reason is that in our tenants the CDF data is not 100% accurate compared to PI.

However, from my testing I think that PI performs its aggregations with the timestamps “centered” at the aggregated time periods, while CDF puts the timestamps at the start of each aggregated period. Is it possible to specify how this is done with the Python API? From my study of the docs it appears not to be the case. The same applies to the PI Web API as well: I cannot specify how the timestamps are placed. The agreement with PI becomes significantly better if I place the CDF timestamps at the center of the aggregated time periods.

My current workaround is the following:

  1. Fetch RAW data from CDF
  2. Shift the timestamps by 0.5x of the granularity
  3. Resample to the desired granularity
  4. Compute mean
  5. Interpolate any missing values

The issue is that fetching raw data is much more time-consuming than fetching aggregates. I have tried fetching aggregates from CDF and performing the shift after the fact, but this does not give as good agreement. I have also tried speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant.

Here is an example of how I do the CDF raw data fetch in order to get good agreement.

ts = (
    self.cdf_client.time_series.data.retrieve_dataframe(
        external_id=get_ts_external_id_from_name(name=self.ts_name, client=self.cdf_client),
        start=self.start_time,
        end=end_time,
        aggregates=None,
        granularity=None,
        limit=None,
    )
    .tz_localize("UTC")
    .tz_convert("CET")
    .shift(0.5, self.sampling_interval)
    .resample(self.sampling_interval)
    .mean()
    .interpolate()
)
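
For readers without access to a CDF tenant, the difference between shifting the raw timestamps before aggregating and shifting the labels after aggregating can be sketched on synthetic data. This is only an illustration in plain pandas: the series, the 10-minute spacing, and the 1-hour granularity are made up, and it uses the simple mean rather than anything CDF or PI computes.

```python
import pandas as pd

granularity = pd.Timedelta(hours=1)
half = granularity / 2

# Synthetic raw series (an assumption for illustration): one point every
# 10 minutes, with values 0..11.
index = pd.date_range(
    "2024-02-21 00:00", periods=12, freq=pd.Timedelta(minutes=10), tz="UTC"
)
raw = pd.Series([float(i) for i in range(12)], index=index)

# (a) Shift the raw timestamps by half a period, then aggregate.
#     The value labelled 01:00 now averages the raw points originally
#     stamped 00:30 up to (not including) 01:30, a window centred on 01:00.
centred = raw.shift(freq=half).resample(granularity).mean()

# (b) Aggregate first (start-labelled bins), then move the labels to the
#     middle of each bin. The bins themselves are unchanged.
relabelled = raw.resample(granularity).mean()
relabelled.index = relabelled.index + half

print(centred)
print(relabelled)
```

In (a) the labels stay on whole hours and the averaging window moves; in (b) the window stays on whole hours and the labels move to the half hour. The two give different series, which is consistent with the observation above that shifting aggregates after the fact does not behave like shifting the raw data.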

 

7 replies


The timestamps mark the beginning of each “granularity period” (as you correctly inferred). You can read that and more here: https://developer.cognite.com/dev/concepts/aggregation/#aggregating-data-points

Depending on your sampling interval, it may be possible to shift the aggregation periods to match PI. For example, a granularity of 1 hour can be written as 60 minutes; you can then use start to control (shift) where each sampling period begins. See image example:
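
The idea behind writing 1 hour as 60 minutes can be sketched without a CDF connection. The helper below is hypothetical and is not part of the Cognite SDK. It assumes, as the linked aggregation documentation is understood here, that the start time is rounded down to a whole granularity unit in UTC (hours for "1h", minutes for "60m") before the periods are laid out.

```python
from datetime import datetime, timedelta, timezone

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def aligned_period_starts(start, granularity, n_periods):
    """Hypothetical model of aggregate alignment (an assumption, not SDK code).

    `start` is floored to a whole granularity *unit* in UTC, then periods of
    the full granularity (multiple x unit) follow from that point.
    """
    multiple, unit = int(granularity[:-1]), granularity[-1]
    unit_s = UNIT_SECONDS[unit]
    epoch_s = int(start.timestamp())
    first = datetime.fromtimestamp(epoch_s - epoch_s % unit_s, tz=timezone.utc)
    step = timedelta(seconds=multiple * unit_s)
    return [first + i * step for i in range(n_periods)]


start = datetime(2024, 2, 21, 0, 30, tzinfo=timezone.utc)

# With "1h" the half-hour offset in `start` is floored away.
hourly = aligned_period_starts(start, "1h", 3)
# With "60m" the offset survives, because 00:30 is already a whole minute.
sixty_min = aligned_period_starts(start, "60m", 3)

print([t.strftime("%H:%M") for t in hourly])
print([t.strftime("%H:%M") for t in sixty_min])
```

Under that assumption, offsetting start by half an hour only moves the periods when the granularity is expressed in a unit finer than the offset.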

 

Just a note regarding .resample(self.sampling_interval).mean().interpolate(): this computes the simple average (i.e. “sum of values / number of values”) instead of the time-weighted average returned by CDF (as this is a workaround I guess you already know this, but still worth mentioning I think!).
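
To make the distinction concrete, here is a small sketch comparing the two averages on irregularly spaced points. It assumes linear interpolation between neighbouring points (a trapezoidal rule); the interpolation CDF actually applies is described in the linked documentation and may differ, for example for step time series. The function names and data are made up for illustration.

```python
def simple_average(values):
    """Sum of values divided by number of values."""
    return sum(values) / len(values)


def time_weighted_average(times, values):
    """Trapezoidal time-weighted mean over [times[0], times[-1]].

    Assumes the signal varies linearly between consecutive points.
    """
    area = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        area += dt * (values[i] + values[i + 1]) / 2
    return area / (times[-1] - times[0])


# Irregular sampling: the signal sits at 10.0 for most of the minute,
# but most of the *points* were recorded early on.
times = [0, 10, 20, 50, 60]            # seconds
values = [0.0, 0.0, 10.0, 10.0, 10.0]

print(simple_average(values))                 # 6.0
print(time_weighted_average(times, values))   # 7.5
```

The two only agree when the points are evenly spaced and the signal is well behaved, so some residual disagreement is expected when a simple resampled mean is compared against a time-weighted aggregate.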

..and a final note, “I have also played around with speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant.”: All the retrieve endpoints for datapoints are quite heavily optimised already, so unless you have multiple credentials to circumvent rate-limiting, your best course of action is to pass all time series you want to fetch in a single call (also, retrieve_arrays has the overall best performance).


Anders Brakestad
Seasoned

Thank you for the quick reply!

Firstly, I was not aware that the CDF aggregation was a time-weighted average, thank you for mentioning this!

If I shift the start time as you showed, I still get a shift in the CDF data features compared to PI. Here is a screenshot:

CDF vs PI: Start time in CDF fetch offset +0.5 units compared to start time for PI fetch.

 

Shifting the CDF data by 0.5 units the features overlap much better:

CDF vs PI: CDF index additionally shifted 0.5 units to get better match.

However, this means the timestamps are no longer equal for the CDF and PI data. It is not a huge issue if I just compute comparison metrics (RMSE, max errors, percentage errors, etc.), but these are extra details that need to be documented and explained.

Perhaps there was something I misunderstood in your example. For reference, here is the fetch code:

cdf = client.time_series.data.retrieve_arrays(
    external_id=get_ts_external_id_from_name(name=ts_name, client=client),
    start=start_time + 0.5 * pd.Timedelta(sampling_interval),
    limit=max_count,
    aggregates="average",
    granularity=sampling_interval
).to_pandas().squeeze().tz_localize("UTC").tz_convert("CET")

 

So I get exactly the same results regardless of whether I shift the start_time parameter in the CDF fetch relative to the one used for PI fetch.


Glen Sykes
Seasoned Practitioner
  • 123 replies
  • February 21, 2024

Hi @Anders Brakestad,

I’m the product manager for our Time Series services, and I’m keen to learn any observations, good or otherwise that you find from your evaluation.  Would you be available to have a short call with me when you’ve concluded your research?

Kind Regards, Glen


Anders Brakestad
Seasoned

Good morning Glen!

Sure, I’ll be happy to share what I find from my small investigations. I’ll keep in touch!


Glen Sykes
Seasoned Practitioner
  • 123 replies
  • September 4, 2024

Hi Anders, just following up on this.  How did your investigations go?  Would you mind sharing any insights and learnings you got from the service vs. PI?


Dilini Fernando
Seasoned Practitioner
  • 671 replies
  • September 12, 2024

Hi @Anders Brakestad,

Have you had the opportunity to proceed with your investigation?


Øystein Aspøy
Committed

Hi @Anders Brakestad did you get any further on this?
Take into consideration that the data in CDF and PI might differ. PI might apply compression when data is stored to the archive, whilst CDF reads directly from the PI snapshot and hence stores more datapoints if any compression is set in PI. So when you aggregate, there might be a small deviation, since CDF would have more granular data in cases where compression is turned on in PI.

