Question

Is it possible to specify whether time series data is aggregated at "start", "midpoint", or "end"?

  • 21 February 2024
  • 4 replies
  • 57 views

I am comparing time series data between CDF and PI, because in our tenants the CDF data is not a 100% match for the corresponding PI data.

However, from my testing I think that PI performs its aggregations with the timestamps “centered” at the aggregated time periods, while CDF puts the timestamps at the start of each aggregated period. Is it possible to specify how this is done with the Python API? From my study of the docs it appears not to be the case. The same applies to the PI Web API as well: I cannot specify how the timestamps are placed. The agreement with PI becomes significantly better if I place the CDF timestamps at the center of the aggregated time periods.

My current workaround is the following:

  1. Fetch RAW data from CDF
  2. Shift the timestamps by 0.5x of the granularity
  3. Resample to the desired granularity
  4. Compute mean
  5. Interpolate any missing values

The issue is that fetching raw data is a lot more time consuming than fetching aggregates. I have been playing with fetching aggregates from CDF and performing the shift after the fact, but this does not give as good agreement. I have also played around with speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant.
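For reference, the chunked, multi-threaded fetch could be structured roughly like this. This is only a sketch: `fetch_one` is a hypothetical stand-in for a single `retrieve_dataframe` call over one sub-range, and the chunk count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def split_range(start, end, n_chunks):
    """Split [start, end) into n_chunks contiguous (chunk_start, chunk_end) pairs."""
    edges = pd.date_range(start, end, periods=n_chunks + 1)
    return list(zip(edges[:-1], edges[1:]))


def fetch_raw_chunked(fetch_one, start, end, n_chunks=4):
    """Fetch each chunk concurrently and concatenate in time order.

    `fetch_one(chunk_start, chunk_end)` stands in for one raw-datapoints
    fetch over that sub-range (e.g. a retrieve_dataframe call).
    """
    chunks = split_range(start, end, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        frames = list(pool.map(lambda c: fetch_one(*c), chunks))
    return pd.concat(frames).sort_index()
```

As noted later in the thread, the server side is already heavily parallelised, so this kind of client-side chunking may not help much.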

Here is an example of how I do the CDF raw data fetch in order to get good agreement.

ts = (
    self.cdf_client.time_series.data.retrieve_dataframe(
        external_id=get_ts_external_id_from_name(name=self.ts_name, client=self.cdf_client),
        start=self.start_time,
        end=end_time,
        aggregates=None,  # raw datapoints, no server-side aggregation
        granularity=None,
        limit=None,
    )
    .tz_localize("UTC")
    .tz_convert("CET")
    .shift(0.5, self.sampling_interval)  # centre timestamps in each period
    .resample(self.sampling_interval)
    .mean()
    .interpolate()
)

4 replies

Good morning Glen!

Sure, I’ll be happy to share what I find from my small investigations. I’ll keep in touch!


Hi @Anders Brakestad,

I’m the product manager for our Time Series services, and I’m keen to learn any observations, good or otherwise, that you find from your evaluation. Would you be available for a short call with me when you’ve concluded your research?

Kind Regards, Glen

Thank you for the quick reply!

Firstly, I was not aware that the CDF aggregation was a time-weighted average, thank you for mentioning this!

If I shift the start time as you showed, I still get a shift in the CDF data features compared to PI. Here is a screenshot:

CDF vs PI: Start time in CDF fetch offset +0.5 units compared to start time for PI fetch.


Shifting the CDF data by 0.5 units makes the features overlap much better:

CDF vs PI: CDF index additionally shifted 0.5 units to get better match.

However, this means the timestamps are no longer equal for the CDF and PI data. That is not a huge issue if I just compute comparison metrics (RMSE, max error, percentage error, etc.), but it adds extra details that need to be documented and explained.
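With mismatched timestamps, the metrics can still be computed after aligning the two series. A sketch of one way to do that, where `comparison_metrics` is a hypothetical helper and nearest-neighbour alignment is an assumption, not anything from the thread:

```python
import numpy as np
import pandas as pd


def comparison_metrics(cdf: pd.Series, pi: pd.Series) -> dict:
    """Align the CDF series onto PI's timestamps (nearest sample)
    and compute basic error metrics between the two."""
    aligned = cdf.reindex(pi.index, method="nearest")
    err = aligned - pi
    return {
        "rmse": float(np.sqrt(np.mean(err**2))),
        "max_abs_error": float(err.abs().max()),
        "mean_pct_error": float((err / pi).abs().mean() * 100),
    }
```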

Perhaps there was something I misunderstood by your example. For reference, here is the fetch code:

cdf = (
    client.time_series.data.retrieve_arrays(
        external_id=get_ts_external_id_from_name(name=ts_name, client=client),
        start=start_time + 0.5 * pd.Timedelta(sampling_interval),
        limit=max_count,
        aggregates="average",
        granularity=sampling_interval,
    )
    .to_pandas()
    .squeeze()
    .tz_localize("UTC")
    .tz_convert("CET")
)


So I get exactly the same results regardless of whether I shift the start_time parameter in the CDF fetch relative to the one used for PI fetch.


The timestamps mark the beginning of each “granularity period” (as you correctly inferred). You can read that and more here: https://developer.cognite.com/dev/concepts/aggregation/#aggregating-data-points

Depending on your sampling interval, it may be possible to shift relative to PI. E.g. a granularity of 1 hour can be replaced by 60 minutes, and then the start parameter lets you control/shift the sampling period. See image example:

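The same window-start idea can be illustrated in pure pandas (this only demonstrates the concept with `resample(..., offset=...)`; it is not the CDF API itself, and the values are made up):

```python
import numpy as np
import pandas as pd

# Two hours of 1-minute samples: values 0..119.
s = pd.Series(
    np.arange(120, dtype=float),
    index=pd.date_range("2024-01-01", periods=120, freq="min"),
)

# Windows aligned to the hour: [00:00, 01:00), [01:00, 02:00), ...
on_the_hour = s.resample("60min").mean()

# Windows shifted by half a period, analogous to offsetting the
# `start` parameter of the aggregate fetch: [00:30, 01:30), ...
half_shifted = s.resample("60min", offset="30min").mean()
```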
Just a note regarding .resample(self.sampling_interval).mean().interpolate(): this will compute the simple average (i.e. “sum of values / number of values”) instead of the time-weighted average returned by CDF (since this is a workaround, I guess you already know this, but still worth mentioning!).
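To see why the two averages differ on irregularly sampled data, here is a small illustration. The stepwise (previous-value) weighting below is an assumption for the sake of the example; CDF's exact interpolation scheme may differ.

```python
import numpy as np
import pandas as pd


def time_weighted_mean(s: pd.Series) -> float:
    """Time-weighted average assuming each value holds until the next
    sample: weight each value by the duration to the next timestamp."""
    dt = np.diff(s.index.asi8).astype(float)  # gaps in nanoseconds
    return float(np.sum(s.values[:-1] * dt) / dt.sum())


# 10.0 holds for 50 minutes, 20.0 for the last 10 minutes.
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:50", "2024-01-01 01:00"])
s = pd.Series([10.0, 20.0, 20.0], index=idx)

simple = s.mean()                 # (10 + 20 + 20) / 3 ≈ 16.67
weighted = time_weighted_mean(s)  # (10*50 + 20*10) / 60 ≈ 11.67
```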

...and a final note on “I have also played around with speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant”: all the datapoint retrieve endpoints are already heavily optimised, so unless you have multiple credentials to circumvent rate limiting, your best course of action is to pass all the time series you want to fetch in a single call (also, retrieve_arrays has the best overall performance).
