Question

Is it possible to specify whether time series data is aggregated at "start", "midpoint", or "end"?

  • 21 February 2024
  • 4 replies
  • 56 views

I am comparing time series data between CDF and PI. The reason is that, in our tenants, the CDF data does not match the PI data exactly.

However, from my testing it looks like PI performs its aggregations with the timestamps “centered” on the aggregated time periods, while CDF places the timestamps at the start of each aggregated period. Is it possible to specify how this is done with the Python API? From my reading of the docs it appears not to be possible. The same applies to the PI Web API: I cannot specify where the timestamps are placed. The agreement with PI becomes significantly better if I place the CDF timestamps at the center of the aggregated time periods.

My current workaround is the following:

  1. Fetch RAW data from CDF
  2. Shift the timestamps by half the granularity
  3. Resample to the desired granularity
  4. Compute mean
  5. Interpolate any missing values

The issue is that fetching raw data is much more time-consuming than fetching aggregates. I have been playing with fetching aggregates from CDF and performing the shift after the fact, but this does not give as good agreement. I have also played around with speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant.

Here is an example of how I do the CDF raw data fetch in order to get good agreement.

# Fetch raw (non-aggregated) datapoints for the time series.
ts = self.cdf_client.time_series.data.retrieve_dataframe(
    external_id=get_ts_external_id_from_name(name=self.ts_name, client=self.cdf_client),
    start=self.start_time,
    end=end_time,
    aggregates=None,
    granularity=None,
    limit=None,
)

ts = (
    ts.tz_localize("UTC")                                   # raw timestamps come back as UTC
    .tz_convert("CET")
    .shift(freq=pd.Timedelta(self.sampling_interval) / 2)   # +half an interval, so each resample bin is centred on its label
    .resample(self.sampling_interval)
    .mean()                                                 # simple (unweighted) mean per period
    .interpolate()                                          # fill any empty periods
)

 


4 replies


The timestamps mark the beginning of each “granularity period” (as you correctly inferred). You can read more about that here: https://developer.cognite.com/dev/concepts/aggregation/#aggregating-data-points
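If you want the timestamps centred on the periods instead, one option is simply to shift the returned index by half the granularity after fetching. A rough sketch (client, the external id, granularity and time range below are all placeholders, not from your setup):

import pandas as pd

granularity = "1h"  # placeholder granularity

# Fetch average aggregates; CDF stamps each value at the start of its period.
df = client.time_series.data.retrieve_dataframe(
    external_id="my-ts-external-id",  # placeholder external id
    start=start_time,
    end=end_time,
    aggregates="average",
    granularity=granularity,
)

# Re-centre each timestamp on the middle of its aggregation period.
df.index = df.index + pd.Timedelta(granularity) / 2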

Depending on what your sampling interval is, it may be possible to shift it to match PI. E.g. “1 hour” can be replaced by “60 minutes”, and then you can use start to control/shift the sampling period. See the image example:

 

Just a note regarding .resample(self.sampling_interval).mean().interpolate(): this will compute the simple average (i.e. “sum of values / number of values”) instead of the time-weighted average returned by CDF. As this is a workaround I guess you already know this, but it is still worth mentioning!
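If you ever want to approximate the time-weighted behaviour from raw data in pandas, a rough sketch is to weight each raw value by how long it “holds” until the next sample. The function name below is just a placeholder, and this is only an approximation of what CDF does; see the aggregation docs linked above for the exact definition:

import pandas as pd

def time_weighted_mean(raw: pd.Series, granularity: str) -> pd.Series:
    # Seconds each raw value is "valid" until the next sample arrives.
    hold = raw.index.to_series().diff().shift(-1).dt.total_seconds().fillna(0.0)
    # Duration-weighted sum and total covered time per aggregation period.
    weighted_sum = (raw * hold).resample(granularity).sum()
    total_time = hold.resample(granularity).sum()
    return weighted_sum / total_time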

…and a final note regarding “I have also played around with speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant.”: all the retrieve endpoints for datapoints are already quite heavily optimised, so unless you have multiple credentials to circumvent rate limiting, your best course of action is to pass all the time series you want to fetch in a single call (also, retrieve_arrays has the best overall performance).
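For reference, passing several series in one call looks roughly like this (the external ids and granularity are placeholders):

# Sketch: fetch aggregates for several time series in a single request.
res = client.time_series.data.retrieve_arrays(
    external_id=["ts-a", "ts-b", "ts-c"],  # placeholder external ids
    start=start_time,
    end=end_time,
    aggregates="average",
    granularity="1h",
)
df = res.to_pandas()  # one column per time series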

Thank you for the quick reply!

Firstly, I was not aware that the CDF aggregation was a time-weighted average, thank you for mentioning this!

If I shift the start time as you showed, I still get a shift in the CDF data features compared to PI. Here is a screenshot:

CDF vs PI: Start time in CDF fetch offset +0.5 units compared to start time for PI fetch.

 

If I additionally shift the CDF data by 0.5 units, the features overlap much better:

CDF vs PI: CDF index additionally shifted 0.5 units to get better match.

However, this means the timestamps are no longer equal for the CDF and PI data. That is not a huge issue if I just compute comparison metrics (RMSE, max errors, percentage errors, etc.), but it adds extra details that need to be documented and explained.
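For completeness, the metrics themselves are simple once the two series are aligned to a common index; roughly something like this (cdf and pi here are just placeholder names for the two aligned pandas Series):

import numpy as np

diff = cdf - pi                                               # assumes identical, aligned indexes
rmse = float(np.sqrt((diff ** 2).mean()))                     # root mean square error
max_abs_error = float(diff.abs().max())                       # maximum absolute error
mean_pct_error = float((diff.abs() / pi.abs()).mean() * 100)  # mean percentage error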

Perhaps I misunderstood something in your example. For reference, here is the fetch code:

# Fetch average aggregates, with the start time offset by half a sampling interval.
cdf = client.time_series.data.retrieve_arrays(
    external_id=get_ts_external_id_from_name(name=ts_name, client=client),
    start=start_time + 0.5 * pd.Timedelta(sampling_interval),  # offset relative to the PI fetch
    limit=max_count,
    aggregates="average",
    granularity=sampling_interval,
)

cdf = (
    cdf.to_pandas()      # convert the DatapointsArray to a dataframe
    .squeeze()           # single column -> Series
    .tz_localize("UTC")  # aggregate timestamps come back as UTC
    .tz_convert("CET")
)

 

So I get exactly the same results regardless of whether I shift the start_time parameter in the CDF fetch relative to the one used for the PI fetch.


Hi @Anders Brakestad,

I’m the product manager for our Time Series services, and I’m keen to learn about any observations, good or otherwise, that you find from your evaluation. Would you be available to have a short call with me when you’ve concluded your research?

Kind Regards, Glen

Good morning Glen!

Sure, I’ll be happy to share what I find from my small investigations. I’ll keep in touch!
