I am trying to make a comparison between time series data from CDF and from Pi (accessed with the Seeq Python API). Some questions arise:
- How do I ensure that the DatetimeIndex objects that are returned are identical?
- Why does CDF return a DatetimeIndex with dates before my start value?
Examples:
Pulling a raw time series from Pi via Seeq looks like this:
from datetime import datetime
from seeq import spy

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
spy.pull(items, start=start, end=end, grid=None, quiet=True)
And the CDF analog looks like this:
start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
cognite.datapoints.retrieve_dataframe(
    external_id=external_id,
    start=start,
    end=end,
    granularity=None
)
The resulting DatetimeIndex objects are not identical:
From Pi via Seeq:
DatetimeIndex(['2021-01-01 00:00:00+01:00', '2021-01-01 00:00:05+01:00',
'2021-01-01 00:00:10+01:00', '2021-01-01 00:00:15+01:00',
'2021-01-01 00:00:20+01:00', '2021-01-01 00:00:25+01:00',
'2021-01-01 00:00:30+01:00', '2021-01-01 00:00:35+01:00',
'2021-01-01 00:00:40+01:00', '2021-01-01 00:00:45+01:00',
...
'2021-01-01 23:59:15+01:00', '2021-01-01 23:59:20+01:00',
'2021-01-01 23:59:25+01:00', '2021-01-01 23:59:30+01:00',
'2021-01-01 23:59:35+01:00', '2021-01-01 23:59:40+01:00',
'2021-01-01 23:59:45+01:00', '2021-01-01 23:59:50+01:00',
'2021-01-01 23:59:55+01:00', '2021-01-02 00:00:00+01:00'],
dtype='datetime64[ns, CET]', length=17425, freq=None)
From CDF:
DatetimeIndex(['2020-12-31 23:00:00', '2020-12-31 23:00:05',
'2020-12-31 23:00:10', '2020-12-31 23:00:15',
'2020-12-31 23:00:20', '2020-12-31 23:00:25',
'2020-12-31 23:00:30', '2020-12-31 23:00:35',
'2020-12-31 23:00:40', '2020-12-31 23:00:45',
...
'2021-01-01 22:59:10', '2021-01-01 22:59:15',
'2021-01-01 22:59:20', '2021-01-01 22:59:25',
'2021-01-01 22:59:30', '2021-01-01 22:59:35',
'2021-01-01 22:59:40', '2021-01-01 22:59:45',
'2021-01-01 22:59:50', '2021-01-01 22:59:55'],
dtype='datetime64[ns]', length=17371, freq=None)
I notice a few things here:
- The number of data points differs: 17425 from Seeq vs 17371 from CDF.
- The index from Seeq has CET time zone information, while the index from CDF is naïve. Looking at the source code in datapoints.py, I do not see any explicit tz processing. The relevant code snippet is this:
start = pd.Timestamp(min(q.start for q in fetcher.agg_queries), unit="ms")
end = pd.Timestamp(max(q.end for q in fetcher.agg_queries), unit="ms")
(granularity,) = grans_given
# Pandas understand "Cognite granularities" except `m` (minutes) which we must translate:
freq = cast(str, granularity).replace("m", "T")
return df.reindex(pd.date_range(start=start, end=end, freq=freq, inclusive="left"))
- The CDF index contains data from before the start date specified. The docs say that start is inclusive, and I take that to mean that the interval is [start, end), so that the dates need to be greater than or equal to start and smaller than end. Am I missing something?
- The CDF index seems to be shifted one hour earlier than the Seeq index. Is this due to possible time zone differences? Is the following then an accurate and robust way to align the indices?
df_cdf.index = df_cdf.index.shift(1, "h").tz_localize("CET")
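One alternative I am considering is sketched below. It rests on an assumption I have not confirmed: that the naïve CDF timestamps are in UTC, which the one-hour winter offset suggests. The `df_cdf` frame here is a small stand-in built by hand, not real data.

```python
import pandas as pd

# Stand-in for the CDF result: a naive index, ASSUMED to be in UTC.
df_cdf = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(
        ["2020-12-31 23:00:00", "2020-12-31 23:00:05", "2020-12-31 23:00:10"]
    ),
)

# Attach UTC first, then convert to CET. Unlike a fixed one-hour shift,
# the conversion also applies the +02:00 offset during summer time.
aligned = df_cdf.copy()
aligned.index = aligned.index.tz_localize("UTC").tz_convert("CET")
print(aligned.index[0])
```

If the UTC assumption holds, the first CDF timestamp maps to 2021-01-01 00:00:00+01:00, which would mean it is not actually before the start I asked for.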
Aligning the indices is just a technical matter. I am more concerned about the mismatch in the number of observations. CDF ingests Pi time series, and when retrieving the raw data I would expect the indices to be identical. Is there perhaps something I am missing here?
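To find out which timestamps differ, rather than only how many, a set difference on the aligned indices seems like a reasonable first check. The two indices below are tiny made-up stand-ins (not my real data), both already expressed in CET.

```python
import pandas as pd

# Hypothetical stand-ins for the two aligned indices.
idx_seeq = pd.DatetimeIndex(
    ["2021-01-01 00:00:00", "2021-01-01 00:00:05", "2021-01-01 00:00:10"]
).tz_localize("CET")
idx_cdf = pd.DatetimeIndex(
    ["2021-01-01 00:00:00", "2021-01-01 00:00:10"]
).tz_localize("CET")

# Timestamps present in one source but not in the other.
only_in_seeq = idx_seeq.difference(idx_cdf)
only_in_cdf = idx_cdf.difference(idx_seeq)
print(len(only_in_seeq), len(only_in_cdf))
```

Inspecting `only_in_seeq` and `only_in_cdf` on the real data should show whether the extra points cluster at the interval boundaries or are spread throughout the day.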
All help much appreciated! :)
Anders