Solved

How to ensure identical DateTimeIndex in TS data from CDF and Pi (via Seeq API)

  • 14 March 2023
  • 7 replies
  • 106 views

I am trying to make a comparison between time series data from CDF and from Pi (accessed with the Seeq Python API). Some questions arise:​​​​​

  • How do I ensure that the DateTimeIndex-s that are returned are identical?
  • Why does CDF return DateTimeIndex with dates before my start value?

Examples:

Pulling a raw time series from Pi via Seeq looks like this:

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
spy.pull(items, start=start, end=end, grid=None, quiet=True)

And the CDF analog looks like this:

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
cognite.datapoints.retrieve_dataframe(
external_id=external_id,
start=start,
end=end,
granularity=None
)

The resulting DateTimeIndex-es are not identical:

From Pi via Seeq:

DatetimeIndex(['2021-01-01 00:00:00+01:00', '2021-01-01 00:00:05+01:00',
'2021-01-01 00:00:10+01:00', '2021-01-01 00:00:15+01:00',
'2021-01-01 00:00:20+01:00', '2021-01-01 00:00:25+01:00',
'2021-01-01 00:00:30+01:00', '2021-01-01 00:00:35+01:00',
'2021-01-01 00:00:40+01:00', '2021-01-01 00:00:45+01:00',
...
'2021-01-01 23:59:15+01:00', '2021-01-01 23:59:20+01:00',
'2021-01-01 23:59:25+01:00', '2021-01-01 23:59:30+01:00',
'2021-01-01 23:59:35+01:00', '2021-01-01 23:59:40+01:00',
'2021-01-01 23:59:45+01:00', '2021-01-01 23:59:50+01:00',
'2021-01-01 23:59:55+01:00', '2021-01-02 00:00:00+01:00'],
dtype='datetime64[ns, CET]', length=17425, freq=None)

From CDF:

DatetimeIndex(['2020-12-31 23:00:00', '2020-12-31 23:00:05',
'2020-12-31 23:00:10', '2020-12-31 23:00:15',
'2020-12-31 23:00:20', '2020-12-31 23:00:25',
'2020-12-31 23:00:30', '2020-12-31 23:00:35',
'2020-12-31 23:00:40', '2020-12-31 23:00:45',
...
'2021-01-01 22:59:10', '2021-01-01 22:59:15',
'2021-01-01 22:59:20', '2021-01-01 22:59:25',
'2021-01-01 22:59:30', '2021-01-01 22:59:35',
'2021-01-01 22:59:40', '2021-01-01 22:59:45',
'2021-01-01 22:59:50', '2021-01-01 22:59:55'],
dtype='datetime64[ns]', length=17371, freq=None)

I notice here a couple of things:

  1. The length of the time span is different, 17425 vs 17371.
  2. The index from Seeq has CET time zone information, while the index from CDF is naïve. Looking at the source code in datapoints.py I fail to notice any explicit tz processing. The relevant code snippet is this
start = pd.Timestamp(min(q.start for q in fetcher.agg_queries), unit="ms")
end = pd.Timestamp(max(q.end for q in fetcher.agg_queries), unit="ms")
(granularity,) = grans_given
# Pandas understand "Cognite granularities" except `m` (minutes) which we must translate:
freq = cast(str, granularity).replace("m", "T")
return df.reindex(pd.date_range(start=start, end=end, freq=freq, inclusive="left"))
  1. The CDF index contains data from before the start date specified. The docs says that start is inclusive, but I take that to mean that the interval [start, end) so that the dates need to be greater or equal to start and smaller than end. Am I missing something
  2. The CDF index seem to be shifted one hour earlier than the Seeq index. Is this due to possible time zone differences? Is the following then an accurate and robust way to align the indeces?
df_cdf.index = df_cdf.index.shift(1, "h").tz_localize("CET")

 

Aligning the indeces is just a technical matter - I am a bit more concerned about the mismatch in number of observations. CDF ingests Pi time series, and when recovering the raw data I would expect the indeces to be identical. Is there perhaps something I am missing here?

 

All help much appreciated! :)

Anders

icon

Best answer by Håkon V. Treider 14 March 2023, 14:29

View original

7 replies

Userlevel 3
Badge

Yeah, I also sett that. I use

datetime.utcnow()

and convert it manualy to my timezone by shifting the DF with one hour. then the code becomes independant on where ever the code is excecute from. 

But @Håkon V. Treider  has made two good guide for pyhton here: 

Where he on the the part 2 goes into how to do datetime stuff :) 

Userlevel 3
Badge
#the dtype in pandas is in datetime64[ns] so when we view it in the np.int64 dtype we get the number og ns since unix epoc (1.1.1970)
#shifte datatime away from utc to local time zone
ns = df.index.view(np.int64) #converts it to unixepoc in nano seconds
ns += 1*60*60*10**9 #adjusting it 1 hours froward in nanoseconds
df.set_index(ns.view('datetime64[ns]'))

One way of doing it is this :) 

 

Thanks. I looked at the guide, but I must admit I am not sure how to use it to help me now 😅

Am I correct to assume that all dates coming from the Cognite API are in UTC? And that it is not possible to pass timezone information to the API so that the returned dates actually correspond to what I actually was after?

So whenever I fetch a DataFrame from CDF I need to do a shift by one hour to manually convert from UTC to CET?

Userlevel 3
Badge

Yeah, I just compare it to what i have available to me on the SCADA system or PI vision or some other place, and do the correction from that. 

Here is a code for getting the data from this current start of a production day, witch goes from 06:00 to 06:00 each day. 

dtn = datetime.now()
start = datetime(dtn.year,dtn.month,dtn.day,6)
end = 'now'

# This retreives data in utc atleast from the CDF project I'm working against
df = client.datapoints.retrieve(id=list_of_int_ids,tart=start, end=end).to_pandas()
#shifte datatime away from utc to local time zone witch is one our forward
ns = df.index.view(np.int64)
ns += 1*60*60*10**9 #adjusting it 1 hours froward such that timzones match.
df.set_index(ns.view('datetime64[ns]'))

But If there is away of getting the data straight from CDF in my local timezone by default that would be great to know. Hope some here can help for that, does anyone know if it can be automatically retrieve by using timezone environmental variables or something on the API / SDK level? 

Userlevel 4
Badge

Time series data in CDF will always return UTC. In v2 of the service, the feature you are requesting might be available (to request data in a specific timezone).

In the meantime, please don’t convert timestamps manually. There are soooooo many non-intuitive traps to fall into. Since it seems you are already using pandas, let it do the conversion for you:

 

Userlevel 4
Badge

Why does CDF return DateTimeIndex with dates before my start value?

I can safely say it doesn’t. Are you specifying data using `datetime` objects?

Edit: I see that you are. Python datetimes are always local time as per the language definiton of them (I guess 95% of people disagree with this, and I am one of them) anyway, we adhere to this. In the guide linked above there are a few ways to solve this, but the easiest is probably this (taken from the official Python SDK docs):

from datetime import datetime, timezone

start = datetime(2023, 1, 1, tzinfo=timezone.utc)

Ahh, it was the 

df.tz_localize("UTC").tz_convert("CET")

pattern I was not able to come up with on my own :) Thanks! I’m trying to be very careful about the tz conversions, hence my question on this site. Thanks for the help!

Applying the tz conversion does align the Pi and CDF DateTimeIndex-s, apart from the extra timestamps present in the Pi DataFrame. If I drop those, then they match 100%. I suppose there is something with the ingestion pipeline that leads to some timestamps being missed…?

 

Reply