Skip to main content
Solved

How to ensure identical DateTimeIndex in TS data from CDF and Pi (via Seeq API)


Anders Brakestad
Seasoned

I am trying to make a comparison between time series data from CDF and from Pi (accessed with the Seeq Python API). Some questions arise:​​​​​

  • How do I ensure that the DateTimeIndex-s that are returned are identical?
  • Why does CDF return DateTimeIndex with dates before my start value?

Examples:

Pulling a raw time series from Pi via Seeq looks like this:

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
spy.pull(items, start=start, end=end, grid=None, quiet=True)

And the CDF analog looks like this:

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
cognite.datapoints.retrieve_dataframe(
    external_id=external_id,
    start=start,
    end=end,
    granularity=None
)

The resulting DateTimeIndex-es are not identical:

From Pi via Seeq:

DatetimeIndex(['2021-01-01 00:00:00+01:00', '2021-01-01 00:00:05+01:00',
               '2021-01-01 00:00:10+01:00', '2021-01-01 00:00:15+01:00',
               '2021-01-01 00:00:20+01:00', '2021-01-01 00:00:25+01:00',
               '2021-01-01 00:00:30+01:00', '2021-01-01 00:00:35+01:00',
               '2021-01-01 00:00:40+01:00', '2021-01-01 00:00:45+01:00',
               ...
               '2021-01-01 23:59:15+01:00', '2021-01-01 23:59:20+01:00',
               '2021-01-01 23:59:25+01:00', '2021-01-01 23:59:30+01:00',
               '2021-01-01 23:59:35+01:00', '2021-01-01 23:59:40+01:00',
               '2021-01-01 23:59:45+01:00', '2021-01-01 23:59:50+01:00',
               '2021-01-01 23:59:55+01:00', '2021-01-02 00:00:00+01:00'],
              dtype='datetime64[ns, CET]', length=17425, freq=None)

From CDF:

DatetimeIndex(['2020-12-31 23:00:00', '2020-12-31 23:00:05',
               '2020-12-31 23:00:10', '2020-12-31 23:00:15',
               '2020-12-31 23:00:20', '2020-12-31 23:00:25',
               '2020-12-31 23:00:30', '2020-12-31 23:00:35',
               '2020-12-31 23:00:40', '2020-12-31 23:00:45',
               ...
               '2021-01-01 22:59:10', '2021-01-01 22:59:15',
               '2021-01-01 22:59:20', '2021-01-01 22:59:25',
               '2021-01-01 22:59:30', '2021-01-01 22:59:35',
               '2021-01-01 22:59:40', '2021-01-01 22:59:45',
               '2021-01-01 22:59:50', '2021-01-01 22:59:55'],
              dtype='datetime64[ns]', length=17371, freq=None)

I notice here a couple of things:

  1. The length of the time span is different, 17425 vs 17371.
  2. The index from Seeq has CET time zone information, while the index from CDF is naïve. Looking at the source code in datapoints.py I fail to notice any explicit tz processing. The relevant code snippet is this
start = pd.Timestamp(min(q.start for q in fetcher.agg_queries), unit="ms")
end = pd.Timestamp(max(q.end for q in fetcher.agg_queries), unit="ms")
(granularity,) = grans_given
# Pandas understand "Cognite granularities" except `m` (minutes) which we must translate:
freq = cast(str, granularity).replace("m", "T")
return df.reindex(pd.date_range(start=start, end=end, freq=freq, inclusive="left"))
  1. The CDF index contains data from before the start date specified. The docs says that start is inclusive, but I take that to mean that the interval [start, end) so that the dates need to be greater or equal to start and smaller than end. Am I missing something
  2. The CDF index seem to be shifted one hour earlier than the Seeq index. Is this due to possible time zone differences? Is the following then an accurate and robust way to align the indeces?
df_cdf.index = df_cdf.index.shift(1, "h").tz_localize("CET")

 

Aligning the indeces is just a technical matter - I am a bit more concerned about the mismatch in number of observations. CDF ingests Pi time series, and when recovering the raw data I would expect the indeces to be identical. Is there perhaps something I am missing here?

 

All help much appreciated! :)

Anders

Best answer by Håkon V. Treider

Time series data in CDF will always return UTC. In v2 of the service, the feature you are requesting might be available (to request data in a specific timezone).

In the meantime, please don’t convert timestamps manually. There are soooooo many non-intuitive traps to fall into. Since it seems you are already using pandas, let it do the conversion for you:

 

View original
Did this topic help you find an answer to your question?

7 replies

Stig Harald Gustavsen
Seasoned
Forum|alt.badge.img

Yeah, I also sett that. I use

datetime.utcnow()

and convert it manualy to my timezone by shifting the DF with one hour. then the code becomes independant on where ever the code is excecute from. 

But @Håkon V. Treider  has made two good guide for pyhton here: 

Where he on the the part 2 goes into how to do datetime stuff :) 


Stig Harald Gustavsen
Seasoned
Forum|alt.badge.img
#the dtype in pandas is in datetime64[ns] so when we view it in the np.int64 dtype we get the number og ns since unix epoc (1.1.1970)
#shifte datatime away from utc to local time zone
ns = df.index.view(np.int64) #converts it to unixepoc in nano seconds
ns += 1*60*60*10**9 #adjusting it 1 hours froward in nanoseconds
df.set_index(ns.view('datetime64[ns]'))

One way of doing it is this :) 

 


Anders Brakestad
Seasoned

Thanks. I looked at the guide, but I must admit I am not sure how to use it to help me now 😅

Am I correct to assume that all dates coming from the Cognite API are in UTC? And that it is not possible to pass timezone information to the API so that the returned dates actually correspond to what I actually was after?

So whenever I fetch a DataFrame from CDF I need to do a shift by one hour to manually convert from UTC to CET?


Stig Harald Gustavsen
Seasoned
Forum|alt.badge.img

Yeah, I just compare it to what i have available to me on the SCADA system or PI vision or some other place, and do the correction from that. 

Here is a code for getting the data from this current start of a production day, witch goes from 06:00 to 06:00 each day. 

dtn = datetime.now()
start = datetime(dtn.year,dtn.month,dtn.day,6)
end = 'now'

# This retreives data in utc atleast from the CDF project I'm working against
df = client.datapoints.retrieve(id=list_of_int_ids,tart=start, end=end).to_pandas()
#shifte datatime away from utc to local time zone witch is one our forward
ns = df.index.view(np.int64)
ns += 1*60*60*10**9 #adjusting it 1 hours froward such that timzones match.
df.set_index(ns.view('datetime64[ns]'))

But If there is away of getting the data straight from CDF in my local timezone by default that would be great to know. Hope some here can help for that, does anyone know if it can be automatically retrieve by using timezone environmental variables or something on the API / SDK level? 


Forum|alt.badge.img

Time series data in CDF will always return UTC. In v2 of the service, the feature you are requesting might be available (to request data in a specific timezone).

In the meantime, please don’t convert timestamps manually. There are soooooo many non-intuitive traps to fall into. Since it seems you are already using pandas, let it do the conversion for you:

 


Forum|alt.badge.img

Why does CDF return DateTimeIndex with dates before my start value?

I can safely say it doesn’t. Are you specifying data using `datetime` objects?

Edit: I see that you are. Python datetimes are always local time as per the language definiton of them (I guess 95% of people disagree with this, and I am one of them) anyway, we adhere to this. In the guide linked above there are a few ways to solve this, but the easiest is probably this (taken from the official Python SDK docs):

from datetime import datetime, timezone

start = datetime(2023, 1, 1, tzinfo=timezone.utc)


Anders Brakestad
Seasoned

Ahh, it was the 

df.tz_localize("UTC").tz_convert("CET")

pattern I was not able to come up with on my own :) Thanks! I’m trying to be very careful about the tz conversions, hence my question on this site. Thanks for the help!

Applying the tz conversion does align the Pi and CDF DateTimeIndex-s, apart from the extra timestamps present in the Pi DataFrame. If I drop those, then they match 100%. I suppose there is something with the ingestion pipeline that leads to some timestamps being missed…?

 


Reply


Cookie Policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

 
Cookie Settings