Solved

How to ensure identical DateTimeIndex in TS data from CDF and Pi (via Seeq API)

2 years ago
March 14, 2023
7 replies
122 views

Anders Brakestad
Seasoned
37 replies

I am trying to make a comparison between time series data from CDF and from Pi (accessed with the Seeq Python API). Some questions arise:

How do I ensure that the DateTimeIndex-s that are returned are identical?
Why does CDF return DateTimeIndex with dates before my start value?

Examples:

Pulling a raw time series from Pi via Seeq looks like this:

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
spy.pull(items, start=start, end=end, grid=None, quiet=True)

And the CDF analog looks like this:

start = datetime(2021, 1, 1)
end = datetime(2021, 1, 2)
cognite.datapoints.retrieve_dataframe(
    external_id=external_id,
    start=start,
    end=end,
    granularity=None
)

The resulting DateTimeIndex-es are not identical:

From Pi via Seeq:

DatetimeIndex(['2021-01-01 00:00:00+01:00', '2021-01-01 00:00:05+01:00',
               '2021-01-01 00:00:10+01:00', '2021-01-01 00:00:15+01:00',
               '2021-01-01 00:00:20+01:00', '2021-01-01 00:00:25+01:00',
               '2021-01-01 00:00:30+01:00', '2021-01-01 00:00:35+01:00',
               '2021-01-01 00:00:40+01:00', '2021-01-01 00:00:45+01:00',
               ...
               '2021-01-01 23:59:15+01:00', '2021-01-01 23:59:20+01:00',
               '2021-01-01 23:59:25+01:00', '2021-01-01 23:59:30+01:00',
               '2021-01-01 23:59:35+01:00', '2021-01-01 23:59:40+01:00',
               '2021-01-01 23:59:45+01:00', '2021-01-01 23:59:50+01:00',
               '2021-01-01 23:59:55+01:00', '2021-01-02 00:00:00+01:00'],
              dtype='datetime64[ns, CET]', length=17425, freq=None)

From CDF:

DatetimeIndex(['2020-12-31 23:00:00', '2020-12-31 23:00:05',
               '2020-12-31 23:00:10', '2020-12-31 23:00:15',
               '2020-12-31 23:00:20', '2020-12-31 23:00:25',
               '2020-12-31 23:00:30', '2020-12-31 23:00:35',
               '2020-12-31 23:00:40', '2020-12-31 23:00:45',
               ...
               '2021-01-01 22:59:10', '2021-01-01 22:59:15',
               '2021-01-01 22:59:20', '2021-01-01 22:59:25',
               '2021-01-01 22:59:30', '2021-01-01 22:59:35',
               '2021-01-01 22:59:40', '2021-01-01 22:59:45',
               '2021-01-01 22:59:50', '2021-01-01 22:59:55'],
              dtype='datetime64[ns]', length=17371, freq=None)

I notice here a couple of things:

The length of the time span is different, 17425 vs 17371.
The index from Seeq has CET time zone information, while the index from CDF is naïve. Looking at the source code in datapoints.py I fail to notice any explicit tz processing. The relevant code snippet is this

start = pd.Timestamp(min(q.start for q in fetcher.agg_queries), unit="ms")
end = pd.Timestamp(max(q.end for q in fetcher.agg_queries), unit="ms")
(granularity,) = grans_given
# Pandas understand "Cognite granularities" except `m` (minutes) which we must translate:
freq = cast(str, granularity).replace("m", "T")
return df.reindex(pd.date_range(start=start, end=end, freq=freq, inclusive="left"))

The CDF index contains data from before the start date specified. The docs says that start is inclusive, but I take that to mean that the interval [start, end) so that the dates need to be greater or equal to start and smaller than end. Am I missing something
The CDF index seem to be shifted one hour earlier than the Seeq index. Is this due to possible time zone differences? Is the following then an accurate and robust way to align the indeces?

df_cdf.index = df_cdf.index.shift(1, "h").tz_localize("CET")

Aligning the indeces is just a technical matter - I am a bit more concerned about the mismatch in number of observations. CDF ingests Pi time series, and when recovering the raw data I would expect the indeces to be identical. Is there perhaps something I am missing here?

All help much appreciated! :)

Anders

Best answer by Håkon V. Treider

Time series data in CDF will always return UTC. In v2 of the service, the feature you are requesting might be available (to request data in a specific timezone).

In the meantime, please don’t convert timestamps manually. There are soooooo many non-intuitive traps to fall into. Since it seems you are already using pandas, let it do the conversion for you:

View original

Did this topic help you find an answer to your question?

Stig Harald Gustavsen
Seasoned
26 replies
2 years ago
March 14, 2023

Yeah, I also sett that. I use

datetime.utcnow()

and convert it manualy to my timezone by shifting the DF with one hour. then the code becomes independant on where ever the code is excecute from.

But @Håkon V. Treider has made two good guide for pyhton here:

Cognites guide to code quality in Python - Part 2

Where he on the the part 2 goes into how to do datetime stuff :)

Stig

Stig Harald Gustavsen
Seasoned
26 replies
2 years ago
March 14, 2023

#the dtype in pandas is in datetime64[ns] so when we view it in the np.int64 dtype we get the number og ns since unix epoc (1.1.1970)
#shifte datatime away from utc to local time zone
ns = df.index.view(np.int64) #converts it to unixepoc in nano seconds
ns += 1*60*60*10**9 #adjusting it 1 hours froward in nanoseconds
df.set_index(ns.view('datetime64[ns]'))

One way of doing it is this :)

Stig

Anders Brakestad
Author
Seasoned
37 replies
2 years ago
March 14, 2023

Thanks. I looked at the guide, but I must admit I am not sure how to use it to help me now 😅

Am I correct to assume that all dates coming from the Cognite API are in UTC? And that it is not possible to pass timezone information to the API so that the returned dates actually correspond to what I actually was after?

So whenever I fetch a DataFrame from CDF I need to do a shift by one hour to manually convert from UTC to CET?

Stig Harald Gustavsen
Seasoned
26 replies
2 years ago
March 14, 2023

Yeah, I just compare it to what i have available to me on the SCADA system or PI vision or some other place, and do the correction from that.

Here is a code for getting the data from this current start of a production day, witch goes from 06:00 to 06:00 each day.

dtn = datetime.now()
start = datetime(dtn.year,dtn.month,dtn.day,6)
end = 'now'

# This retreives data in utc atleast from the CDF project I'm working against
df = client.datapoints.retrieve(id=list_of_int_ids,tart=start, end=end).to_pandas()
#shifte datatime away from utc to local time zone witch is one our forward
ns = df.index.view(np.int64)
ns += 1*60*60*10**9 #adjusting it 1 hours froward such that timzones match.
df.set_index(ns.view('datetime64[ns]'))

But If there is away of getting the data straight from CDF in my local timezone by default that would be great to know. Hope some here can help for that, does anyone know if it can be automatically retrieve by using timezone environmental variables or something on the API / SDK level?

Stig

Håkon V. Treider
Practitioner
81 replies
Answer
2 years ago
March 14, 2023

Time series data in CDF will always return UTC. In v2 of the service, the feature you are requesting might be available (to request data in a specific timezone).

In the meantime, please don’t convert timestamps manually. There are soooooo many non-intuitive traps to fall into. Since it seems you are already using pandas, let it do the conversion for you:

Håkon V. Treider
Practitioner
81 replies
2 years ago
March 14, 2023

Why does CDF return DateTimeIndex with dates before my start value?

I can safely say it doesn’t. Are you specifying data using `datetime` objects?

Edit: I see that you are. Python datetimes are always local time as per the language definiton of them (I guess 95% of people disagree with this, and I am one of them) anyway, we adhere to this. In the guide linked above there are a few ways to solve this, but the easiest is probably this (taken from the official Python SDK docs):

from datetime import datetime, timezone

start = datetime(2023, 1, 1, tzinfo=timezone.utc)

Anders Brakestad
Author
Seasoned
37 replies
2 years ago
March 14, 2023

Ahh, it was the

df.tz_localize("UTC").tz_convert("CET")

pattern I was not able to come up with on my own :) Thanks! I’m trying to be very careful about the tz conversions, hence my question on this site. Thanks for the help!

Applying the tz conversion does align the Pi and CDF DateTimeIndex-s, apart from the extra timestamps present in the Pi DataFrame. If I drop those, then they match 100%. I suppose there is something with the ingestion pipeline that leads to some timestamps being missed…?

Reply

Cookie Policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos

Reply

Related topics

Save data extracted with OPC UA extractor to raw staging area in CDF or link timeseries received to current CDF assets currenticon

Product Release Spotlight - December 2024 Release

How to run the OPC UA extractor with Docker [Cognite Official]

Product Release Spotlight - September 2024 Release

Try out the new Python SDK for Transformations

Sign up

Log in to the community

Scanning file for viruses.

This file cannot be downloaded

Cookie Policy

Cookie settings