The timestamps mark the beginning of each “granularity period” (as you correctly inferred). You can read that and more here: https://developer.cognite.com/dev/concepts/aggregation/#aggregating-data-points

Depending on what your sampling interval is, it may be possible to shift compared to PI. E.g. 1 hour can be replaced by 60 minutes; then, with start, you can control/shift the sampling period. See image example:

Just a note regarding using .resample(self.sampling_interval).mean().interpolate(): this computes the simple average (i.e. “sum of values / number of values”) instead of the time-weighted average returned by CDF. As this is a workaround, I guess you already know this, but it is still worth mentioning, I think!

...and a final note on “I have also played around with speeding up the raw data fetch by chunking the time periods and fetching with multiple threads or processes, but the speedup is not significant”: all the retrieve endpoints for datapoints are already quite heavily optimised.
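If it helps, here is a minimal pandas sketch of the shifting idea above; the column name, timestamps and offsets are all made up for illustration:

```python
import pandas as pd

# Hypothetical raw datapoints indexed by timestamp, one value column named "value"
df = pd.DataFrame(
    {"value": [1.0, 2.0, 4.0, 8.0]},
    index=pd.to_datetime(
        ["2023-01-01 00:10", "2023-01-01 00:40", "2023-01-01 01:10", "2023-01-01 01:40"]
    ),
)

# "1 hour" expressed as 60 minutes, with the bins shifted by 30 minutes.
# Note: .mean() is a simple average, not the time-weighted average CDF returns.
shifted = df.resample("60min", offset="30min").mean().interpolate()
print(shifted)
```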
Just a heads up: the cognite-sdk is now available on conda-forge (and has been since Oct last year) 😄.
The error message is very specific here, so I expect you have a NaN value for one of the regions (which, in the pandas world, is equivalent to missing).
Hello @Håkon V. Treider, We tried implementing the given solution in our code as mentioned above. We have an existing multi-threading logic implemented in our code; when we use the del conc._THREAD_POOL_EXECUTOR_SINGLETON together with that, we get the performance numbers as expected. But when we remove the multi-threading logic, the performance does not improve. We are using concurrent.futures.ThreadPoolExecutor for implementing multi-threading in our own code. Please suggest a way in which we can use the SDK calls alone with your solution, so that we can get the ideal performance.

Since the SDK already parallelizes its calls for you, I think you should not wrap it inside another thread pool executor. Maybe you can elaborate on why you use this pattern? If there is a need for this, please share some code and we’ll figure out together how to make it as performant as you need!
(...) Can you please explain what you meant by fetching aggregates of numeric data points? Assuming we have around 6000 time series and half a million data points, what should the max workers be to attain a performance of ~20 sec for a full fetch?

You should be able to run basically the same test as above to figure that out. My previous comment was about how to measure the performance of fetching aggregates. There are a total of 10 different aggregates you may fetch: average, continuous_variance, count, discrete_variance, interpolation, max, min, step_interpolation, sum and total_variation. The time it takes to fetch a single aggregate is about the same as fetching all of them.
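For reference, here is a minimal sketch of what fetching aggregates looks like with the Python SDK; the external ID, time range and chosen aggregates are made up:

```python
from cognite.client import CogniteClient

client = CogniteClient()  # assumes a configured client; replace with your own setup

dps = client.time_series.data.retrieve(
    external_id="my-ts",                    # hypothetical time series
    start="30d-ago",
    end="now",
    aggregates=["average", "max", "min"],   # a subset, or all 10, at roughly the same cost
    granularity="1h",
)
print(dps.to_pandas().head())
```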
Time for some actual benchmark code! It is surprisingly hard to do right, as we use a singleton pattern in the SDK for the thread pool executor that executes all the parallel calls. Hence, in order to resize the number of threads it has available, we need to explicitly re-create it. This probably explains why, in your previous tests, you saw little to no effect after altering various configuration options.

The code below assumes a function exists that can provide you with a CogniteClient; replace that with your own. For the tests, I use a previously created time series that has 5 million datapoints randomly distributed in time between 1970 and now.

Note: it also assumes that you are testing performance for raw (numeric) datapoints, i.e. no aggregates. For aggregates, you need to decide how to count, as fetching the 10 different aggregates is not that much slower than fetching just a single one.
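A sketch along those lines is shown below. The helper get_cognite_client and the time series external ID are placeholders, and the config/singleton details assume the SDK versions discussed in this thread:

```python
import time
from timeit import default_timer as timer

import cognite.client.utils._concurrency as conc

client = get_cognite_client()           # placeholder: your own client factory
XID = "benchmark-5M-random-datapoints"  # placeholder: a time series with many datapoints

for max_workers in (1, 2, 5, 10, 20):
    client.config.max_workers = max_workers
    # The SDK caches its thread pool in a module-level singleton (in the SDK versions
    # this thread refers to), so it must be dropped for the new size to take effect:
    if hasattr(conc, "_THREAD_POOL_EXECUTOR_SINGLETON"):
        del conc._THREAD_POOL_EXECUTOR_SINGLETON

    t0 = timer()
    dps = client.time_series.data.retrieve(external_id=XID, start=0, end="now")
    elapsed = timer() - t0
    print(f"max_workers={max_workers}: {len(dps):,} datapoints in {elapsed:.2f} s")
    time.sleep(5)  # small pause between runs so rate limiting budgets can recover
```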
Thanks for sharing your benchmark code. I have a few comments that may help shed some light on performance.

The default number of max workers in the SDK, 10, is quite a high value already. For comparison, for other resource types (less performant than datapoints), the API will ignore a partitioning count higher than 10. I would suggest you do performance testing with 1, 2, 5, and 10 perhaps? ...and I don’t think you’ll need to touch the connection pool size at all.

Whenever you are calling the API, you have a “burst budget” and a “sustained budget”; these are quite different. Btw, this is referred to as rate limiting, and it is a vital component in protecting the API. After the initial burst budget is spent, the API will force you to slow down by replying with status code 429. Because these are automatically retried by the SDK after a self-imposed timeout (you may want to read up on the exponential backoff strategy with smearing which we use), you won’t notice this except as a lower throughput.
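To illustrate the retry strategy mentioned above, here is a generic sketch of exponential backoff with smearing; it is not the SDK’s exact implementation, just the idea:

```python
import random


def backoff_with_smearing(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Return how long to sleep before retry number `attempt`.

    The delay grows exponentially but is capped, and a random "smear" (jitter)
    is applied so that many clients hitting 429 do not all retry at the same instant.
    """
    delay = min(cap, base * 2 ** attempt)
    return delay * random.uniform(0.5, 1.0)


# Example: delays (in seconds) for the first five retries
print([round(backoff_with_smearing(i), 2) for i in range(5)])
```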
You mention the Python SDK in the title, but link to the official API documentation. Have you checked out the specific SDK documentation? https://cognite-sdk-python.readthedocs-hosted.com/en/latest/data_modeling.html#instances

We’d love your feedback regarding which examples are unclear or which use cases are missing.
I only know the Python SDK, but I don’t see how you would easily do (1) without implementing some custom logic yourself, since limit only looks forward in time.

For the second question, this is basic functionality and well supported! Link to the relevant part of the documentation here: https://cognite-sdk-python.readthedocs-hosted.com/en/latest/core_data_model.html#retrieve-datapoints

Let me know if you have any follow-up questions 😄
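For completeness, a minimal retrieval sketch; the external ID, time range and limit are made up:

```python
from cognite.client import CogniteClient

client = CogniteClient()  # assumes a configured client; replace with your own setup

dps = client.time_series.data.retrieve(
    external_id="my-ts",   # hypothetical time series
    start="7d-ago",
    end="now",
    limit=1000,            # note: limit counts forward from `start`
)
df = dps.to_pandas()
```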
When you say “Cognite SDK”, which are you referring to? E.g. there’s JavaScript, Python and more.
Hi @eashwar11. There are two steps here:

1. Making sure the time series exist and are linked to specific assets.
2. Writing datapoints to these time series.

To tackle the first step, I sent you several links in my previous post. You can for example loop through df.columns and create TimeSeries objects that you then pass to time_series.create(...), something like the following:

```python
from cognite.client.data_classes import TimeSeries

new_ts = [
    TimeSeries(
        name=...,
        external_id=xid,
        asset_id=...,
    )
    for xid in df.columns
]
client.time_series.create(new_ts)
```

For step 2, you may simply use insert_dataframe once all time series have been created.
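For step 2, a minimal sketch, assuming a recent SDK version and that the dataframe index holds timestamps while the column names are the external IDs used above:

```python
import pandas as pd

# df: index = timestamps, columns = time series external IDs, values = the datapoints
df.index = pd.to_datetime(df.index)           # the index must be a datetime index
client.time_series.data.insert_dataframe(df)  # columns are matched to external IDs by default
```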
I have added some links to the documentation, which has great examples for you to follow/copy/modify to your needs. If the column names of your data frame (df.columns) correspond to the external IDs of time series, then you can (a small sketch follows after this list):

- If the time series is already created, use update (with the TimeSeriesUpdate object): https://cognite-sdk-python.readthedocs-hosted.com/en/latest/core_data_model.html#update-time-series
- If it doesn’t exist yet, add the asset link by passing asset_id when you create it (use the TimeSeries object): https://cognite-sdk-python.readthedocs-hosted.com/en/latest/core_data_model.html#create-time-series
- If you have several time series, some in need of updating and some to be created, you can use upsert: https://cognite-sdk-python.readthedocs-hosted.com/en/latest/core_data_model.html#upsert-time-series
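A minimal sketch of the update path; the asset ID (123) is made up and the external IDs are taken from df.columns as above:

```python
from cognite.client.data_classes import TimeSeriesUpdate

updates = [
    TimeSeriesUpdate(external_id=xid).asset_id.set(123)  # hypothetical asset ID
    for xid in df.columns
]
client.time_series.update(updates)
```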
In RAW, all columns are stored as a string. The documentation says:

“The columns object is a key value object, where the key corresponds to the column name while the value is the column value. It supports all the valid types of values in JSON, so number, string, array, and even nested JSON structure (see payload example to the right).”

I think the correct answer to the question is that there is no enforced schema, so you can write data with different types to the same column.
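To make the “no enforced schema” point concrete, here is a small sketch; the database, table, row keys and column name are made up:

```python
# Two rows writing different JSON types to the same column "reading":
client.raw.rows.insert(
    db_name="my_db",          # hypothetical database
    table_name="my_table",    # hypothetical table
    row={
        "row-1": {"reading": 42.5},             # number
        "row-2": {"reading": "not available"},  # string
    },
    ensure_parent=True,  # create the database/table if they do not exist
)
```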
I like to use the token inspect endpoint, available at:

```python
token_obj = client.iam.token.inspect()
```

It comes with a lot of info, but if you just want the CDF projects you have access to, copy this code:

```python
[proj.url_name for proj in token_obj.projects]
```
Btw, this is a good resource for known issues with Functions: https://docs.cognite.com/cdf/functions/known_issues
Without touching on the issue of whether the plotly package is popular enough to warrant default installation for every single user of CDF, here’s a tip to make custom installs less of a hassle! 😄

Instead of importing micropip and using it to install, just invoke it directly:

%pip install plotly nbformat
As the error message alludes to, most of the file system is read-only, but you do have write access to the /tmp directory.

After checking the pandas documentation, I see that the .to_csv method also accepts a buffer, so you may do everything in memory and then use client.files.upload_bytes.
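A minimal in-memory sketch; the file name and external ID are made up:

```python
# Serialize the dataframe to CSV in memory, then upload the bytes directly:
csv_bytes = df.to_csv(index=False).encode("utf-8")
client.files.upload_bytes(
    csv_bytes,
    name="my_export.csv",          # hypothetical file name
    external_id="my-export-csv",   # hypothetical external ID
)
```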
From your posted error message, the CDF project setting still appears to be wrong (it says undefined, while I would expect it to be something like company-prod or company-dev). Please verify your connection details. For instance, if you read some parameters from a config file or an environment variable, double-check the values of these.
I am unable to reproduce this error. Could you please run the following?

```python
print(client.version)
```

From the “invalid URL” in the error message I can see that the Python SDK thinks your CDF project = “undefined”. Could you please verify your connection details? For instance, by checking the token/inspect endpoint:

```python
my_projects = client.iam.token.inspect().projects
print([p.url_name for p in my_projects])
```
In Cognite Functions you still have access to some disk space under /tmp; use for example tempfile.TemporaryDirectory, then you don’t have to worry about where this location is.

You may also, as you say, load it directly into memory, but this can be a memory issue for really large files (then do as @HaydenH suggests and store to disk first):

```python
import pandas as pd
from io import BytesIO

df = pd.read_csv(
    BytesIO(
        client.files.download_bytes(external_id="foo")
    )
)
```
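And a minimal sketch of the /tmp route, assuming a recent SDK version with download_to_path; the external ID and local file name are made up:

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")  # hypothetical local file name
    client.files.download_to_path(path=path, external_id="foo")
    df = pd.read_csv(path)
# The temporary directory (and the file in it) is removed when the block exits
```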
Starting with version 6.6.1 of the SDK, some convenience methods were added to CogniteClient that I suspect might give you an easier way to instantiate a client:

```python
from cognite.client import CogniteClient

client = CogniteClient.default_oauth_client_credentials(
    project="rok-buandcollaborators-53",
    cdf_cluster="westeurope-1",
    tenant_id="0abcdef...",
    client_id="d9b2bd26...",
    client_secret="...",
    client_name="Who am I?",
)
# For easy interactive login, check out:
CogniteClient.default_oauth_interactive
```
“I could not iterate through the i.timeseries and read the datapoint from a very specific timeseries object named 'PVOL'.”

That’s because .time_series is a method, not a property (i.e. you need to call it: .time_series()).

Another tip: this use case is so common that the SDK has several helper methods; here’s a suggestion assuming you know the identifier of your root asset:

```python
root_asset = client.assets.retrieve(external_id="my-root")
all_time_series = root_asset.subtree().time_series()
```

The nice part of this is that the final TimeSeriesList will be de-duplicated automatically for you.
Assuming you have the ID of your data set, it can be specified on asset instantiation:

```python
a1 = Asset(name=.., data_set_id=123)
```

Another tip is the create_hierarchy method, which simplifies the creation of asset hierarchies (it will insert in the correct topological order for you). It also supports upserting 👌
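A minimal sketch of create_hierarchy; the external IDs, names and data set ID are made up:

```python
from cognite.client.data_classes import Asset

root = Asset(external_id="plant", name="Plant", data_set_id=123)
child = Asset(
    external_id="pump-01", name="Pump 01", parent_external_id="plant", data_set_id=123
)

# Inserts in correct topological order (parents before children, regardless of list order);
# upsert=True updates assets that already exist instead of failing.
client.assets.create_hierarchy([child, root], upsert=True, upsert_mode="patch")
```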
In @erlend.vollset’s code above, ts_data is an actual pandas DataFrame, which allows the later pandas “data wrangling” code to succeed (in the test). I suspect one of your variables named “df_*” is still a mock.

(Note that Erlend sets the return value to a TimeSeriesList. The later .get call then uses the actual get method on that object (i.e. not some mocked method). The same goes for the .to_pandas call after that.)
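A small sketch of the pattern being described (not Erlend’s exact code; the method being mocked and the time series contents are made up):

```python
from unittest.mock import MagicMock

from cognite.client.data_classes import TimeSeries, TimeSeriesList

# Mock only the SDK call itself, but let it return a real TimeSeriesList,
# so that .get(...) and .to_pandas() run the real (non-mocked) implementations.
client = MagicMock()
client.time_series.retrieve_multiple.return_value = TimeSeriesList(
    [TimeSeries(id=1, external_id="ts-1", name="TS 1")]
)

ts_list = client.time_series.retrieve_multiple(external_ids=["ts-1"])
df_ts = ts_list.to_pandas()  # a real DataFrame, not a MagicMock
assert not isinstance(df_ts, MagicMock)
```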
Hi @eashwar11 ! It seems you are running with an older / outdated version of the Cognite Python SDK! Try upgrading and the problem should go away! 😊 Edit: @roman.chesnokov beat me to it