
I am doing some hands-on exercises, but I am stuck at one point.

The task is to add some time series data:

- Create a time series object for each country asset in CDF called <country>_population and associate it with its corresponding country asset.

  •   Remember to associate the data as well to the data set that you created.
  •   As an example, the time series for Aruba would be called Aruba_population.

- Load the data from populations_postprocessed.csv into a pandas dataframe.
- Insert the data for each country in this dataframe using client.time_series.data.insert_dataframe.
- As a check, retrieve the latest population data for the countries of Latvia, Guatemala, and Benin.
- Calculate the total population of the Europe region using the Asset Hierarchy and the time series data.

The point I am stuck on:

=> “Insert the data for each country in this dataframe using client.time_series.data.insert_dataframe”.

 

I am not sure what this means, although I did an earlier exercise …

Insert the population data as a dataframe:
client.time_series.data.insert_dataframe(df, external_id_headers=False)

 

Could you explain to me how all of this works, please?

Hi! Sorry for the late reply.

I am not familiar with the exercise that you are doing, but I can try to explain how the `insert_dataframe` function behaves. It accepts a pandas data frame as the first argument. To create a pandas data frame from a csv file, you do the following:

import pandas as pd

df = pd.read_csv("population.csv")

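To insert a data frame into CDF as time series data, the index must be a date time index (a pandas `DatetimeIndex`), and the column names must be the external ids of time series that already exist in CDF. Each column is written to the time series it is named after, with one data point per row.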
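As a rough illustration of those two requirements, here is a small local check you could run before inserting. The helper name `check_insertable` is hypothetical (it is not part of the SDK), and it only inspects the shape of the data frame. It cannot tell whether the time series actually exist in CDF.

```python
import pandas as pd


def check_insertable(df):
    """Hypothetical helper: list shape problems; an empty list means it looks fine."""
    problems = []
    # The row labels must be timestamps.
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    # Each column name is used as the external id of a time series.
    if not all(isinstance(col, str) for col in df.columns):
        problems.append("column names are not all strings")
    return problems


good = pd.DataFrame(
    {"population_in_norway": [1, 2]},
    index=pd.to_datetime(["1960", "1961"], format="%Y"),
)
bad = pd.DataFrame({"Value": [1, 2]})  # default integer index

print(check_insertable(good))
print(check_insertable(bad))
```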

Example

Let’s say that `population.csv` contains this:

      Country Name Country Code  Year     Value
0            Aruba          ABW  1960     54608
1            Aruba          ABW  1961     55811
2            Aruba          ABW  1962     56682
3            Aruba          ABW  1963     57475
4            Aruba          ABW  1964     58178
...            ...          ...   ...       ...
16395     Zimbabwe          ZWE  2017  14751101
16396     Zimbabwe          ZWE  2018  15052184
16397     Zimbabwe          ZWE  2019  15354608
16398     Zimbabwe          ZWE  2020  15669666
16399     Zimbabwe          ZWE  2021  15993524

[16400 rows x 4 columns]

If we want to create a time series that shows the population in Norway, we can do the following:

import pandas as pd
from cognite.client.data_classes import TimeSeries

# `client` is assumed to be an already authenticated CogniteClient.

df = pd.read_csv("population.csv")
# Extract only the rows for Norway (copy so we can add a column safely)
df = df[df["Country Name"] == "Norway"].copy()

# Convert the index to a date time index. We read the time from the Year column.
df.index = pd.to_datetime(df["Year"], format="%Y")

# Create a time series
client.time_series.create(
    TimeSeries(external_id="population_in_norway")
)

# Copy the values into a column named after the external id of our time series
df["population_in_norway"] = df["Value"]

# Remove the other columns
df = df[["population_in_norway"]]

# Insert the rows as data points
client.time_series.data.insert_dataframe(df)

# You can fetch the same data as a data frame like this:
client.time_series.data.retrieve_dataframe(
    external_id="population_in_norway"
)
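For the exercise itself, where every country gets its own time series, the same idea can be applied to all countries at once by pivoting the table so that each country becomes one column. The following is only a sketch under some assumptions: the column names match the sample table above, the time series are named `<country>_population` as in the task description, and the helper name `to_wide` is made up for illustration. It uses a few rows from the sample table instead of reading the CSV file, so it runs without CDF access.

```python
import pandas as pd

# A few rows copied from the sample table above, standing in for the full CSV.
df = pd.DataFrame(
    {
        "Country Name": ["Aruba", "Aruba", "Zimbabwe", "Zimbabwe"],
        "Country Code": ["ABW", "ABW", "ZWE", "ZWE"],
        "Year": [1960, 1961, 2020, 2021],
        "Value": [54608, 55811, 15669666, 15993524],
    }
)


def to_wide(df, suffix="_population"):
    """Hypothetical helper: one column per country, one row per year."""
    wide = df.pivot(index="Year", columns="Country Name", values="Value")
    # Turn the integer years into timestamps (1 January of each year).
    wide.index = pd.to_datetime(wide.index.astype(str), format="%Y")
    # Name each column after the external id of the matching time series.
    wide.columns = [f"{name}{suffix}" for name in wide.columns]
    return wide


wide = to_wide(df)
print(wide)

# With an authenticated client, and the time series already created in CDF:
# client.time_series.data.insert_dataframe(wide)
```

Years that are missing for a country show up as NaN in the wide table. The SDK documentation linked below describes how `insert_dataframe` treats such values, so it is worth checking that for the SDK version you use.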

EDIT: There is more information in the Python SDK documentation: https://cognite-sdk-python.readthedocs-hosted.com/en/latest/time_series.html#insert-pandas-dataframe

