Hi everyone,
We are running a series where we challenge the community to model a particular problem in CDF. The aim of this series is to facilitate discussion and invite community members to share interesting solutions and techniques.
In this second entry of the series, we will focus on the SpatioTemporal Asset Catalog (STAC).
The SpatioTemporal Asset Catalog (STAC) specification provides a common language to describe a range of geospatial information, so it can more easily be indexed and discovered. A 'spatiotemporal asset' is any file that represents information about the earth captured in a certain space and time.
The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point Clouds, Data Cubes, Full Motion Video, etc) to expose their data as SpatioTemporal Asset Catalogs (STAC), so that new code doesn't need to be written whenever a new data set or API is released.
(source)
At C4IR Ocean, we primarily deal with geospatial data. The data originates from multiple providers and covers the whole globe. The ability to query data in space and time across providers and datasets is crucial. This could be achieved with a single geospatial database; however, at the volumes we expect to work with (petabyte-scale), that approach becomes infeasible.
The SpatioTemporal Asset Catalog is a very good candidate to solve this. It is a data catalog that returns references to datasets, which can then be downloaded and processed. Coupled with an integrated processing and/or analysis environment, such as a managed notebook server, this becomes very powerful.
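To make that workflow concrete, here is a minimal sketch of what querying a STAC catalog from a notebook could look like, using the open-source pystac-client library against Microsoft's Planetary Computer STAC API (discussed further below). The collection id, bounding box, and date range are illustrative example values, method names may differ slightly between pystac-client versions, and Planetary Computer asset URLs may additionally require signing before download.

```python
# A minimal sketch: query a public STAC API in space and time, then inspect
# the asset references it returns. Requires: pip install pystac-client
from pystac_client import Client

# Planetary Computer's public STAC endpoint; any STAC API works here.
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Illustrative search: Sentinel-2 scenes over a small area in the North Sea
# for one month. Collection id, bbox, and datetime are example values.
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[2.0, 56.0, 3.0, 57.0],          # [min_lon, min_lat, max_lon, max_lat]
    datetime="2021-06-01/2021-06-30",
    max_items=5,
)

for item in search.items():
    # Each Item carries its own geometry, timestamp, and asset references;
    # the asset hrefs are what you would download and process further.
    print(item.id, item.datetime)
    for key, asset in item.assets.items():
        print(f"  {key}: {asset.href}")
```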
STAC consists of three base resource types (a rough code sketch follows the list below):
- Item - a GeoJSON-like Feature with additional fields for attributes such as time, plus links to assets.
- Catalog - a basic structure for linking STAC Items together.
- Collection - an extended Catalog with additional information about the space/time coverage of its Items and sub-collections.
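To show how the three types fit together, here is a rough sketch using the pystac reference library; all ids, geometries, dates, and URLs are made-up example values.

```python
# A rough sketch of the three STAC base types using pystac (pip install pystac).
# All ids, geometries, dates, and URLs are example values.
from datetime import datetime
import pystac

# Item: a GeoJSON-like Feature with a timestamp and links to assets.
item = pystac.Item(
    id="example-scene-20210601",
    geometry={
        "type": "Polygon",
        "coordinates": [[[2.0, 56.0], [3.0, 56.0], [3.0, 57.0], [2.0, 57.0], [2.0, 56.0]]],
    },
    bbox=[2.0, 56.0, 3.0, 57.0],
    datetime=datetime(2021, 6, 1),
    properties={},
)
item.add_asset(
    "data",
    pystac.Asset(
        href="https://example.com/scenes/example-scene-20210601.tif",
        media_type=pystac.MediaType.COG,
    ),
)

# Collection: an extended Catalog that also describes space/time coverage.
collection = pystac.Collection(
    id="example-collection",
    description="Example collection of scenes",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[2.0, 56.0, 3.0, 57.0]]),
        temporal=pystac.TemporalExtent([[datetime(2021, 6, 1), None]]),
    ),
)
collection.add_item(item)

# Catalog: the basic structure that links Items and Collections together.
catalog = pystac.Catalog(id="example-catalog", description="Example root catalog")
catalog.add_child(collection)
```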
In addition to the base types, the STAC specification supports extensions, enabling users to add features such as datacube information (data dimension metadata), alternative geographic reference systems, domain-specific information such as electric grid data, and much more.
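As an example of what an extension looks like on the wire, the fragment below sketches datacube-style dimension metadata added to an Item as extra namespaced properties; the schema URL and field values are illustrative, so check the extension's own documentation for the exact shape.

```python
# Extensions add namespaced fields on top of the base Item. This fragment
# sketches datacube-style dimension metadata; the schema URL and values are
# illustrative only.
item_fragment = {
    "stac_extensions": [
        # Declares which extension schemas the Item uses (URL is illustrative).
        "https://stac-extensions.github.io/datacube/v2.0.0/schema.json"
    ],
    "properties": {
        "datetime": "2021-06-01T00:00:00Z",
        # Extension fields live in their own namespace, here "cube:".
        "cube:dimensions": {
            "x": {"type": "spatial", "axis": "x", "extent": [2.0, 3.0]},
            "y": {"type": "spatial", "axis": "y", "extent": [56.0, 57.0]},
            "time": {"type": "temporal", "extent": ["2021-06-01T00:00:00Z", None]},
        },
    },
}
```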
Microsoft's Planetary Computer (link) has demonstrated how this can run at scale, hosting nearly 25 petabytes of data in its catalog. Their solution runs a Postgres/PostGIS implementation called pgstac together with an API layer (stac-fastapi), and hosts the data in Azure Blob Storage.
Our challenge to the CDF community is simple: how would you model the STAC specification in CDF? Which APIs would you use, and why?
We invite anyone in the community to participate in the challenge. There is no restriction on which APIs to use; both general availability APIs (v1) and playground APIs are welcome.
Keep in mind the geospatial requirements of the specification. CDF does have an experimental geospatial API, which would likely have to be combined with other resource types. We would also like to mention that CDF Files have geospatial support, which could also be helpful.
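To illustrate that last point, here is a rough sketch of attaching a geometry to a CDF File via the Python SDK's geo_location field; treat the exact class and parameter names (GeoLocation, Geometry, the geo_location argument) as assumptions to verify against your cognite-sdk version, and the file path and metadata are made up.

```python
# A rough sketch of registering a spatiotemporal asset as a CDF File with a
# geometry attached. Class/parameter names should be verified against your
# cognite-sdk version; the file path and metadata values are made up.
from cognite.client import CogniteClient
from cognite.client.data_classes import GeoLocation, Geometry

client = CogniteClient()  # assumes credentials are configured in the environment

footprint = GeoLocation(
    type="Feature",
    geometry=Geometry(
        type="Polygon",
        coordinates=[[[2.0, 56.0], [3.0, 56.0], [3.0, 57.0], [2.0, 57.0], [2.0, 56.0]]],
    ),
)

client.files.upload(
    "example-scene-20210601.tif",             # local file path (example)
    external_id="example-scene-20210601",
    name="example-scene-20210601.tif",
    geo_location=footprint,
    metadata={"datetime": "2021-06-01T00:00:00Z"},  # illustrative STAC-like metadata
)
```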
Looking forward to seeing what you come up with :)