add docstring to load_stac function #1434
base: master
Conversation
soxofaan left a comment
some notes
```python
    feature_flags: Optional[Dict[str, Any]] = None,
) -> GeopysparkDataCube:
    """
    Load a STAC resource (Collection, Item, Catalog, or batch job results) and return
```
Suggested change:
```diff
- Load a STAC resource (Collection, Item, Catalog, or batch job results) and return
+ Load a STAC resource (STAC Collection/Catalog, STAC Item, STAC API, or openEO batch job results) and return
```
I noticed that I can easily get confused by the code branches around the `url` parameter, especially with the term "STAC API". So I went deep into the specification to clarify things for myself again.
I'd like to focus on the fact that the `url` should always return a JSON object. We need to be very specific about the terminology we use for the different types of JSON objects we support.
That way it is always crystal clear to us what the different types of URLs are, as this is where most of the initial branching of the code occurs. It's also what the caller of the `load_stac` function likely has the most questions about, so the docstring should clarify it.
According to the STAC spec there are three main JSON object types:
- Catalog
- Collection
- Item
Then, within the documentation of the STAC API, the terms "STAC API", "STAC API Catalog", and "Dynamic Catalog" are mixed:

> "A STAC API is a dynamic version of a SpatioTemporal Asset Catalog."

> "Dynamic catalogs often also implement the STAC API specification, ... But they are not required to."
So I'd like to define these terms:
- STAC API
  - A web service implementing at least the STAC API - Core specification.
- STAC API Landing page
  - The JSON object returned from the root endpoint of a STAC API.
  - This JSON object:
    - Is of type STAC Catalog or STAC Collection.
    - Must contain the "conformsTo" field listing supported conformance classes (see the sketch after this list).
    - Can include links to API endpoints (e.g., /search, /collections) according to conformance classes.
  - This is sometimes called a "STAC API Catalog" in documentation, but the term in the specification is "landingPage".
- STAC Sub-catalog
  - A descendant of the root catalog following the "child" links.
  - Can also include a "conformsTo" field different from the root's, e.g. if each sub-catalog has its own database and so its own search endpoint.
    - We should not support this until it's absolutely required for some use case.
    - Therefore a sub-catalog can never be a STAC API Catalog in our implementation.
- STAC API Specification
  - A list of OpenAPI specification documents defining RESTful endpoints.
- Dynamic Catalog
  - A JSON object of type STAC Catalog where the links lead to dynamically generated content.
- Static Catalog
  - A JSON object of type STAC Catalog where the links lead to statically hosted content.
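As a minimal sketch of the conformance check this implies (the name matches the `_supports_item_search` helper used further below, but the field handling and the conformance-class matching are assumptions for illustration, not the project's actual code):

```python
import pystac

def _supports_item_search(stac_object: pystac.STACObject) -> bool:
    """Treat a fetched object as a STAC API landing page only when its
    "conformsTo" field advertises an item-search conformance class."""
    conforms_to = stac_object.to_dict().get("conformsTo", [])
    # e.g. "https://api.stacspec.org/v1.0.0/item-search"
    return any("item-search" in conformance for conformance in conforms_to)
```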
With these definitions I'd like to categorize the four different types of JSON endpoints that the url parameter can point to:
- STAC API Landing page (= STACApiCatalog)
- STAC Item
- STAC Collection
  - Own openEO batch job endpoint
  - Any other endpoint returning a STAC Collection object.
- STAC Catalog
Notes:
- Currently in our implementation only a STAC Collection can be upgraded to a STACApiCatalog. We should leave room in case we ever need to support STAC Catalogs too.
- The "own openEO batch job" endpoint always points to a static STAC collection. So even if we fetch the data from the job registry here, we should classify it as a STAC collection.
  - This avoids additional branching: any fixes we make to static STAC collections (such as intersection changes, item metadata, etc.) also apply to openEO jobs.
So written out as a class hierarchy:
stac.py:

```python
"""Adapter pattern around the pystac library: we clearly enumerate the only
classes that load_stac needs to worry about."""

class StacObject:
    def from_url(url): ...
    def to_metadata() -> GeopysparkCubeMetadata: ...

class Catalog(StacObject):
    def to_item_collection(spatiotemporal_extent): ...
    def from_batch_job(dependency_job_info, batch_jobs): ...  # Is a Catalog to support older openEO versions.

class Collection(Catalog):
    def to_item_collection(spatiotemporal_extent): ...

class Item(StacObject):
    def to_item_collection(spatiotemporal_extent): ...

class APICollection(Collection):
    def fetch_item_collection(spatiotemporal_extent, property_filter): ...

class ZarrCollection(Collection):
    """Shows how support for future inputs can be added."""
    def to_item_collection(spatiotemporal_extent): ...
```
```python
def get_stac_object(url, dependency_job_info, batch_jobs) -> StacObject:
    if dependency_job_info and batch_jobs:
        collection = stac.Collection.from_batch_job(dependency_job_info, batch_jobs)
        return collection
    stac_object = _await_stac_object(url)
    if isinstance(stac_object, pystac.Item):
        item = stac.Item.from_url(url)
        return item
    elif isinstance(stac_object, pystac.Collection):
        if _supports_item_search(stac_object):
            api_collection = stac.APICollection.from_url(url)
            return api_collection
        collection = stac.Collection.from_url(url)
        return collection
    else:
        catalog = stac.Catalog.from_url(url)
        return catalog

def construct_item_collection(stac_object, property_filter, spatiotemporal_extent) -> ItemCollection:
    if isinstance(stac_object, APICollection):
        return stac_object.fetch_item_collection(spatiotemporal_extent, property_filter)
    return stac_object.to_item_collection(spatiotemporal_extent)
```
```python
def load_stac(url):
    # 1. Construct feature database
    stac_object = get_stac_object(url)
    item_collection = construct_item_collection(stac_object, property_filter, spatiotemporal_extent)
    opensearch_client = construct_opensearch_client(item_collection)

    # 2. Update metadata based on STAC information
    metadata = stac_object.to_metadata()
    metadata = item_collection.update_metadata(metadata)

    # 3. Collection metadata is complete, handle user request
    metadata = update_metadata(load_params)
    stac_bbox = item_collection.get_bbox()
    target_bbox = requested_bbox or stac_bbox
    resolution = calculate_resolution(load_params, feature_flags, target_bbox)

    # 4. Load data
    # construct pyramid_factory
    # call datacube_seq
    # return GeopysparkDataCube
```
```
it as a GeopysparkDataCube. Based on filters defined in load_params and layer_properties.
:param url: STAC URL to load from. Can point to:
- **STAC Collection**:
```
Suggested change:
```diff
- - **STAC Collection**:
+ - **STAC API**:
```
```
A single item is used if it intersects the requested spatiotemporal
extent.
- **STAC Catalog**:
```
Suggested change:
```diff
- - **STAC Catalog**:
+ - **STAC Catalog** or **STAC Collection**:
```
```
items that intersect the requested spatiotemporal extent.
Multiple collections may contribute items to the resulting cube.
- **Dependency Job**:
```
Suggested change:
```diff
- - **Dependency Job**:
+ - **Own openEO Batch Job**:
```
That's better indeed. I think we should also rename "dependency job" in the code, because when I first saw just the name without the implementation, it confused me.
yes, I also find "dependency job" confusing terminology, but did not address that yet as it is the least relevant code path in this refactor
```
`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
```
Suggested change:
```diff
- `batch_jobs` are provided and the user owns the job, the job results
- catalog is fetched. While fetching, it will wait for the job to
+ `batch_jobs` are provided and the user owns the job, the job status
+ is fetched. It will wait for the job to
```
I meant here that it will literally fetch the job results, which, as far as the user is concerned, should be a valid STAC catalog containing STAC items.
If we want it closer to the truth:
> `batch_jobs` are provided and the user owns the job, the job result STAC items are fetched. It will wait for the job to...

I think both could work:
> `batch_jobs` are provided and the user owns the job, the job result STAC collection is fetched. It will wait for the job to...
> it will literally fetch the job results

Maybe you don't mean it like that, but that sounds like it actually downloads all the assets at that point, which is not the case.
It just gets/polls the batch job metadata document (GET /jobs/{job_id}) to get the job status and actively waits until that status reaches finished/error. No "results" are fetched before that (there aren't any yet anyway). A rough sketch of that polling behavior follows below.
Also: at the moment, the logic doesn't even assume the metadata document is a STAC collection (to be compatible with openeo<1.0, I think). The only assumption is that there are STAC assets linked from the document. There is also no assumption about STAC items or STAC item links.
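A rough sketch of that polling loop, assuming `get_job_info` returns an object with a `status` field (the function name, status values, and timeouts here are illustrative assumptions, not the actual implementation):

```python
import time

def _await_job_finished(batch_jobs, job_id: str, user_id: str,
                        max_poll_time: float = 30 * 60, poll_interval: float = 60) -> None:
    # Poll the batch job metadata document until the job reaches a terminal
    # status, or give up after the configured maximum poll time.
    deadline = time.time() + max_poll_time
    while time.time() < deadline:
        status = batch_jobs.get_job_info(job_id=job_id, user_id=user_id).status
        if status == "finished":
            return
        if status in ("error", "canceled"):
            raise ValueError(f"Job {job_id} ended with status {status!r}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {max_poll_time} seconds")
```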
```
`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
complete if necessary (up to a configured maximum).
Only items from the job results catalog intersecting the requested
```
Suggested change:
```diff
- Only items from the job results catalog intersecting the requested
+ Only assets from the job results intersecting the requested
```
We're currently using job.get_result_assets() and converting the assets to pystac.Item before checking intersection. But get_result_assets is deprecated and we should use get_result_metadata instead, and the openEO job metadata should be a valid STAC collection with STAC items.
Afaict we do intersection checks at the pystac.Item level for all other types of STAC resources, so it feels weird to say we do it at the asset level here.
> openEO job metadata should be a valid STAC collection with STAC items.

That's unfortunately not the case in openEO below version 2.0 (which does not exist yet).
From the GET /jobs/{job_id}/results spec:

> It is a valid STAC Item (if it has spatial and temporal references included) or a valid STAC Collection (supported since openEO API version 1.1.0).
> ...
> To maintain backward compatibility, it is REQUIRED to still copy all assets in the STAC catalog structure into the collection-level assets. This requirement is planned to be removed in openEO API version 2.0.0.

The fact that we convert assets to items is indeed a bit weird, but I think that is just a temporary implementation detail (because of the spec constraints mentioned above). I think the docstring should not go into too much detail here, because it's unfortunately a confusing situation.
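For illustration, that temporary asset-to-item conversion could look roughly like this (a hypothetical sketch, not the actual implementation; the bbox/geometry/timestamp arguments are assumed to come from the job metadata document):

```python
import datetime as dt
import pystac

def _assets_to_items(assets: dict, bbox, geometry, timestamp: dt.datetime) -> list:
    # Wrap each collection-level asset in a minimal pystac.Item so the usual
    # item-level intersection checks can be reused.
    items = []
    for name, asset_dict in assets.items():
        item = pystac.Item(id=name, geometry=geometry, bbox=bbox,
                           datetime=timestamp, properties={})
        item.add_asset(name, pystac.Asset.from_dict(asset_dict))
        items.append(item)
    return items
```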
```
}
}
:param batch_jobs: Optional job registry used to resolve dependency-job
```
Suggested change:
```diff
- :param batch_jobs: Optional job registry used to resolve dependency-job
+ :param batch_jobs: Optional batch job API implementation used to resolve dependency-job
```
(also a bit vague, but in any case this parameter is not (just) about the job registry)
But in the code, batch_jobs is only used to get information from the job registry:

```python
batch_jobs.get_job_info(job_id=job_id, user_id=user_id)
batch_jobs.get_result_assets(job_id=job.id, user_id=user.user_id)
```

We might have to pass something smaller here, because we're passing an object with the following interface while we only need some basic metadata from the job. As far as a user of load_stac can tell, someone could call delete_job or delete_batch_process_results through the object being passed (a narrower interface is sketched after this list):
```
__init__
set_default_sentinel_hub_credentials
set_terrascope_access_token_getter
create_job
get_job_info
poll_job_dependencies
get_user_jobs
get_job_output_dir
get_job_work_dir
start_job
_start_job
_determine_container_image_from_process_graph
_schedule_and_get_dependencies
get_result_metadata
get_result_assets
_results_metadata_to_assets
get_results_metadata_path
load_results_metadata
_load_results_metadata_from_file
_get_providers
get_log_entries
cancel_job
delete_job
_delete_job
delete_batch_process_results
delete_batch_process_dependency_sources
```
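A hypothetical sketch of that "something smaller": a protocol exposing only the two calls load_stac actually makes (the name JobMetadataSource and the return types are assumptions for illustration):

```python
from typing import Any, Dict, Protocol

class JobMetadataSource(Protocol):
    # The only two capabilities load_stac needs from `batch_jobs`.
    def get_job_info(self, job_id: str, user_id: str) -> Any: ...
    def get_result_assets(self, job_id: str, user_id: str) -> Dict[str, dict]: ...
```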
yes, I'd also like to rework the "own job" flow a bit,
but as mentioned, this is the least relevant/urgent part of the load_stac refactor, so I would just stick to documenting the current state
```
:param stac_io: Optional custom STAC I/O handler (e.g., to enable
    authentication or caching).
    TODO: Currently never provided in any load_stac call.
```
Suggested change:
```diff
- TODO: Currently never provided in any load_stac call.
```
It is provided from CWL use cases via custom_processes in CDSE.
Added a docstring to the load_stac function to clarify its usage.
The docstring focuses on aspects that weren't immediately clear when reading the code for the first time, with the goal of helping users understand the function and its parameters without needing to dive into the implementation. I hope it can also help during refactoring to keep a bird's-eye view of the total inputs, outputs, and goals of the function.