
Conversation

@JeroenVerstraelen (Contributor) commented Nov 26, 2025

Added a docstring to the load_stac function to clarify its usage.

The docstring focuses on aspects that weren't immediately clear when reading the code for the first time, with the goal of helping users understand the function and its parameters without needing to dive into the implementation. I hope it also helps during refactoring by keeping a bird's-eye view of the function's overall inputs, outputs, and goals.

@soxofaan (Member) left a comment:

some notes

feature_flags: Optional[Dict[str, Any]] = None,
) -> GeopysparkDataCube:
"""
Load a STAC resource (Collection, Item, Catalog, or batch job results) and return
@soxofaan (Member):

Suggested change
Load a STAC resource (Collection, Item, Catalog, or batch job results) and return
Load a STAC resource (STAC Collection/Catalog, STAC Item, STAC API, or openEO batch job results) and return

@JeroenVerstraelen (Contributor, Author):

I noticed I can easily get confused by the code branches around the url parameter, especially regarding the term "STAC API". So I went deep into the specification to clarify things for myself again.

I'd like to focus on the fact that the url should always return a JSON object. We need to be very specific with the terminology we use for the different types of JSON objects we support, so that it is always crystal clear what different types of URLs there are, as this is where most of the initial branching in the code occurs. It's also what a caller of load_stac likely has the most questions about, so the docstring should clarify it.

According to the STAC spec there are three main JSON object types:

  • Catalog
  • Collection
  • Item

The STAC API documentation, meanwhile, mixes the terms "STAC API", "STAC API Catalog", and "Dynamic Catalog":

  • "A STAC API is a dynamic version of a SpatioTemporal Asset Catalog."
  • "Dynamic catalogs often also implement the STAC API specification, ... But they are not required to."

So I'd like to define these terms:

  • STAC API
    • A web service implementing at least the STAC API - Core specification.
  • STAC API Landing page
    • The JSON object returned from the root endpoint of a STAC API.
    • This JSON object:
      • Is of type STAC Catalog or STAC Collection.
      • Must contain the "conformsTo" field listing supported conformance classes.
      • Can include links to API endpoints (e.g., /search, /collections) according to conformance classes.
    • This is sometimes called a “STAC API Catalog” in documentation, but the term in the specification is "landingPage".
  • STAC Sub-catalog
    • A descendant of the root catalog following the child links.
    • Can also include a "conformsTo" field different from the root, e.g. if each sub-catalog has its own database and so its own search endpoint.
      • We should not support this until it's absolutely required for some use case.
        • Therefore a sub-catalog can never be a STAC API Catalog in our implementation.
  • STAC API Specification
    • A list of OpenAPI specification documents defining RESTful endpoints.
  • Dynamic Catalog
    • A JSON object of type STAC Catalog where the links lead to dynamically generated content.
  • Static Catalog
    • A JSON object of type STAC Catalog where the links lead to statically hosted content.
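
To make these distinctions concrete, a classification along these lines could look like the following sketch (a hypothetical helper, not code from this PR; the conformance URI prefix assumes STAC API v1.0.0):

import requests

# Hypothetical helper: classify the JSON object behind a URL using the
# definitions above. The conformance URI prefix assumes STAC API v1.0.0.
STAC_API_PREFIX = "https://api.stacspec.org/"

def classify_stac_url(url: str) -> str:
    obj = requests.get(url, timeout=60).json()
    if obj.get("type") == "Feature":
        return "STAC Item"
    # A landing page must list its conformance classes in "conformsTo".
    conforms_to = obj.get("conformsTo", [])
    if any(c.startswith(STAC_API_PREFIX) for c in conforms_to):
        return "STAC API Landing page"
    return "STAC Collection" if obj.get("type") == "Collection" else "STAC Catalog"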

With these definitions I'd like to categorize the four different types of JSON endpoints that the url parameter can point to:

  • STAC API Landing page (= STACApiCatalog)
  • STAC Item
  • STAC Collection
    • Own openEO batch job endpoint
    • Any other endpoint returning a STAC collection object.
  • STAC Catalog

Notes:

  • Currently in our implementation only a STAC Collection can be upgraded to a STACApiCatalog. We should leave room if we ever need to support STAC Catalogs too.
  • The "own openEO batch job" endpoint always points to a static STAC collection. So even if we fetch the data from the job registry here, we should classify it as a STAC Collection.
    • This avoids additional branching: any fixes we make to static STAC collections (such as intersection changes, item metadata, etc.) also apply to openEO jobs.

So written out as a class hierarchy:

stac.py

"""Adapter pattern around the pystac library: we clearly enumerate the only classes that load_stac needs to worry about."""

class StacObject:
    def from_url(url): ...
    def to_metadata() -> GeopysparkCubeMetadata: ...

class Catalog(StacObject):
    def to_item_collection(spatiotemporal_extent): ...
    def from_batch_job(dependency_job_info, batch_jobs): ...  # Is a Catalog, to support older openEO versions.

class Collection(Catalog):
    def to_item_collection(spatiotemporal_extent): ...

class Item(StacObject):
    def to_item_collection(spatiotemporal_extent): ...

class APICollection(Collection):
    def fetch_item_collection(spatiotemporal_extent, property_filter): ...

class ZarrCollection(Collection):
    """Shows how support for future inputs can be added."""
    def to_item_collection(spatiotemporal_extent): ...

def get_stac_object(url, dependency_job_info, batch_jobs) -> StacObject:
    if dependency_job_info and batch_jobs:
        return stac.Collection.from_batch_job(dependency_job_info, batch_jobs)
    stac_object = _await_stac_object(url)
    if isinstance(stac_object, pystac.Item):
        return stac.Item.from_url(url)
    elif isinstance(stac_object, pystac.Collection):
        if _supports_item_search(stac_object):
            return stac.APICollection.from_url(url)
        return stac.Collection.from_url(url)
    else:
        return stac.Catalog.from_url(url)

def construct_item_collection(stac_object, property_filter, spatiotemporal_extent) -> ItemCollection:
    if isinstance(stac_object, APICollection):
        return stac_object.fetch_item_collection(spatiotemporal_extent, property_filter)
    return stac_object.to_item_collection(spatiotemporal_extent)

def load_stac(url):
    # 1. Construct feature database
    stac_object = get_stac_object(url)
    item_collection = construct_item_collection(stac_object, property_filter, spatiotemporal_extent)
    opensearch_client = construct_opensearch_client(item_collection)

    # 2. Update metadata based on STAC information
    metadata = stac_object.to_metadata()
    metadata = item_collection.update_metadata(metadata)

    # 3. Collection metadata is complete, handle user request
    metadata = update_metadata(load_params)
    stac_bbox = item_collection.get_bbox()
    target_bbox = requested_bbox or stac_bbox
    resolution = calculate_resolution(load_params, feature_flags, target_bbox)

    # 4. Load data
    # construct pyramid_factory
    # call datacube_seq
    # return GeopysparkDataCube

it as a GeopysparkDataCube, based on filters defined in load_params and layer_properties.
:param url: STAC URL to load from. Can point to:
- **STAC Collection**:
@soxofaan (Member):

Suggested change
- **STAC Collection**:
- **STAC API**:

A single item is used if it intersects the requested spatiotemporal
extent.
- **STAC Catalog**:
@soxofaan (Member):

Suggested change
- **STAC Catalog**:
- **STAC Catalog** or **STAC Collection**:

items that intersect the requested spatiotemporal extent.
Multiple collections may contribute items to the resulting cube.
- **Dependency Job**:
@soxofaan (Member):

Suggested change
- **Dependency Job**:
- **Own openEO Batch Job**:

@JeroenVerstraelen (Contributor, Author):

That's better indeed. I think we should also rename "dependency job" in the code, because when I first saw just the name without the implementation, it confused me.

@soxofaan (Member):

Yes, I also find "dependency job" confusing terminology, but I did not address that yet, as it is the least relevant code path in this refactor.

Comment on lines +120 to +121
`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
@soxofaan (Member):

Suggested change
`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
`batch_jobs` are provided and the user owns the job, the job status
is fetched. It will wait for the job to

@JeroenVerstraelen (Contributor, Author) commented Nov 28, 2025:

I meant here that it will literally fetch the job results, which, as far as the user is concerned, should be a valid STAC catalog containing STAC items.

If we want it closer to the truth:

`batch_jobs` are provided and the user owns the job, the job result STAC items are fetched. It will wait for the job to...

I think both could work:

`batch_jobs` are provided and the user owns the job, the job result STAC collection is fetched. It will wait for the job to...

@soxofaan (Member) commented Nov 28, 2025:

it will literally fetch the job results

Maybe you don't mean it like that, but that sounds like it actually downloads all the assets at that point, which is not the case.
It just gets/polls the batch job metadata document (GET /jobs/{job_id}) to get the job status and actively waits until that status reaches finished/error. No "results" are fetched before that (there aren't any yet anyway).

Also: at the moment, the logic doesn't even assume the metadata document is a STAC collection (to be compatible with openeo<1.0 I think). The only assumption is that there are STAC assets linked from the document.
There is also no assumption about STAC items or STAC item links.
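
For illustration, the wait loop described here could look roughly like the sketch below (hedged: the endpoint and status values follow the openEO API, but the helper name, timings, and structure are made up, not the actual implementation):

import time
import requests

# Illustrative sketch of the "own job" wait: poll the batch job metadata
# document (GET /jobs/{job_id}) until the status reaches a terminal state.
def await_job_finished(base_url: str, job_id: str, max_poll_time: float = 3600.0, interval: float = 30.0) -> dict:
    deadline = time.time() + max_poll_time
    while time.time() < deadline:
        job = requests.get(f"{base_url}/jobs/{job_id}", timeout=60).json()
        status = job.get("status")
        if status == "finished":
            return job
        if status in ("error", "canceled"):
            raise RuntimeError(f"Job {job_id} ended with status {status!r}")
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish within {max_poll_time} seconds")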

`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
complete if necessary (up to a configured maximum).
Only items from the job results catalog intersecting the requested
@soxofaan (Member):

Suggested change
Only items from the job results catalog intersecting the requested
Only assets from the job results intersecting the requested

@JeroenVerstraelen (Contributor, Author):

We're currently using job.get_result_assets() and converting them to pystac.Item before checking intersection. But get_result_assets is deprecated and we should use get_result_metadata instead, and the openEO job metadata should be a valid STAC collection with STAC items.

Afaict we do intersection checks at the pystac.Item level for all other types of STAC resources, so it feels weird to say we do it at asset level here.
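
For context, the item-level check could look like this minimal sketch (item_intersects_bbox is a hypothetical name, only the spatial part is shown, and a 2D bbox is assumed; not the actual implementation):

import pystac
from shapely.geometry import box, shape

# Illustrative: spatial intersection filter at pystac.Item level.
def item_intersects_bbox(item: pystac.Item, west: float, south: float, east: float, north: float) -> bool:
    requested = box(west, south, east, north)
    if item.geometry is not None:
        return shape(item.geometry).intersects(requested)
    if item.bbox is not None:
        return box(*item.bbox).intersects(requested)  # assumes a 2D bbox
    return True  # no spatial info: keep the item rather than silently dropping it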

@soxofaan (Member):

openEO job metadata should be a valid STAC collection with STAC items.

That's unfortunately not the case in openEO below version 2.0 (which does not exist yet). From the GET /jobs/{job_id}/results spec:

It is a valid STAC Item (if it has spatial and temporal references included) or a valid STAC Collection (supported since openEO API version 1.1.0).
...
To maintain backward compatibility, it is REQUIRED to still copy all assets in the STAC catalog structure into the collection-level assets. This requirement is planned to be removed in openEO API version 2.0.0.

The fact that we convert assets to items is indeed a bit weird, but I think that is just a temporary implementation detail (because of the spec constraints mentioned above). I think the docstring should not go into too much detail here, because it's unfortunately a confusing situation.

}
}
:param batch_jobs: Optional job registry used to resolve dependency-job
@soxofaan (Member) commented Nov 27, 2025:

Suggested change
:param batch_jobs: Optional job registry used to resolve dependency-job
:param batch_jobs: Optional batch job API implementation used to resolve dependency-job

(also a bit vague, but in any case this parameter is not (just) about the job registry)

@JeroenVerstraelen (Contributor, Author):

But in the code, batch_jobs is only used to get information from the job registry:

batch_jobs.get_job_info(job_id=job_id, user_id=user_id)
batch_jobs.get_result_assets(job_id=job.id, user_id=user.user_id)

We might have to pass something smaller here, because we're passing an object with the following interface while we only need some basic metadata from the job. As far as a user of load_stac can tell, someone could call delete_job or delete_batch_process_results when passing this object (see the sketch after this list).

__init__
set_default_sentinel_hub_credentials
set_terrascope_access_token_getter
create_job
get_job_info
poll_job_dependencies
get_user_jobs
get_job_output_dir
get_job_work_dir
start_job
_start_job
_determine_container_image_from_process_graph
_schedule_and_get_dependencies
get_result_metadata
get_result_assets
_results_metadata_to_assets
get_results_metadata_path
load_results_metadata
_load_results_metadata_from_file
_get_providers
get_log_entries
cancel_job
delete_job
_delete_job
delete_batch_process_results
delete_batch_process_dependency_sources
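
A smaller interface could be expressed as a typing.Protocol, for example (hypothetical sketch: BatchJobMetadataSource does not exist in the codebase and the return types are guesses):

from typing import Protocol

class BatchJobMetadataSource(Protocol):
    """Only what load_stac actually needs; destructive methods like delete_job stay out of reach."""

    def get_job_info(self, job_id: str, user_id: str) -> dict: ...
    def get_result_assets(self, job_id: str, user_id: str) -> dict: ...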

@soxofaan (Member):

Yes, I'd also like to rework the "own job" flow a bit, but as mentioned, this is the least relevant/urgent part of the load_stac refactor, so I would just stick to documenting the current state.

:param stac_io: Optional custom STAC I/O handler (e.g., to enable
authentication or caching).
TODO: Currently never provided in any load_stac call.
@soxofaan (Member):

Suggested change
TODO: Currently never provided in any load_stac call.

It is provided from CWL use cases via custom_processes in CDSE.

@JeroenVerstraelen self-assigned this Nov 27, 2025