add docstring to load_stac function #1434
base: master
Conversation
soxofaan left a comment
some notes
```python
    feature_flags: Optional[Dict[str, Any]] = None,
) -> GeopysparkDataCube:
    """
    Load a STAC resource (Collection, Item, Catalog, or batch job results) and return
```
Suggested change:
```diff
- Load a STAC resource (Collection, Item, Catalog, or batch job results) and return
+ Load a STAC resource (STAC Collection/Catalog, STAC Item, STAC API, or openEO batch job results) and return
```
I noticed that I can easily get confused by the code branches around the `url` parameter, especially with the term "STAC API". So I went deep into the specification to clarify things for myself again.
I'd like to focus on the fact that the `url` should always return a JSON object. We need to be very specific about the terminology we use for the different types of JSON objects we support.
That way it is always crystal clear to us what the different types of URLs are, as this is where most of the initial branching of the code occurs. It's also what the caller of the `load_stac` function likely has the most questions about, so the docstring should clarify it.
According to the STAC spec there are three main JSON object types:
- Catalog
- Collection
- Item
Then, within the documentation of the STAC API, the terms "STAC API", "STAC API Catalog", and "Dynamic Catalog" are mixed:

> "A STAC API is a dynamic version of a SpatioTemporal Asset Catalog."

> "Dynamic catalogs often also implement the STAC API specification, ... But they are not required to."
So I'd like to define these terms:
- STAC API
  - A web service implementing at least the STAC API - Core specification.
- STAC API Landing page
  - The JSON object returned from the root endpoint of a STAC API.
  - This JSON object:
    - Is of type STAC Catalog or STAC Collection.
    - Must contain the "conformsTo" field listing supported conformance classes (see the sketch after this list).
    - Can include links to API endpoints (e.g., /search, /collections) according to conformance classes.
  - This is sometimes called a "STAC API Catalog" in documentation, but the term in the specification is "landingPage".
- STAC Sub-catalog
  - A descendant of the root catalog following the "child" links.
  - Can also include a "conformsTo" field different from the root's, e.g. if each sub-catalog has its own database and so its own search endpoint.
    - We should not support this until it's absolutely required for some use case.
    - Therefore a sub-catalog can never be a STAC API Catalog in our implementation.
- STAC API Specification
  - A list of OpenAPI specification documents defining RESTful endpoints.
- Dynamic Catalog
  - A JSON object of type STAC Catalog where the links lead to dynamically generated content.
- Static Catalog
  - A JSON object of type STAC Catalog where the links lead to statically hosted content.
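As a minimal sketch of the conformance check this implies (the name matches the `_supports_item_search` helper used further below, but the field handling and the conformance-class matching are assumptions for illustration, not the project's actual code):

```python
import pystac

def _supports_item_search(stac_object: pystac.STACObject) -> bool:
    """Treat a fetched object as a STAC API landing page only when its
    "conformsTo" field advertises an item-search conformance class."""
    conforms_to = stac_object.to_dict().get("conformsTo", [])
    # e.g. "https://api.stacspec.org/v1.0.0/item-search"
    return any("item-search" in conformance for conformance in conforms_to)
```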
With these definitions I'd like to categorize the four different types of JSON endpoints that the url parameter can point to:
- STAC API Landing page (= STACApiCatalog)
- STAC Item
- STAC Collection
  - Own openEO batch job endpoint
  - Any other endpoint returning a STAC Collection object.
- STAC Catalog
Notes:
- Currently in our implementation only a STAC Collection can be upgraded to a STACApiCatalog. We should leave room in case we ever need to support STAC Catalogs too.
- The "own openEO batch job" endpoint always points to a static STAC collection. So even if we fetch the data from the job registry here, we should classify it as a STAC collection.
  - This avoids additional branching: any fixes we make to static STAC collections (such as intersection changes, item metadata, etc.) also apply to openEO jobs.
So written out as a class hierarchy:
stac.py:

```python
"""Adapter pattern around the pystac library: we clearly enumerate the only
classes that load_stac needs to worry about."""

class StacObject:
    def from_url(url): ...
    def to_metadata() -> GeopysparkCubeMetadata: ...

class Catalog(StacObject):
    def to_item_collection(spatiotemporal_extent): ...
    def from_batch_job(dependency_job_info, batch_jobs): ...  # Is a Catalog to support older openEO versions.

class Collection(Catalog):
    def to_item_collection(spatiotemporal_extent): ...

class Item(StacObject):
    def to_item_collection(spatiotemporal_extent): ...

class APICollection(Collection):
    def fetch_item_collection(spatiotemporal_extent, property_filter): ...

class ZarrCollection(Collection):
    """Shows how support for future inputs can be added."""
    def to_item_collection(spatiotemporal_extent): ...
```
```python
def get_stac_object(url, dependency_job_info, batch_jobs) -> StacObject:
    if dependency_job_info and batch_jobs:
        collection = stac.Collection.from_batch_job(dependency_job_info, batch_jobs)
        return collection
    stac_object = _await_stac_object(url)
    if isinstance(stac_object, pystac.Item):
        item = stac.Item.from_url(url)
        return item
    elif isinstance(stac_object, pystac.Collection):
        if _supports_item_search(stac_object):
            api_collection = stac.APICollection.from_url(url)
            return api_collection
        collection = stac.Collection.from_url(url)
        return collection
    else:
        catalog = stac.Catalog.from_url(url)
        return catalog

def construct_item_collection(stac_object, property_filter, spatiotemporal_extent) -> ItemCollection:
    if isinstance(stac_object, APICollection):
        return stac_object.fetch_item_collection(spatiotemporal_extent, property_filter)
    return stac_object.to_item_collection(spatiotemporal_extent)
```
```python
def load_stac(url):
    # 1. Construct feature database
    stac_object = get_stac_object(url)
    item_collection = construct_item_collection(stac_object, property_filter, spatiotemporal_extent)
    opensearch_client = construct_opensearch_client(item_collection)

    # 2. Update metadata based on STAC information
    metadata = stac_object.to_metadata()
    metadata = item_collection.update_metadata(metadata)

    # 3. Collection metadata is complete, handle user request
    metadata = update_metadata(load_params)
    stac_bbox = item_collection.get_bbox()
    target_bbox = requested_bbox or stac_bbox
    resolution = calculate_resolution(load_params, feature_flags, target_bbox)

    # 4. Load data
    # construct pyramid_factory
    # call datacube_seq
    # return GeopysparkDataCube
```
```
it as a GeopysparkDataCube. Based on filters defined in load_params and layer_properties.
:param url: STAC URL to load from. Can point to:
- **STAC Collection**:
```
Suggested change:
```diff
- - **STAC Collection**:
+ - **STAC API**:
```
```
A single item is used if it intersects the requested spatiotemporal
extent.
- **STAC Catalog**:
```
Suggested change:
```diff
- - **STAC Catalog**:
+ - **STAC Catalog** or **STAC Collection**:
```
```
items that intersect the requested spatiotemporal extent.
Multiple collections may contribute items to the resulting cube.
- **Dependency Job**:
```
Suggested change:
```diff
- - **Dependency Job**:
+ - **Own openEO Batch Job**:
```
That's better indeed. I think we should also rename "dependency job" in the code, because when I first saw just the name without the implementation, it confused me.
yes, I also find "dependency job" confusing terminology, but did not address that yet as it is the least relevant code path in this refactor
```
`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
```
Suggested change:
```diff
- `batch_jobs` are provided and the user owns the job, the job results
- catalog is fetched. While fetching, it will wait for the job to
+ `batch_jobs` are provided and the user owns the job, the job status
+ is fetched. It will wait for the job to
```
I meant here that it will literally fetch the job results, which, as far as the user is concerned, should be a valid STAC catalog containing STAC items.
If we want it closer to the truth:
> `batch_jobs` are provided and the user owns the job, the job result STAC items are fetched. It will wait for the job to...

I think both could work:
> `batch_jobs` are provided and the user owns the job, the job result STAC collection is fetched. It will wait for the job to...
> it will literally fetch the job results

Maybe you don't mean it like that, but that sounds like it actually downloads all the assets at that point, which is not the case.
It just gets/polls the batch job metadata document (GET /jobs/{job_id}) to get the job status and actively waits until that status reaches finished/error. No "results" are fetched before that (there aren't any yet anyway). A rough sketch of that polling behavior follows below.
Also: at the moment, the logic doesn't even assume the metadata document is a STAC collection (to be compatible with openeo<1.0, I think). The only assumption is that there are STAC assets linked from the document. There is also no assumption about STAC items or STAC item links.
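A rough sketch of that polling loop, assuming `get_job_info` returns an object with a `status` field (the function name, status values, and timeouts here are illustrative assumptions, not the actual implementation):

```python
import time

def _await_job_finished(batch_jobs, job_id: str, user_id: str,
                        max_poll_time: float = 30 * 60, poll_interval: float = 60) -> None:
    # Poll the batch job metadata document until the job reaches a terminal
    # status, or give up after the configured maximum poll time.
    deadline = time.time() + max_poll_time
    while time.time() < deadline:
        status = batch_jobs.get_job_info(job_id=job_id, user_id=user_id).status
        if status == "finished":
            return
        if status in ("error", "canceled"):
            raise ValueError(f"Job {job_id} ended with status {status!r}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {max_poll_time} seconds")
```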
```
`batch_jobs` are provided and the user owns the job, the job results
catalog is fetched. While fetching, it will wait for the job to
complete if necessary (up to a configured maximum).
Only items from the job results catalog intersecting the requested
```
Suggested change:
```diff
- Only items from the job results catalog intersecting the requested
+ Only assets from the job results intersecting the requested
```
We're currently using job.get_result_assets() and converting the assets to pystac.Item before checking intersection. But get_result_assets is deprecated and we should use get_result_metadata instead, and the openEO job metadata should be a valid STAC collection with STAC items.
Afaict we do intersection checks at the pystac.Item level for all other types of STAC resources, so it feels weird to say we do it at the asset level here.
> openEO job metadata should be a valid STAC collection with STAC items.

That's unfortunately not the case in openEO below version 2.0 (which does not exist yet).
From the GET /jobs/{job_id}/results spec:

> It is a valid STAC Item (if it has spatial and temporal references included) or a valid STAC Collection (supported since openEO API version 1.1.0).
> ...
> To maintain backward compatibility, it is REQUIRED to still copy all assets in the STAC catalog structure into the collection-level assets. This requirement is planned to be removed in openEO API version 2.0.0.

The fact that we convert assets to items is indeed a bit weird, but I think that is just a temporary implementation detail (because of the spec constraints mentioned above). I think the docstring should not go into too much detail here, because it's unfortunately a confusing situation.
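For illustration, that temporary asset-to-item conversion could look roughly like this (a hypothetical sketch, not the actual implementation; the bbox/geometry/timestamp arguments are assumed to come from the job metadata document):

```python
import datetime as dt
import pystac

def _assets_to_items(assets: dict, bbox, geometry, timestamp: dt.datetime) -> list:
    # Wrap each collection-level asset in a minimal pystac.Item so the usual
    # item-level intersection checks can be reused.
    items = []
    for name, asset_dict in assets.items():
        item = pystac.Item(id=name, geometry=geometry, bbox=bbox,
                           datetime=timestamp, properties={})
        item.add_asset(name, pystac.Asset.from_dict(asset_dict))
        items.append(item)
    return items
```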
```
}
}
:param batch_jobs: Optional job registry used to resolve dependency-job
```
Suggested change:
```diff
- :param batch_jobs: Optional job registry used to resolve dependency-job
+ :param batch_jobs: Optional batch job API implementation used to resolve dependency-job
```
(also a bit vague, but in any case this parameter is not (just) about the job registry)
But in the code, batch_jobs is only used to get information from the job registry:

```python
batch_jobs.get_job_info(job_id=job_id, user_id=user_id)
batch_jobs.get_result_assets(job_id=job.id, user_id=user.user_id)
```

We might have to pass something smaller here, because we're passing an object with the following interface while we only need some basic metadata from the job. As far as a user of load_stac can tell, someone could call delete_job or delete_batch_process_results through the object being passed (a narrower interface is sketched after this list):
```
__init__
set_default_sentinel_hub_credentials
set_terrascope_access_token_getter
create_job
get_job_info
poll_job_dependencies
get_user_jobs
get_job_output_dir
get_job_work_dir
start_job
_start_job
_determine_container_image_from_process_graph
_schedule_and_get_dependencies
get_result_metadata
get_result_assets
_results_metadata_to_assets
get_results_metadata_path
load_results_metadata
_load_results_metadata_from_file
_get_providers
get_log_entries
cancel_job
delete_job
_delete_job
delete_batch_process_results
delete_batch_process_dependency_sources
```
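A hypothetical sketch of that "something smaller": a protocol exposing only the two calls load_stac actually makes (the name JobMetadataSource and the return types are assumptions for illustration):

```python
from typing import Any, Dict, Protocol

class JobMetadataSource(Protocol):
    # The only two capabilities load_stac needs from `batch_jobs`.
    def get_job_info(self, job_id: str, user_id: str) -> Any: ...
    def get_result_assets(self, job_id: str, user_id: str) -> Dict[str, dict]: ...
```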
yes, I'd also like to rework the "own job" flow a bit,
but as mentioned, this is the least relevant/urgent part of the load_stac refactor, so I would just stick to documenting the current state
```
:param stac_io: Optional custom STAC I/O handler (e.g., to enable
    authentication or caching).
    TODO: Currently never provided in any load_stac call.
```
Suggested change:
```diff
- TODO: Currently never provided in any load_stac call.
```
It is provided from CWL use cases via custom_processes in CDSE.
Added a docstring to the load_stac function to clarify its usage.
The docstring focuses on aspects that weren't immediately clear when reading the code for the first time, with the goal of helping users understand the function and its parameters without needing to dive into the implementation. I hope it can also help during refactoring to keep a bird's-eye view of the total inputs, outputs, and goals of the function.