Skip to content

Conversation

@anuthebananu
Copy link
Contributor

@anuthebananu anuthebananu commented Oct 24, 2025

Summary & Motivation

Makes calculating the smallest time interval in a cron string deterministic for most common/simple cron strings. For more complex crons we fall back to the old sampling-based approach.

Also adds a guard against negative cron intervals to the sampling-based approach that can occur when the clocks roll back in a DST transition.

How I Tested These Changes

Lots of unit tests. Claude Code did a lot of heavy lifting here.

Changelog

Insert changelog entry or delete this section.

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@anuthebananu anuthebananu changed the title improve get_smallest_cron_interval Improve get_smallest_cron_interval Oct 24, 2025
assert not is_valid_cron_string("0 0 32 1 *")


def test_get_smallest_cron_interval_basic():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved these tests to a dedicated test file and consolidated with other tests

@anuthebananu anuthebananu marked this pull request as ready for review October 24, 2025 22:03
@gibsondan
Copy link
Member

what bad thing happens if we remove the sampling-based approach entirely?

Copy link
Contributor Author

With the current implementation, we would lose the ability to accurately determine the smallest cron interval for more complex cron strings such as 0 0 15 * 1 (15th of the month and mondays)

i can try to have claude improve the approach so it can handle more of these cases

@gibsondan
Copy link
Member

what would the impact of that be in practice though? this is just to do some validation when the freshness policy is created right, do we need that to be there for all cron strings?

@gibsondan
Copy link
Member

or could it happen later when the cronstring is actually evaluated maybe?

Copy link
Contributor Author

Right now get_smallest_cron_interval is only called when instantiating a cron freshness policy as part of validation checks:

smallest_cron_interval = get_smallest_cron_interval(deadline_cron)
check.invariant(

specifically the validation is there to make sure lower_bound_delta fits within the smallest interval of the cron. lower_bound_delta is important because it denotes the earliest point prior to cron tick where the policy expects the asset to have materialized.

we allow the user to set the cron freshness policy's deadline_cron to any cron string recognized by Dagster, so we would need to be able to calculate the smallest cron interval for any of those cron strings

Copy link
Contributor Author

anuthebananu commented Oct 24, 2025

we could definitely try a conservative "best guess" approach for the more complex cases - so for a cron string with time intervals of varying length, we make a conservative guess at the smallest time interval. it's not perfect but probably good enough for this use case.

Copy link
Member

@gibsondan gibsondan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think aiming to get the sampling path out entirely would be a good idea. The main thing that troubles me about it is that it is called every time one of these objects is instantiated, including, for example, when we pull one out of the database when generating the asset graph in the daemon, for example. The fact that its non-deterministic is also a little troubling. For the first thing, if there was some way of distiniguishing between the codepaht when they are defined in code and the codepath when they are pulled out of the database that could also work.

Copy link
Contributor Author

In CronFreshnessPolicy.__new__ we already distinguish between a fresh instantiation and reading a serialized policy: https://github.com/dagster-io/dagster/blob/1a339db7b417373e55438c29d50bef85fb7c741a/python_modules/dagster/dagster/_core/definitions/freshness.py#L156-L160

I can move the get_smallest_cron_interval call to only run on fresh instantiation, we'd just have to assume that any serialized policy has a valid lower_bound_interval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants