-
Notifications
You must be signed in to change notification settings - Fork 192
feat: specify retry configuration overrides per subgraph #2172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Reviews pausedUse the following commands to manage reviews:
WalkthroughReplace primitive multi-argument subgraph-retry configuration with a structured, per-subgraph, expression-driven system: add SubgraphRetryOptions/NewSubgraphRetryOptions, a RetryExpressionManager, and an internal retry Manager; wire them into transports, graph server, router config, circuit breaker, traceclient, and tests. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Comment |
Router image scan passed✅ No security vulnerabilities found in image: |
f6f2adc to
ef2a9a6
Compare
8a7da2f to
d86e5cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)
142-144: Honor context cancellation during backoff sleep.Abort retries promptly if the request context is canceled.
- // Wait for the specified duration - time.Sleep(sleepDuration) + // Wait for the specified duration or abort if canceled + select { + case <-req.Context().Done(): + return resp, req.Context().Err() + case <-time.After(sleepDuration): + }
🧹 Nitpick comments (25)
router-tests/structured_logging_test.go (1)
712-712: Consistent migration to object-based retry options in testsAll updated
WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(... Enabled: false ...))calls look correct and keep logs deterministic by avoiding subgraph retries.If you find yourself disabling retries across many tests, consider a small helper in testenv (e.g.,
testenv.DisableSubgraphRetriesOption()) to reduce repetition and ease future changes.Also applies to: 832-832, 964-964, 1100-1100, 2214-2214
router-tests/telemetry/telemetry_test.go (2)
8695-8700: Per-subgraph retry builder usage looks correct; optionally assert “no retries” explicitlySwitching to
core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(...))withEnabled: falsematches the new API and aligns with our prior learning that algorithms shouldn’t be validated when retries are disabled. If you want a stronger assertion, consider wiring anOnRetryFunc(if exposed onSubgraphRetryOptions) to increment a counter and assert it remains zero during this test.
9639-9642: Check all datapoints, not just the first, for absence of the subgraph attributeGuard against future exporter changes that may add more datapoints by asserting across the full set.
- data2 := httpRequestsMetric.Data.(metricdata.Sum[int64]) - atts := data2.DataPoints[0].Attributes - _, ok := atts.Value(attribute.Key(key)) - require.False(t, ok) + data2 := httpRequestsMetric.Data.(metricdata.Sum[int64]) + for _, dp := range data2.DataPoints { + _, ok := dp.Attributes.Value(attribute.Key(key)) + require.False(t, ok) + }router-tests/error_handling_test.go (1)
1383-1387: Good migration to manager-based retry config; consider a helper to DRY testsThe switch to core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(...)) with Enabled:false is correct and matches our prior learning that disabled retries shouldn’t validate algorithm/fields. You repeat the same disabled config 4x; a tiny helper keeps tests concise.
Apply this to replace the repeated blocks:
- core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(config.TrafficShapingRules{ - All: config.GlobalSubgraphRequestRule{ - BackoffJitterRetry: config.BackoffJitterRetry{Enabled: false}, - }, - })), + disabledSubgraphRetries(),Add this helper near the top of the file (outside the selected ranges):
// test helper func disabledSubgraphRetries() core.Option { return core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(config.TrafficShapingRules{ All: config.GlobalSubgraphRequestRule{ BackoffJitterRetry: config.BackoffJitterRetry{Enabled: false}, }, })) }Also applies to: 1419-1423, 1455-1459, 1491-1495
router/core/graph_server.go (1)
266-287: router/core/graph_server.go: guard nil retryOptions and wrap Initialize error- if s.retryOptions.IsEnabled() { + if s.retryOptions != nil && s.retryOptions.IsEnabled() { retryExprManager := expr.NewRetryExpressionManager() // … err = retryManager.Initialize( s.retryOptions.All, s.retryOptions.SubgraphMap, routerConfig.GetSubgraphs(), ) - if err != nil { - return nil, err - } + if err != nil { + return nil, fmt.Errorf("failed to initialize retry manager: %w", err) + } s.retryManager = retryManager }router/internal/expr/retry_expression.go (3)
16-23: Constructor is fine; optional: precompile default expressionAs a small UX improvement, consider precompiling the default expression during construction so first evaluations don’t depend on prior AddExpression calls.
Example (if you decide to change signature elsewhere to allow error return):
m := &RetryExpressionManager{expressionMap: make(map[string]*vm.Program)} _ = m.AddExpression("") // ignore error; Initialize can still validate real strings return m
25-50: Normalize expression key or fix the commentComment says “normalized” but no normalization occurs. Either trim/normalize or update the comment.
import ( "fmt" "reflect" + "strings" "github.com/expr-lang/expr" "github.com/expr-lang/expr/vm" ) @@ -func (c *RetryExpressionManager) AddExpression(exprString string) error { - expression := exprString +func (c *RetryExpressionManager) AddExpression(exprString string) error { + expression := strings.TrimSpace(exprString) @@ - // Use the normalized expression string as the key for deduplication + // Use the trimmed expression string as the key for deduplication c.expressionMap[expression] = program return nil }
53-80: Consider lazy compilation on cache miss instead of silently returning falseFail-closed is safe, but lazily compiling unknown expressions can reduce surprises if Initialize missed one.
func (m *RetryExpressionManager) ShouldRetry(ctx RetryContext, expressionString string) (bool, error) { if m == nil { return false, nil } - expression := expressionString + expression := expressionString if expression == "" { expression = defaultRetryExpression } program, ok := m.expressionMap[expression] if !ok { - // If the expression wasn't pre-compiled, do not retry by default - return false, nil + // Lazily compile the expression once + if err := m.AddExpression(expression); err != nil { + return false, err + } + program = m.expressionMap[expression] }If AddExpression can be invoked concurrently in your environment, add an RWMutex around expressionMap accesses.
Would you like me to add a minimal, lock-guarded version?
router/internal/expr/retry_expression_test.go (2)
141-146: Strengthen the empty-expression testConsider asserting runtime behavior as well (e.g., ShouldRetry returns false) to guard against regressions, not just that AddExpression("") succeeds and the manager is non-nil.
err := manager.AddExpression("") assert.NoError(t, err) assert.NotNil(t, manager) +res, evalErr := manager.ShouldRetry(RetryContext{}, "") +assert.NoError(t, evalErr) +assert.False(t, res)
528-551: Parallelize subtests to speed up and isolate failures (optional)You can mark the top-level test and subtests with t.Parallel() for faster runs; table-driven subtests here are good candidates.
func TestRetryExpressionManager_WithSyscallErrors(t *testing.T) { + t.Parallel() t.Run("expression combining status and syscall errors", func(t *testing.T) { + t.Parallel()router/core/router.go (3)
187-206: SubgraphRetryOptions: shape and IsEnabled() look goodStructure and enablement logic are sound; nil map iteration is safe. Minor: consider doc comments on fields for clarity (global “All” vs per-subgraph overrides).
1908-1923: Comment mismatch: “circuit breakers” → “retry options”The comment under NewSubgraphRetryOptions mentions circuit breakers; it should refer to retry options.
- // Subgraph specific circuit breakers + // Subgraph-specific retry options
1924-1937: Default algorithm only when retries are enabled (aligns with team preference)Per team learning, when retries are disabled we shouldn’t “enforce” algorithm defaults. Guard the defaulting by Enabled to better reflect that stance.
-func newRetryConfig(config config.BackoffJitterRetry) retrytransport.RetryOptions { - algorithm := config.Algorithm - if algorithm == "" { - algorithm = retrytransport.BackoffJitter - } +func newRetryConfig(config config.BackoffJitterRetry) retrytransport.RetryOptions { + algorithm := config.Algorithm + if config.Enabled && algorithm == "" { + algorithm = retrytransport.BackoffJitter + } return retrytransport.RetryOptions{ Enabled: config.Enabled, Algorithm: algorithm, MaxRetryCount: config.MaxAttempts, MaxDuration: config.MaxDuration, Interval: config.Interval, Expression: config.Expression, } }router/core/retry_builder.go (1)
15-23: Defensive nil check for req (optional)If ShouldRetry ever gets a nil req from the transport layer, req.Context() will panic. Cheap guard:
-reqContext := getRequestContext(req.Context()) +if req == nil { + return false +} +reqContext := getRequestContext(req.Context())router-tests/retry_test.go (1)
76-78: Avoid brittle full-JSON equality in assertions.These assertions will break on harmless message changes. Prefer checking essential fields (e.g., statusCode) or comparing decoded JSON structures.
Example:- require.Equal(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body) + require.JSONEq(t, `{"data":{"employees":null}}`, res.Body) + require.Contains(t, res.Body, `"statusCode":502`)Also applies to: 127-129, 180-182, 296-298, 360-372
router/internal/retrytransport/retry_transport_test.go (2)
268-298: Add a zero-retries guard test.Add a case with MaxRetryCount=0 to assert no retries occur and OnRetry isn’t called.
Sketch:
t.Run("no retries when MaxRetryCount=0", func(t *testing.T) { retries, attempts := 0, 0 mgr := newTestManager(simpleShouldRetry, func(int, *http.Request, *http.Response, time.Duration, error) { retries++ }, RetryOptions{Enabled: true, MaxRetryCount: 0, Interval: time.Millisecond, MaxDuration: time.Second}, "sg") tr := NewRetryHTTPTransport(&MockTransport{handler: func(*http.Request) (*http.Response, error) { attempts++; return nil, errors.New("fail") }}, func(*http.Request) *zap.Logger { return zap.NewNop() }, mgr) _, err := tr.RoundTrip(withSubgraph(httptest.NewRequest("GET", "http://x", nil), "sg")) assert.Error(t, err) assert.Equal(t, 0, retries) assert.Equal(t, 1, attempts) })
383-414: Log-string assertions are inherently brittle.If wording changes, tests fail. Consider matching on a stable key/field or use zap observer to assert fields instead of formatted strings.
router/core/retry_builder_test.go (4)
52-57: Test name doesn’t match behavior.No “disabled” path is exercised here; the test only builds the function. Rename to “build function returns non-nil with empty manager” or add a real disabled-case.
-t.Run("build function when retry is disabled", func(t *testing.T) { +t.Run("build function returns non-nil with empty manager", func(t *testing.T) {
176-189: Clarify EOF comment.Comment says “EOF is now handled at transport layer,” but this test expects true here via default retryable error handling. Tweak the comment to avoid confusion.
- // EOF is now handled at transport layer, not expression + // EOF can be retried via default transport-level detection; expression is false here
442-485: Remove stale, commented-out tests.Dead code makes maintenance harder. Either port to the new manager-based API or delete.
-// TODO: Fix -//func TestProcessRetryOptions(t *testing.T) { -// ... (entire commented block) -//} +// (removed legacy ProcessRetryOptions tests; superseded by expression-manager and manager-based flows)
51-57: Add a “no request context” guard test.BuildRetryFunction returns false when request context is missing. Add a unit test to lock this in.
Sketch:
t.Run("returns false when request context missing", func(t *testing.T) { manager := expr.NewRetryExpressionManager() _ = manager.AddExpression("true") fn, _ := BuildRetryFunction(manager) req, _ := http.NewRequest("POST", "http://example.com/graphql", nil) // no ctx attached assert.False(t, fn(nil, req, &http.Response{StatusCode: 500}, "true")) })Want me to open a follow-up PR adding this?
Also applies to: 139-146
router/internal/retrytransport/manager.go (2)
88-91: Remove unreachable config-not-found branch.customSgNames is only populated when a subgraph has an entry in subgraphRetryOptions; this branch cannot trigger.
- entry, ok := subgraphRetryOptions[sgName] - if !ok { - joinErr = errors.Join(joinErr, errors.New("retry config not found for subgraph "+sgName)) - continue - } + entry := subgraphRetryOptions[sgName]
118-121: Rename local variable for clarity.Avoid “circuitBreaker” in retry code.
- if circuitBreaker, ok := c.retries[name]; ok { - return circuitBreaker + if opts, ok := c.retries[name]; ok { + return optsrouter/internal/retrytransport/retry_transport.go (2)
145-151: Option: drain before sleeping to maximize connection reuse.Draining immediately frees the connection for reuse while we wait.
- // drain the previous response before retrying - rt.drainBody(resp, requestLogger) - - // Retry the request - resp, err = rt.roundTripper.RoundTrip(req) + // drain the previous response before waiting/retrying + rt.drainBody(resp, requestLogger) + // Retry the request + resp, err = rt.roundTripper.RoundTrip(req)
50-56: Log level nit: malformed Retry-After.Consider Warn instead of Error to reduce noise for client-driven header issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (17)
router-tests/error_handling_test.go(4 hunks)router-tests/panic_test.go(2 hunks)router-tests/retry_test.go(10 hunks)router-tests/structured_logging_test.go(5 hunks)router-tests/telemetry/telemetry_test.go(2 hunks)router/core/graph_server.go(4 hunks)router/core/retry_builder.go(2 hunks)router/core/retry_builder_test.go(15 hunks)router/core/router.go(3 hunks)router/core/router_config.go(2 hunks)router/core/supervisor_instance.go(1 hunks)router/core/transport.go(6 hunks)router/internal/expr/retry_expression.go(2 hunks)router/internal/expr/retry_expression_test.go(2 hunks)router/internal/retrytransport/manager.go(1 hunks)router/internal/retrytransport/retry_transport.go(4 hunks)router/internal/retrytransport/retry_transport_test.go(33 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/core/supervisor_instance.gorouter/core/router.gorouter-tests/telemetry/telemetry_test.gorouter/core/router_config.gorouter-tests/retry_test.gorouter-tests/error_handling_test.gorouter-tests/structured_logging_test.gorouter-tests/panic_test.go
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Applied to files:
router/core/supervisor_instance.gorouter/core/router.gorouter-tests/retry_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Applied to files:
router-tests/telemetry/telemetry_test.gorouter-tests/error_handling_test.gorouter-tests/structured_logging_test.gorouter-tests/panic_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.
Applied to files:
router/internal/expr/retry_expression_test.gorouter/core/retry_builder_test.gorouter/core/retry_builder.go
📚 Learning: 2025-08-28T09:59:19.956Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.
Applied to files:
router/core/retry_builder.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.
Applied to files:
router/internal/retrytransport/retry_transport_test.go
🧬 Code graph analysis (17)
router/core/supervisor_instance.go (1)
router/core/router.go (2)
WithSubgraphRetryOptions(1776-1780)NewSubgraphRetryOptions(1908-1922)
router/core/router.go (2)
router/internal/retrytransport/manager.go (3)
RetryOptions(25-32)OnRetryFunc(17-17)BackoffJitter(22-22)router/pkg/config/config.go (2)
TrafficShapingRules(164-171)BackoffJitterRetry(231-238)
router-tests/telemetry/telemetry_test.go (2)
router/core/router.go (2)
WithSubgraphRetryOptions(1776-1780)NewSubgraphRetryOptions(1908-1922)router/pkg/config/config.go (3)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)
router/internal/expr/retry_expression_test.go (2)
router/internal/expr/retry_expression.go (1)
NewRetryExpressionManager(19-23)router/internal/expr/retry_context.go (1)
LoadRetryContext(132-146)
router/core/router_config.go (1)
router/core/router.go (1)
SubgraphRetryOptions(187-191)
router/internal/expr/retry_expression.go (1)
router/internal/expr/retry_context.go (1)
RetryContext(13-19)
router-tests/retry_test.go (4)
router/core/router.go (3)
NewSubgraphRetryOptions(1908-1922)WithSubgraphRetryOptions(1776-1780)Option(172-172)router/pkg/config/config.go (4)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)Config(985-1059)router/internal/retrytransport/manager.go (1)
OnRetryFunc(17-17)router-tests/testenv/testenv.go (4)
Run(105-122)RunWithError(135-150)Environment(1727-1763)SubgraphsConfig(365-378)
router-tests/error_handling_test.go (2)
router/core/router.go (2)
WithSubgraphRetryOptions(1776-1780)NewSubgraphRetryOptions(1908-1922)router/pkg/config/config.go (3)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)
router/core/graph_server.go (4)
router/internal/retrytransport/manager.go (3)
Manager(34-40)NewManager(42-49)OnRetryFunc(17-17)router/internal/expr/expr_manager.go (1)
Manager(15-17)router/internal/expr/retry_expression.go (1)
NewRetryExpressionManager(19-23)router/core/retry_builder.go (1)
BuildRetryFunction(13-46)
router-tests/structured_logging_test.go (2)
router/core/router.go (2)
WithSubgraphRetryOptions(1776-1780)NewSubgraphRetryOptions(1908-1922)router/pkg/config/config.go (3)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)
router/internal/retrytransport/manager.go (2)
router/internal/expr/expr.go (1)
Request(65-74)router/internal/expr/retry_expression.go (1)
RetryExpressionManager(12-14)
router/core/retry_builder_test.go (3)
router/internal/expr/retry_expression.go (1)
NewRetryExpressionManager(19-23)router/core/retry_builder.go (1)
BuildRetryFunction(13-46)router/core/context.go (2)
OperationTypeQuery(488-488)OperationTypeMutation(489-489)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
Manager(34-40)router/internal/expr/expr.go (2)
Request(65-74)Context(34-38)router/internal/context/keys.go (1)
CurrentSubgraphContextKey(7-7)
router/core/retry_builder.go (2)
router/internal/expr/retry_expression.go (1)
RetryExpressionManager(12-14)router/internal/retrytransport/manager.go (1)
ShouldRetryFunc(16-16)
router/internal/retrytransport/retry_transport_test.go (5)
router/internal/expr/expr.go (2)
Request(65-74)Context(34-38)router/internal/context/keys.go (1)
CurrentSubgraphContextKey(7-7)router/internal/retrytransport/manager.go (4)
OnRetryFunc(17-17)RetryOptions(25-32)Manager(34-40)NewManager(42-49)router/internal/expr/retry_expression.go (1)
NewRetryExpressionManager(19-23)router/internal/retrytransport/retry_transport.go (1)
NewRetryHTTPTransport(77-87)
router-tests/panic_test.go (2)
router/core/router.go (2)
WithSubgraphRetryOptions(1776-1780)NewSubgraphRetryOptions(1908-1922)router/pkg/config/config.go (3)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)
router/core/transport.go (2)
router/internal/retrytransport/manager.go (1)
Manager(34-40)router/internal/retrytransport/retry_transport.go (1)
NewRetryHTTPTransport(77-87)
🔇 Additional comments (23)
router/core/router_config.go (1)
79-79: Switch to pointer: ensure initialization path or guard against nilChanging
retryOptionsto*SubgraphRetryOptionsis fine, but it increases the chance of nil deref at use sites unless always initialized.Please confirm that
core.Config.retryOptionsis assigned during router setup. Currently,WithSubgraphRetryOptionssetsr.retryOptions(see router/core/router.go), but I don’t see wherec.retryOptions(this struct) is populated. If not populated, prefer guarding inUsage()(see below) or setting it alongside the router field.router/core/supervisor_instance.go (1)
196-197: Guard disabled subgraph retries before defaulting algorithm
In NewSubgraphRetryOptions (router/core/router.go ¶1909–1921), you always call newRetryConfig for each subgraph, which unconditionally defaults an empty algorithm—even when Enabled==false. Per our retry-disabled rule, skip both the call and algorithm default for subgraphs with v.BackoffJitterRetry.Enabled==false. Wrap the loop entry in an if v.BackoffJitterRetry.Enabled check so disabled rules retain an empty algorithm.⛔ Skipped due to learnings
Learnt from: SkArchon PR: wundergraph/cosmo#2167 File: router-tests/structured_logging_test.go:712-713 Timestamp: 2025-08-28T10:07:27.637Z Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.Learnt from: endigma PR: wundergraph/cosmo#2155 File: router/core/router.go:1857-1866 Timestamp: 2025-08-20T10:08:17.857Z Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.Learnt from: endigma PR: wundergraph/cosmo#2155 File: router/core/router.go:1857-1866 Timestamp: 2025-08-20T10:08:17.857Z Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.Learnt from: SkArchon PR: wundergraph/cosmo#2167 File: router/core/supervisor_instance.go:196-204 Timestamp: 2025-08-28T09:59:19.956Z Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.router-tests/panic_test.go (1)
51-55: Good: retries explicitly disabled for panic-path stabilitySwitching to
core.NewSubgraphRetryOptions(... Enabled: false ...)ensures the router won’t add subgraph retries that could mask panic timing in these tests. Matches the disabled-validation expectation.Also applies to: 87-91
router/core/graph_server.go (3)
43-43: Import looks correctretrytransport is used below; import is appropriate.
104-105: Field addition LGTMStoring the retry manager on graphServer is fine and consistent with how other managers (circuit) are handled.
1208-1209: Wiring RetryManager into transport options is correctExecutorConfigurationBuilder receives the manager; assuming transport handles nil, this is safe.
If not already covered in tests, add one integration that omits WithSubgraphRetryOptions entirely to ensure nil RetryManager is handled gracefully.
router/internal/expr/retry_expression_test.go (2)
125-135: Manager-based API usage looks correctInstantiating the manager, pre-registering expressions with AddExpression, and evaluating via ShouldRetry(ctx, expr) matches the new API. Good assertions on compile-time errors for invalid expressions.
496-525: Good syscall coverage with the manager pathNice coverage for ECONNREFUSED/ECONNRESET/ETIMEDOUT under the manager-evaluated expression. This ensures helper wiring remains intact after the API shift.
router/core/router.go (1)
1776-1779: ConfirmedretryOptionsfield exists on Router.Config
Router embedsConfig, which declaresretryOptions *SubgraphRetryOptions, so theWithSubgraphRetryOptionssetter compiles as expected.router/core/retry_builder.go (1)
55-58: EOF-as-default-retryable matches stated team preferenceKeeping this simple with string matching for “unexpected eof” is consistent with prior guidance to wait for customer feedback before broadening detection.
router-tests/retry_test.go (4)
38-51: Good adoption of builder-style retry options.Switching to NewSubgraphRetryOptions + WithSubgraphRetryOptions makes tests clearer and mirrors the new API.
247-306: Exact sleep equality on Retry-After is OK here.Asserting sleepDuration equals header seconds is deterministic since jitter shouldn’t apply when Retry-After is honored.
649-674: Great coverage of mixed global + per-subgraph overrides.Validates default-from-All and targeted override behavior cleanly.
Also applies to: 676-719
363-371: Unable to run flakiness check in sandbox
Thego testinvocation failed due to missing Go module context. Please run locally and confirm stability:go test ./router-tests -run 'TestFlakyRetry/verify max duration is not exceeded on intervals$' -count=50 -vIf you observe intermittent failures, remove
t.Parallel()from this subtest or widen the timing tolerances around the expected duration.router/internal/retrytransport/retry_transport_test.go (3)
58-61: withSubgraph helper looks correct.Uses the typed context key; safe and explicit.
204-231: Solid body-draining tests.Covers read and close failure paths with log assertions. Good observability checks.
Also applies to: 300-357
483-538: Thorough 429 coverage (seconds, HTTP-date, invalid, negative, mixed).Great breadth; confirms shouldUseRetryAfter behavior and fallbacks.
Also applies to: 540-596, 598-636, 638-697, 699-741, 742-782, 784-863
router/core/retry_builder_test.go (3)
60-85: Default expression behavior assertions look good.Covers 5xx and default network errors.
241-270: Mutation short-circuit behavior well covered.Ensures expressions can’t override the “no retry on mutations” rule.
390-439: Comprehensive positive matrix + mutation negative matrix.Good end-to-end validation of the expression path.
router/internal/retrytransport/manager.go (1)
61-69: LGTM on skipping algorithm validation when retries are disabled.Disabled per-subgraph entries are excluded from validation, aligning with the prior decision captured in learnings.
router/core/transport.go (2)
105-107: Wiring to Manager-based retry looks correct.Conditional wrapping with NewRetryHTTPTransport is clean and preserves the base round tripper path when disabled.
73-75: All call sites updated. Verified NewCustomTransport/NewRetryHTTPTransport and RetryManager usage in router/core/transport.go (lines 106, 367, 388, 428) and router/core/graph_server.go (line 1208).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)
145-151: Rewind request body before retrying; otherwise subsequent attempts may send an empty/partial body.GraphQL subgraph calls are typically POST with bodies. Without resetting Body via GetBody, retries can be broken.
// drain the previous response before retrying rt.drainBody(resp, requestLogger) // Retry the request - resp, err = rt.roundTripper.RoundTrip(req) + // Reset request body if necessary + if req.Body != nil && req.Body != http.NoBody { + if req.GetBody == nil { + requestLogger.Warn("Cannot retry request with non-rewindable body") + break + } + newBody, gbErr := req.GetBody() + if gbErr != nil { + requestLogger.Error("Failed to reset request body for retry", zap.Error(gbErr)) + return resp, gbErr + } + req.Body = newBody + } + resp, err = rt.roundTripper.RoundTrip(req)
♻️ Duplicate comments (3)
router-tests/retry_test.go (3)
503-531: Great: base-level invalid algorithm is correctly rejected at init.
Matches expected behavior and error text.
533-558: Nice: invalid algorithm is ignored when retries are disabled (base).
This addresses the previously requested coverage.
560-587: Nice: per-subgraph “disabled ignores algorithm” case covered.
Prevents regressions in validation gating.
🧹 Nitpick comments (14)
router-tests/retry_test.go (8)
38-51: DRY the options setup with a small helper to reduce duplication.
Multiple tests repeat the same NewSubgraphRetryOptions + WithSubgraphRetryOptions pattern; a local builder will simplify maintenance.Apply this helper once near the top and replace repeated blocks:
+func withRetryOpts(cfg config.TrafficShapingRules, onRetry func(int, *http.Request, *http.Response, time.Duration, error)) core.Option { + opts := core.NewSubgraphRetryOptions(cfg) + opts.OnRetryFunc = onRetry + return core.WithSubgraphRetryOptions(opts) +}Also applies to: 92-105, 145-158, 199-212, 260-273, 324-337, 391-404, 451-464
224-231: Decouple success trigger from OnRetry callback semantics.
Using onRetryCounter to flip success couples the test to callback timing. Drive success via request attempt count to avoid false positives if OnRetry invocation timing changes.- if onRetryCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) { + if serviceCallsCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) { w.WriteHeader(http.StatusOK) _, _ = w.Write([]byte(`{"data":{"employees":[{"id":1},{"id":2}]}}`)) } else { w.WriteHeader(http.StatusBadGateway) } serviceCallsCounter.Add(1)
256-257: Speed up the 429 + Retry-After test.
This test currently sleeps ~3s (3 retries × 1s). One retry is sufficient to validate header precedence.- maxRetryCount := 3 + maxRetryCount := 1
433-434: Avoid flakiness: assert a range instead of NotEqual.
Exact equality with retryInterval can sporadically occur depending on jitter/rounding.- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval)) + d := time.Duration(sleepDuration.Load()) + require.Greater(t, d, time.Duration(0)) + require.Less(t, d, 2*retryInterval)
494-495: Same here: use bounded checks to reduce flakiness.
Zero Retry-After should fall back to jitter/backoff; prefer range assertions.- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval)) + d := time.Duration(sleepDuration.Load()) + require.Greater(t, d, time.Duration(0)) + require.Less(t, d, 2*retryInterval)
360-362: Prefer JSON structural equality over string equality for response bodies.
String comparisons are brittle to formatting/ordering changes.- require.Equal(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body) + require.JSONEq(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)(Apply similarly to the other exact body checks.)
Also applies to: 681-685, 761-763
366-372: Widen timing slack to reduce CI flakes.
A 20ms/100ms margin can be tight under load; slightly larger buffers help.- shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (20 * time.Millisecond) + shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (50 * time.Millisecond) ... - expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond) + expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (150 * time.Millisecond)
621-697: Per-subgraph retry counts validated correctly.
Isolated counters for employees vs test1 look good.If useful, I can add a combined query hitting both subgraphs in a single operation to verify independent retry accounting within one request.
router/internal/retrytransport/retry_transport.go (6)
142-144: Honor context cancellation/deadlines during backoff sleep.Avoid sleeping past ctx deadlines; exit early if canceled.
- // Wait for the specified duration - time.Sleep(sleepDuration) + // Wait for the specified duration or context cancellation + if sleepDuration > 0 { + select { + case <-req.Context().Done(): + return resp, req.Context().Err() + case <-time.After(sleepDuration): + } + }
59-66: Also honor Retry-After for 503 Service Unavailable.RFC 7231 allows Retry-After on 503 responses as well as 429. This improves interoperability with upstreams.
- if resp == nil || resp.StatusCode != http.StatusTooManyRequests { + if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) { return 0, false }
124-128: Generic log for Retry-After usage (it may be 429 or 503).- requestLogger.Debug("Using Retry-After header for 429 response", + requestLogger.Debug("Using Retry-After header", zap.Int("retry", retries), zap.String("url", req.URL.String()), zap.Duration("retry-after", sleepDuration), )
51-53: Lower log level for malformed Retry-After headers.Header parsing errors are client-observable but non-fatal; debug (or warn) is less noisy in production.
- logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin)) + logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
173-176: Tighten the comment wording.- // When we close the body only will go internally marks the persisted connection as true - // which is important so that it can reuse the connection internally for retrying + // Draining then closing marks the connection as reusable, allowing the client to reuse it for retries.
105-111: Optional: short-circuit when retries are effectively disabled.If MaxRetryCount == 0 (or options are disabled), return immediately to avoid backoff setup and logging overhead. Note: per team learning, IsEnabled() handles nil receivers safely if available.
retryOptions := rt.retryManager.GetSubgraphOptions(subgraph) - if retryOptions == nil { - return resp, nil - } + if retryOptions == nil { + return resp, err + } + // Optional fast-exit: + // if retryOptions.MaxRetryCount == 0 { return resp, err } + // or if available: + // if !retryOptions.IsEnabled() { return resp, err }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
router-tests/retry_test.go(10 hunks)router/internal/retrytransport/manager.go(1 hunks)router/internal/retrytransport/retry_transport.go(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- router/internal/retrytransport/manager.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.
Applied to files:
router-tests/retry_test.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router-tests/retry_test.go
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().
Applied to files:
router-tests/retry_test.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Applied to files:
router-tests/retry_test.go
🧬 Code graph analysis (2)
router-tests/retry_test.go (4)
router/core/router.go (3)
NewSubgraphRetryOptions(1908-1922)WithSubgraphRetryOptions(1776-1780)Option(172-172)router/pkg/config/config.go (4)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)Config(985-1059)router/internal/retrytransport/manager.go (1)
OnRetryFunc(17-17)router-tests/testenv/testenv.go (4)
Run(105-122)RunWithError(135-150)Environment(1727-1763)SubgraphsConfig(365-378)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
Manager(34-40)router/internal/expr/expr.go (2)
Request(65-74)Context(34-38)router/internal/context/keys.go (1)
CurrentSubgraphContextKey(7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: integration_test (./events)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: image_scan
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (4)
router-tests/retry_test.go (4)
10-13: Good import adjustments for new API usage.
The added imports align with the builder-style options and testify assertions.
16-23: OnRetry counter helper is concise and thread-safe.
Atomic counters and nil-guard for duration are correct.
589-619: Per-subgraph invalid algorithm correctly fails fast.
Solid negative test to mirror the base-level case.
699-775: Mixed global + per-subgraph overrides behave as intended.
Employees inherit global, test1 overrides; counters confirm isolation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)
147-152: Critical: retries will resend a consumed request bodyOn retry, req.Body is already consumed. If req.GetBody is unset, retries are unsafe; if set, reset Body before each retry. Otherwise you’ll send an empty body or fail.
Apply:
- // Wait for the specified duration - time.Sleep(sleepDuration) - - // drain the previous response before retrying - rt.drainBody(resp, requestLogger) + // Drain previous response ASAP to free the connection for reuse + rt.drainBody(resp, requestLogger) + + // Respect request cancellation while waiting + if sleepDuration > 0 { + timer := time.NewTimer(sleepDuration) + select { + case <-req.Context().Done(): + timer.Stop() + return resp, req.Context().Err() + case <-timer.C: + } + } + + // Rewind request body if present + if req.Body != nil { + if req.GetBody == nil { + requestLogger.Debug("cannot retry: non-rewindable request body") + break + } + if newBody, gErr := req.GetBody(); gErr != nil { + return resp, gErr + } else { + req.Body = newBody + } + } // Retry the request resp, err = rt.roundTripper.RoundTrip(req)
♻️ Duplicate comments (1)
router/internal/retrytransport/retry_transport.go (1)
104-108: Nice fix: avoids a second request when no retry optionsReturning the outcome of the first attempt preserves semantics and errors.
🧹 Nitpick comments (5)
router/internal/retrytransport/retry_transport.go (5)
58-75: Honor Retry-After for 503 as well and accept zero-second delaysRFC 7231 defines Retry-After for 503 and 429. Also treat “0” as a valid immediate retry (no sleep), not a fallback to random backoff.
Apply:
-func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) { - if resp == nil || resp.StatusCode != http.StatusTooManyRequests { +func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) { + if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) { return 0, false } @@ - return duration, duration > 0 + // Accept immediate retry (0s) as valid + return duration, duration >= 0
22-56: Minor: trim header value before parsing; lower log level on parse failuresReal-world proxies sometimes add whitespace; parsing “ 120 ” should work. Also consider debug level to reduce noise.
Apply:
import ( "errors" "io" "net/http" + "strings" "strconv" "time" @@ -func parseRetryAfterHeader(logger *zap.Logger, retryAfter string) time.Duration { +func parseRetryAfterHeader(logger *zap.Logger, retryAfter string) time.Duration { if retryAfter == "" { return 0 } var errJoin error - seconds, err := strconv.Atoi(retryAfter) + retryAfter = strings.TrimSpace(retryAfter) + seconds, err := strconv.Atoi(retryAfter) @@ - if errJoin != nil { - logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin)) - } + if errJoin != nil { + logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin)) + }
144-149: Drain before sleeping to free connections earlierDraining before the wait reduces connection pinning during long sleeps.
(Already included in the diff above.)
183-187: Guard against nil resp in helpers (defensive)Not strictly necessary given current call sites, but adding a nil check makes the helper safer.
Apply:
func isResponseOK(resp *http.Response) bool { + if resp == nil { + return false + } // Ensure we don't wait for no reason when subgraphs don't behave // spec-compliant and returns a different status code than 200. return resp.StatusCode >= 200 && resp.StatusCode < 300 }
175-181: Polish the comment grammarImprove clarity of the connection reuse note.
Apply:
- // When we close the body only will go internally marks the persisted connection as true - // which is important so that it can reuse the connection internally for retrying + // Fully consuming and closing the body marks the connection reusable + // so the Transport can keep-alive and reuse it for retries.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
router/internal/retrytransport/manager.go(1 hunks)router/internal/retrytransport/retry_transport.go(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- router/internal/retrytransport/manager.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.
Applied to files:
router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().
Applied to files:
router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:26:16.245Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.245Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.
Applied to files:
router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/internal/retrytransport/retry_transport.go
🧬 Code graph analysis (1)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
Manager(34-40)router/internal/expr/expr.go (2)
Request(65-74)Context(34-38)router/internal/context/keys.go (1)
CurrentSubgraphContextKey(7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: Analyze (go)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: image_scan
- GitHub Check: build_push_image
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
🔇 Additional comments (5)
router/internal/retrytransport/retry_transport.go (5)
16-20: Good: transport now composes roundTripper and retryManager cleanlyField injection improves testability and decoupling.
77-87: Constructor wiring looks solidSignature change and assignments are consistent.
139-142: OnRetry hook usage is correctAPI matches Manager.OnRetry(retries, req, resp, sleepDuration, err).
110-111: Usage correct: backoff.New expects (max, interval) signature
The argumentsretryOptions.MaxDuration(max) andretryOptions.Interval(interval) align withfunc New(max time.Duration, interval time.Duration) *Backoff; no changes needed.
96-103: VerifyCurrentSubgraphContextKeyvalue type
Ensure that any value stored underrcontext.CurrentSubgraphContextKeyis always astring; adjust the type assertion if other types are ever used.
StarpTech
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
router/core/graph_server.go (1)
281-283: Wrap retry manager init errors with context for faster triage.Propagate a contextualized error to distinguish init failures in logs.
Apply this diff:
- if err != nil { - return nil, err - } + if err != nil { + return nil, fmt.Errorf("retry manager initialization failure: %w", err) + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
router/core/graph_server.go(4 hunks)router/core/router.go(3 hunks)router/core/router_config.go(3 hunks)router/core/transport.go(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- router/core/router_config.go
- router/core/router.go
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.
Applied to files:
router/core/graph_server.go
🧬 Code graph analysis (2)
router/core/transport.go (2)
router/internal/retrytransport/manager.go (1)
Manager(33-39)router/internal/retrytransport/retry_transport.go (1)
NewRetryHTTPTransport(77-87)
router/core/graph_server.go (3)
router/internal/retrytransport/manager.go (3)
Manager(33-39)NewManager(41-48)OnRetryFunc(16-16)router/internal/expr/retry_expression.go (1)
NewRetryExpressionManager(19-23)router/core/retry_builder.go (1)
BuildRetryFunction(13-46)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: build_test
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
🔇 Additional comments (3)
router/core/graph_server.go (1)
269-286: Centralized retry manager wiring looks good; no nil defensive needed for BuildRetryFunction.Creating the RetryExpressionManager, building ShouldRetryFunc, and passing OnRetry is the right flow. Skipping a nil‑check on the expr manager aligns with the team’s preference noted in learnings.
router/core/transport.go (2)
351-352: Retry manager plumbing through factory/options/call site looks consistent.Also applies to: 372-373, 393-395, 439-441
73-75: Constructor signature changed — verify all call sites updatedThe constructor now replaces retryOptions with retryManager and the parameter order was changed; confirm every call site was updated to the new order/type to avoid misordered args. Search for usages (e.g. git grep -n "NewCustomTransport(" || rg -n --hidden -uu "NewCustomTransport\s*\(") and fix any mismatches.
d428949 to
de80950
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (6)
router/internal/retrytransport/retry_transport_test.go (1)
64-71: Enable retries in test manager to mirror prod behavior.RetryOptions.Enabled defaults to false; set it explicitly so tests don’t rely on downstream behavior. (Repeats a prior suggestion.)
func newTestManager(shouldRetry func(error, *http.Request, *http.Response) bool, onRetry OnRetryFunc, opts RetryOptions, subgraphName string) *Manager { mgr := NewManager(expr.NewRetryExpressionManager(), func(err error, req *http.Request, resp *http.Response, exprString string) bool { return shouldRetry(err, req, resp) }, onRetry) // attach options for the subgraph + opts.Enabled = true mgr.retries[subgraphName] = &opts return mgr }router/core/transport.go (1)
120-124: Nil-pointer concern addressed; keep as-is.retryManager.IsEnabled() is nil-receiver safe per codebase convention, so no guard is needed. This supersedes the earlier bot warning about a potential panic.
Consider a small test ensuring: with retries disabled (nil manager), NewCustomTransport picks baseRoundTripper and does not panic.
router-tests/retry_test.go (4)
495-496: Ditto: assert non-zero backoff for zero Retry-After case.- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval)) + require.Greater(t, sleepDuration.Load(), int64(0)) + require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
534-559: “Disabled ignores algorithm” base test — nice.Matches team guidance; prevents regressions.
561-589: Per‑subgraph “disabled ignores algorithm” test — nice.Good mirror of the base case.
649-677: Same: relax error matching for per‑subgraph invalid expression.- require.ErrorContains(t, err, "failed to add retry expression for subgraph employees: failed to compile retry expression: line 1, column 0: unknown name truethere") + require.ErrorContains(t, err, "failed to add retry expression for subgraph employees") + require.ErrorContains(t, err, "failed to compile retry expression") + require.ErrorContains(t, err, "unknown name")
🧹 Nitpick comments (17)
router-tests/websocket_test.go (1)
709-719: Set algorithm explicitly when retries are enabled; consider a narrower expression.When Enabled=true, leaving Algorithm empty can trip validation depending on newRetryConfig. Also, "true" will retry on every failure (incl. 4xx), which can mask client/config errors in this test. Suggest setting Algorithm and using the default expression.
opts := core.NewSubgraphRetryOptions(config.TrafficShapingRules{ All: config.GlobalSubgraphRequestRule{ BackoffJitterRetry: config.BackoffJitterRetry{ Enabled: true, + Algorithm: "backoff_jitter", MaxAttempts: 5, MaxDuration: 10 * time.Second, Interval: 200 * time.Millisecond, - Expression: "true", + Expression: "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()", }, }, })If unconditional retry is intentional here, keep Expression: "true" but still set Algorithm.
router-tests/circuit_breaker_test.go (2)
627-758: WebSocket conn leak on success path — close the client connection.After the successful subscription flow, the WebSocket client connection isn’t closed. Close it to avoid leaking fds in CI and to keep tests hermetic.
Apply:
@@ xEnv.WaitForSubscriptionCount(0, time.Second*5) + require.NoError(t, conn.Close()) }) })
709-709: Stabilize assertion: compare JSON payloads semantically, not byte-for-byte.Use JSONEq to avoid brittleness from spacing/ordering differences in error frames.
- require.Equal(t, defaultErrorMessage, string(message)) + require.JSONEq(t, defaultErrorMessage, string(message))router/internal/circuit/breaker.go (2)
30-39: Don’t unconditionally override context-derived subgraph; fall back only when override is non-empty.Right now the context value is computed and immediately discarded. Preserve it when the override returns an empty string to avoid routing “no subgraph” and silently bypassing breakers.
- subgraph = rt.getActiveSubgraphName(req) + if fn := rt.getActiveSubgraphName; fn != nil { + if sg := fn(req); sg != "" { + subgraph = sg + } + }
18-24: Defensive default for nil getActiveSubgraphName.A nil callback will panic on RoundTrip. Guard in the constructor with a safe default to the context-derived subgraph.
-func NewCircuitTripper(roundTripper http.RoundTripper, breaker *Manager, logger func(req *http.Request) *zap.Logger, getActiveSubgraphName func(req *http.Request) string) *Breaker { +func NewCircuitTripper(roundTripper http.RoundTripper, breaker *Manager, logger func(req *http.Request) *zap.Logger, getActiveSubgraphName func(req *http.Request) string) *Breaker { + if getActiveSubgraphName == nil { + getActiveSubgraphName = func(req *http.Request) string { + if sg, ok := req.Context().Value(rcontext.CurrentSubgraphContextKey{}).(string); ok { + return sg + } + return "" + } + } return &Breaker{ circuitBreaker: breaker, loggerFunc: logger, roundTripper: roundTripper, getActiveSubgraphName: getActiveSubgraphName, } }router/internal/retrytransport/retry_transport.go (1)
135-143: Respect request cancellation during backoff and reset request body before retries.
- Sleep should be ctx-aware to avoid hanging on canceled requests.
- For requests with bodies, reinitialize req.Body via GetBody before re-sending; otherwise retries after a consumed body can fail.
- // Wait for the specified duration - time.Sleep(sleepDuration) + // Wait for the specified duration, but respect cancellation + select { + case <-time.After(sleepDuration): + case <-req.Context().Done(): + return resp, req.Context().Err() + } // drain the previous response before retrying rt.drainBody(resp, requestLogger) // Retry the request - resp, err = rt.roundTripper.RoundTrip(req) + // Rewind body if present + if req.Body != nil { + if req.GetBody == nil { + requestLogger.Warn("Request has non-rewindable body; aborting retries") + return resp, err + } + if nb, gErr := req.GetBody(); gErr != nil { + requestLogger.Warn("Failed to reset request body", zap.Error(gErr)) + return resp, gErr + } else { + req.Body = nb + } + } + resp, err = rt.roundTripper.RoundTrip(req)router/core/transport.go (2)
87-89: Nil-safe logger fallback: good; tiny alloc nit.Returning zap.NewNop() per call is fine; you can cache it to avoid repeated allocs.
Apply:
- getRequestContextLogger := func(req *http.Request) *zap.Logger { + nopLogger := zap.NewNop() + getRequestContextLogger := func(req *http.Request) *zap.Logger { reqContext := getRequestContext(req.Context()) if reqContext == nil { - return zap.NewNop() + return nopLogger } return reqContext.Logger() }
382-383: TransportOptions adds RetryManager: LGTM; clarify nil is allowed.Optional: add a doc comment that nil means “retries disabled/default only.”
router-tests/retry_test.go (9)
223-233: Avoid coupling test behavior to OnRetryFunc invocation semantics.Using onRetryCounter to flip success implicitly relies on when OnRetryFunc fires. Prefer basing success on serviceCallsCounter (observable I/O), which is less brittle.
Apply:
- if onRetryCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) { + if serviceCallsCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) { w.WriteHeader(http.StatusOK) _, _ = w.Write([]byte(`{"data":{"employees":[{"id":1},{"id":2}]}}`)) } else { w.WriteHeader(http.StatusBadGateway) } serviceCallsCounter.Add(1)
297-306: Respecting Retry-After is correct; consider also covering HTTP-date format.Spec allows Retry-After as delta-seconds or an HTTP-date. Add a sibling subtest exercising the HTTP-date form to lock parser behavior.
Would you like me to draft that subtest?
367-373: Deflake timing bounds.The 20/100 ms margins may be tight under CI noise. Widen slightly to reduce flakes while preserving the assertion intent.
- shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (20 * time.Millisecond) - require.Less(t, requestDuration, shouldBeLessThanDuration) - - // We reduce by 100 for any jitter - expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond) - require.GreaterOrEqual(t, requestDuration, expectedMinDuration) + upperBound := (time.Duration(maxRetryCount-1) * retryInterval) - (50 * time.Millisecond) + require.Less(t, requestDuration, upperBound) + + // Allow more slack for jitter and scheduler variability + lowerBound := (time.Duration(maxRetryCount-1) * maxDuration) - (150 * time.Millisecond) + require.GreaterOrEqual(t, requestDuration, lowerBound)
430-435: Strengthen assertion: also assert non-zero backoff.Jitter may occasionally equal the nominal interval; NotEqual is weak. At minimum, assert it’s > 0 to ensure a backoff occurred.
- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval)) + require.Greater(t, sleepDuration.Load(), int64(0)) + require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
621-647: Relax overly specific error substring.Error details like “line 1, column 0” can change. Assert the stable portions to avoid brittle tests.
- require.ErrorContains(t, err, "failed to add base retry expression: failed to compile retry expression: line 1, column 0: unknown name truethere") + require.ErrorContains(t, err, "failed to add base retry expression") + require.ErrorContains(t, err, "failed to compile retry expression") + require.ErrorContains(t, err, "unknown name")
757-763: Tidy: underscore unused handler param in middleware.Avoids linter noise and clarifies intent.
- Middleware: func(handler http.Handler) http.Handler { + Middleware: func(_ http.Handler) http.Handler {Also applies to: 835-841
931-959: Close the WS client connection with defer.Ensures cleanup on early failures.
- conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil) + conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil) + defer func() { _ = conn.Close() }() @@ - require.NoError(t, conn.Close()) + // Closed by deferAlso applies to: 997-1004, 1027-1029
865-929: Cover “All enabled + specific subgraph disabled” override.Add a subtest where All.Enabled=true but Subgraphs["test1"].Enabled=false, and assert test1 has zero retries while employees inherits All.
I can draft this subtest if you want it in this PR.
845-852: Prefer JSONEq for response assertions to reduce brittleness.
Replace raw string equality checks of JSON responses with require.JSONEq(t, expectedJSON, res.Body) to avoid failures from insignificant formatting/ordering changes.
Affected: router-tests/retry_test.go — lines 78, 129, 182, 241, 298, 362, 428, 489, 771, 781, 849, 859, 915, 925.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
router-tests/circuit_breaker_test.go(2 hunks)router-tests/retry_test.go(11 hunks)router-tests/websocket_test.go(2 hunks)router/core/router_config.go(2 hunks)router/core/transport.go(6 hunks)router/internal/circuit/breaker.go(2 hunks)router/internal/retrytransport/retry_transport.go(4 hunks)router/internal/retrytransport/retry_transport_test.go(17 hunks)router/internal/traceclient/traceclient.go(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- router/core/router_config.go
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.
Applied to files:
router/internal/traceclient/traceclient.go
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.
Applied to files:
router/internal/retrytransport/retry_transport_test.gorouter/internal/retrytransport/retry_transport.gorouter/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.
Applied to files:
router/internal/retrytransport/retry_transport_test.gorouter-tests/websocket_test.gorouter-tests/retry_test.gorouter/internal/retrytransport/retry_transport.gorouter/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().
Applied to files:
router/internal/retrytransport/retry_transport_test.gorouter-tests/websocket_test.gorouter-tests/retry_test.gorouter/internal/retrytransport/retry_transport.gorouter/core/transport.go
📚 Learning: 2025-09-16T16:00:04.078Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/transport.go:0-0
Timestamp: 2025-09-16T16:00:04.078Z
Learning: The IsEnabled() method on *retrytransport.Manager in router/core/transport.go safely handles nil receivers by returning false, so no external nil checks are needed when calling retryManager.IsEnabled().
Applied to files:
router/internal/retrytransport/retry_transport_test.gorouter/internal/retrytransport/retry_transport.gorouter/core/transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router-tests/websocket_test.gorouter-tests/retry_test.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.
Applied to files:
router-tests/circuit_breaker_test.gorouter-tests/retry_test.go
📚 Learning: 2025-08-28T10:07:27.650Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Applied to files:
router-tests/retry_test.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.
Applied to files:
router-tests/retry_test.go
🧬 Code graph analysis (8)
router/internal/circuit/breaker.go (1)
router/internal/circuit/manager.go (1)
Manager(29-35)
router/internal/traceclient/traceclient.go (2)
router/pkg/metric/connection_metric_store.go (1)
ConnectionMetricStore(25-29)router/internal/expr/expr.go (4)
Context(35-39)Request(66-75)Subgraph(141-146)ClientTrace(135-138)
router/internal/retrytransport/retry_transport_test.go (3)
router/internal/retrytransport/manager.go (4)
OnRetryFunc(16-16)RetryOptions(24-31)Manager(33-39)NewManager(41-48)router/internal/expr/retry_expression.go (1)
NewRetryExpressionManager(19-23)router/internal/retrytransport/retry_transport.go (1)
NewRetryHTTPTransport(76-83)
router-tests/websocket_test.go (2)
router/core/router.go (2)
NewSubgraphRetryOptions(1933-1947)WithSubgraphRetryOptions(1801-1805)router/pkg/config/config.go (3)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)
router-tests/circuit_breaker_test.go (2)
router-tests/testenv/testenv.go (10)
Run(107-124)Config(286-342)LogObservationConfig(388-391)SubgraphsConfig(367-380)SubgraphConfig(382-386)WSReadMessage(2656-2697)WSWriteMessage(2742-2783)Environment(1729-1765)WSWriteJSON(2785-2826)WebSocketMessage(2281-2285)router/pkg/config/config.go (2)
Config(987-1061)EngineExecutionConfiguration(371-400)
router-tests/retry_test.go (4)
router/core/router.go (3)
NewSubgraphRetryOptions(1933-1947)WithSubgraphRetryOptions(1801-1805)Option(172-172)router/pkg/config/config.go (4)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)Config(987-1061)router/internal/retrytransport/manager.go (1)
OnRetryFunc(16-16)router-tests/testenv/testenv.go (6)
Run(107-124)RunWithError(137-152)Environment(1729-1765)WSReadMessage(2656-2697)WSWriteMessage(2742-2783)WSWriteJSON(2785-2826)
router/internal/retrytransport/retry_transport.go (2)
router/internal/retrytransport/manager.go (1)
Manager(33-39)router/internal/expr/expr.go (1)
Request(66-75)
router/core/transport.go (4)
router/internal/retrytransport/manager.go (1)
Manager(33-39)router/internal/traceclient/traceclient.go (1)
NewTraceInjectingRoundTripper(41-53)router/internal/circuit/breaker.go (1)
NewCircuitTripper(18-25)router/internal/retrytransport/retry_transport.go (1)
NewRetryHTTPTransport(76-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: image_scan
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image
- GitHub Check: integration_test (./events)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_test
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (13)
router-tests/websocket_test.go (1)
749-749: LGTM: exercising retry options in WS header/params forwarding test.Good to include retry wiring here to catch regressions in the WS upgrade path. Please confirm option ordering with WithHeaderRules is order‑insensitive in RouterOptions.
router/internal/traceclient/traceclient.go (1)
109-116: Confirmed — constructor is always passed non-nil callbacks.
Only call site is router/core/transport.go:113 — it passes the locally defined getExprContext and getActiveSubgraphName (defined immediately above), so nil checks are unnecessary.router/core/transport.go (7)
93-104: getActiveSubgraphName helper: LGTM.Gracefully handles missing context and subgraph; consistent with downstream uses.
105-114: Trace-injecting RT wiring: LGTM.Nil req-context returns empty expr.Context; signature matches traceclient.NewTraceInjectingRoundTripper.
116-118: Circuit breaker enriched with subgraph name: LGTM.Parameter order matches internal API; no behavior concerns.
361-366: TransportFactory gains retryManager field: LGTM.Plumbs manager through the factory cleanly.
403-405: NewTransport copies RetryManager: LGTM.
449-451: RoundTripper passes retryManager into NewCustomTransport: LGTM.
73-75: Approve: signature change to accept retryManager.
Only occurrences are the constructor and the single call at router/core/transport.go:443 — call site updated to pass retryManager.router-tests/retry_test.go (4)
17-24: Helper is concurrency-safe and minimal. LGTM.Captures retry count and duration without dereferencing resp/err; safe for nils.
39-52: Good adoption of NewSubgraphRetryOptions + WithSubgraphRetryOptions.The builder-style opts + OnRetryFunc wiring is clear and consistent.
709-741: Per‑subgraph attempts are validated cleanly. LGTM.Isolates employees/test1 and checks attempts = retries + 1.
787-818: Mixed All + per‑subgraph inheritance is covered. LGTM.Confirms global defaults apply to employees and per‑subgraph override for test1.
Also applies to: 820-863
de80950 to
f4090fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)
134-136: Honor context cancellation during backoff sleep.Uninterruptible time.Sleep blocks cancellation. Use a timer and select on req.Context().Done().
- // Wait for the specified duration - time.Sleep(sleepDuration) + // Wait for the specified duration or until the request is canceled + if sleepDuration > 0 { + timer := time.NewTimer(sleepDuration) + select { + case <-req.Context().Done(): + timer.Stop() + return resp, req.Context().Err() + case <-timer.C: + } + }
♻️ Duplicate comments (3)
router-tests/retry_test.go (2)
534-559: LGTM and consistent with team guidance: disabled retries ignore algorithm.Matches prior guidance that algorithm is not validated when Enabled=false; good to lock with a test.
561-588: LGTM: per-subgraph “disabled ignores algorithm” case covered.Mirrors the base case; prevents regressions.
router/internal/retrytransport/manager.go (1)
50-118: Make Initialize thread-safe and avoid stale/partial state (atomic map swap).Initialize builds/assigns m.retries without any write lock, while readers use RLock. Also, repeated Initialize calls will leak stale entries (e.g., a subgraph disabled/removed later remains present). Build a new map, validate/fill it, then swap under a write lock.
Apply:
func (m *Manager) Initialize(baseRetryOptions RetryOptions, subgraphRetryOptions map[string]RetryOptions, routerConfig *nodev1.RouterConfig) error { + // Build into a fresh map to avoid partial/dirty reads and stale entries. + nextRetries := make(map[string]*RetryOptions) // Get the list of all subgraph AND feature subgraphs subgraphNameSet := make(map[string]bool, len(routerConfig.Subgraphs)) @@ - // Only assign default options if validation succeeds - for _, sgName := range defaultSgNames { - opts := baseRetryOptions - m.retries[sgName] = &opts - } + // Only assign default options if validation succeeds + for _, sgName := range defaultSgNames { + opts := baseRetryOptions // copy + nextRetries[sgName] = &opts + } } } } @@ - for _, sgName := range customSgNames { - entry, ok := subgraphRetryOptions[sgName] - if !ok { - return fmt.Errorf("failed to add base retry expression: %s", sgName) - } + for _, sgName := range customSgNames { + entry := subgraphRetryOptions[sgName] @@ - opts := entry - m.retries[sgName] = &opts + opts := entry // copy + nextRetries[sgName] = &opts } - return nil + // Atomic publish + m.lock.Lock() + m.retries = nextRetries + m.lock.Unlock() + return nil }
🧹 Nitpick comments (14)
router-tests/retry_test.go (5)
325-337: Fix flaky lower-bound assertion in “max duration is not exceeded” test.The lower-bound check assumes near‑max jitter per attempt; with full jitter it can be much smaller, causing flakes. Assert an upper bound against (retries-1)*MaxDuration instead.
Apply this diff:
- // We reduce by 100 for any jitter - expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond) - require.GreaterOrEqual(t, requestDuration, expectedMinDuration) + // Each sleep must not exceed MaxDuration; total must be <= (retries-1)*MaxDuration (+slack for processing) + maxAllowed := (time.Duration(maxRetryCount-1) * maxDuration) + (150 * time.Millisecond) + require.LessOrEqual(t, requestDuration, maxAllowed)Also applies to: 367-373
392-405: Strengthen jitter assertions for 429 without Retry-After.NotEqual to the base interval is fragile. Assert the jittered sleep is >0 and <= interval.
Apply this diff:
- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval)) + require.Greater(t, time.Duration(sleepDuration.Load()), time.Duration(0)) + require.LessOrEqual(t, time.Duration(sleepDuration.Load()), retryInterval)Also applies to: 434-435
452-465: Mirror stronger jitter assertions for zero Retry-After case.Same rationale as above; assert 0 < sleep <= interval.
Apply this diff:
- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval)) + require.Greater(t, time.Duration(sleepDuration.Load()), time.Duration(0)) + require.LessOrEqual(t, time.Duration(sleepDuration.Load()), retryInterval)Also applies to: 495-496
931-979: Rename duplicate test title to avoid confusion.Two subtests share the exact name “verify retries are applied when only all and no subgraph specific overrides are present”. Rename this one to reflect the FF scenario (e.g., “... under feature‑flagged subgraph (products_fg)”).
Apply this diff:
-t.Run("verify retries are applied when only all and no subgraph specific overrides are present", func(t *testing.T) { +t.Run("verify retries are applied with only 'all' config under feature-flagged subgraph (products_fg)", func(t *testing.T) {
865-929: Add an inheritance test: partial per-subgraph override should merge with All.We need to assert that unspecified per-subgraph fields inherit from All (e.g., Expression/Interval), as requested in review. Add a focused test that overrides only MaxAttempts for employees while inheriting Expression from All.
Apply this addition near this block:
+ t.Run("per-subgraph override inherits unspecified fields from all (expression + interval)", func(t *testing.T) { + t.Parallel() + + employeesCalls := atomic.Int32{} + test1Calls := atomic.Int32{} + + // All enables retries with expression targeting 502; employees overrides only MaxAttempts. + allMax := 4 + empMax := 2 + opts := core.NewSubgraphRetryOptions(config.TrafficShapingRules{ + All: config.GlobalSubgraphRequestRule{ + BackoffJitterRetry: config.BackoffJitterRetry{ + Enabled: true, + MaxAttempts: allMax, + MaxDuration: 2 * time.Second, + Interval: 10 * time.Millisecond, + Expression: "statusCode == 502", + }, + }, + Subgraphs: map[string]config.GlobalSubgraphRequestRule{ + "employees": { + BackoffJitterRetry: config.BackoffJitterRetry{ + Enabled: true, + MaxAttempts: empMax, // inherit Expression and Interval from All + }, + }, + }, + }) + options := core.WithSubgraphRetryOptions(opts) + + testenv.Run(t, &testenv.Config{ + NoRetryClient: true, + RouterOptions: []core.Option{options}, + Subgraphs: testenv.SubgraphsConfig{ + Employees: testenv.SubgraphConfig{ + Middleware: func(_ http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + w.WriteHeader(http.StatusBadGateway) // 502 matches All.Expression + employeesCalls.Add(1) + }) + }, + }, + Test1: testenv.SubgraphConfig{ + Middleware: func(_ http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + w.WriteHeader(http.StatusBadGateway) + test1Calls.Add(1) + }) + }, + }, + }, + }, func(t *testing.T, xEnv *testenv.Environment) { + // employees inherits Expression/Interval, but uses empMax attempts + resEmp, err := xEnv.MakeGraphQLRequest(testenv.GraphQLRequest{ + Query: `query employees { employees { id } }`, + }) + require.NoError(t, err) + require.Contains(t, resEmp.Body, `"statusCode":502`) + require.Equal(t, int32(empMax+1), employeesCalls.Load()) + + // test1 uses All config entirely + resTest1, err := xEnv.MakeGraphQLRequest(testenv.GraphQLRequest{ + Query: `query { floatField(arg: 1.5) }`, + }) + require.NoError(t, err) + require.Contains(t, resTest1.Body, `"statusCode":502`) + require.Equal(t, int32(allMax+1), test1Calls.Load()) + }) + })router/internal/retrytransport/manager.go (4)
102-104: Fix wrong algorithm value in error message.The check uses entry.Algorithm, but the error prints baseRetryOptions.Algorithm.
- return fmt.Errorf("unsupported retry algorithm for subgraph %s: %s", sgName, baseRetryOptions.Algorithm) + return fmt.Errorf("unsupported retry algorithm for subgraph %s: %s", sgName, entry.Algorithm)
97-101: Remove redundant map existence check and incorrect error text.customSgNames is derived from subgraphRetryOptions, so ok is guaranteed true. The current error text is also misleading.
- entry, ok := subgraphRetryOptions[sgName] - if !ok { - return fmt.Errorf("failed to add base retry expression: %s", sgName) - } + entry := subgraphRetryOptions[sgName]
50-62: Precondition: routerConfig must be non-nil.Function dereferences routerConfig; if it's ever nil (e.g., during hot‑reload failure), this panics. If callers guarantee non‑nil, document it; otherwise guard early.
Would you like me to add a fast precondition check and unit test?
120-132: Avoid exposing internal pointers or document immutability.GetSubgraphOptions returns a pointer from internal state. Either return a copy or clearly document that callers must treat the returned options as read‑only.
I can switch this to return a value (copy) with negligible overhead.
router/internal/retrytransport/retry_transport.go (5)
58-61: Support Retry-After on 503 (Service Unavailable) as well.RFC allows Retry‑After with 503 in addition to 429.
- if resp == nil || resp.StatusCode != http.StatusTooManyRequests { + if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) { return 0, false }
49-52: Lower log level for malformed Retry-After.Parsing failures are client‑visible only when we intend to honor it; Debug is more appropriate to avoid noisy Error logs.
- logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin)) + logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
92-98: Guard nil getActiveSubgraph to avoid panic; default to empty name.- activeSubgraph := rt.getActiveSubgraph(req) + activeSubgraph := "" + if rt.getActiveSubgraph != nil { + activeSubgraph = rt.getActiveSubgraph(req) + } retryOptions := rt.retryManager.GetSubgraphOptions(activeSubgraph)
103-104: Default to a Nop logger when getRequestLogger is nil or returns nil.Prevents panics and keeps logs safe.
- requestLogger := rt.getRequestLogger(req) + requestLogger := zap.NewNop() + if rt.getRequestLogger != nil { + if l := rt.getRequestLogger(req); l != nil { + requestLogger = l + } + }
76-82: Provide a default RoundTripper when nil is passed.Avoid nil deref if callers forget to set it.
func NewRetryHTTPTransport(roundTripper http.RoundTripper, getRequestLogger requestLoggerGetter, retryManager *Manager, getActiveSubgraph func(req *http.Request) string) *RetryHTTPTransport { + if roundTripper == nil { + roundTripper = http.DefaultTransport + } return &RetryHTTPTransport{
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
router-tests/retry_test.go(11 hunks)router/core/graph_server.go(4 hunks)router/internal/retrytransport/manager.go(1 hunks)router/internal/retrytransport/retry_transport.go(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- router/core/graph_server.go
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.
Applied to files:
router/internal/retrytransport/manager.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-16T16:00:04.078Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/transport.go:0-0
Timestamp: 2025-09-16T16:00:04.078Z
Learning: The IsEnabled() method on *retrytransport.Manager in router/core/transport.go safely handles nil receivers by returning false, so no external nil checks are needed when calling retryManager.IsEnabled().
Applied to files:
router/internal/retrytransport/manager.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/internal/retrytransport/manager.gorouter-tests/retry_test.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.
Applied to files:
router/internal/retrytransport/manager.gorouter-tests/retry_test.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T10:07:27.650Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Applied to files:
router/internal/retrytransport/manager.gorouter-tests/retry_test.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().
Applied to files:
router/internal/retrytransport/manager.gorouter-tests/retry_test.gorouter/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.
Applied to files:
router-tests/retry_test.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.
Applied to files:
router-tests/retry_test.go
🧬 Code graph analysis (3)
router/internal/retrytransport/manager.go (2)
router/internal/expr/expr.go (1)
Request(66-75)router/internal/expr/retry_expression.go (1)
RetryExpressionManager(12-14)
router-tests/retry_test.go (4)
router/core/router.go (3)
NewSubgraphRetryOptions(1933-1947)WithSubgraphRetryOptions(1801-1805)Option(172-172)router/pkg/config/config.go (4)
TrafficShapingRules(164-171)GlobalSubgraphRequestRule(190-207)BackoffJitterRetry(231-238)Config(987-1061)router/internal/retrytransport/manager.go (1)
OnRetryFunc(16-16)router-tests/testenv/testenv.go (3)
Run(107-124)RunWithError(137-152)Environment(1729-1765)
router/internal/retrytransport/retry_transport.go (1)
router/internal/retrytransport/manager.go (1)
Manager(33-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: image_scan
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./events)
- GitHub Check: build_test
- GitHub Check: build_test
- GitHub Check: Analyze (go)
🔇 Additional comments (18)
router-tests/retry_test.go (14)
39-51: LGTM: base retry config wiring is correct.Using NewSubgraphRetryOptions + WithSubgraphRetryOptions and OnRetryFunc in the mutation test is correct and aligns with the new API.
93-106: LGTM: expression=false disables retries as expected.Solid check that no retries occur when the expression evaluates to false.
146-158: LGTM: “all-fail” path counts are correct.Assertions for retry count and service calls (retries+1) are accurate.
200-212: LGTM: partial-fail then success scenario.Good use of OnRetry counter to flip behavior at N and verify final success.
261-273: LGTM: Retry-After (seconds) honored.Asserting sleepDuration equals header seconds is a good correctness check.
507-532: LGTM: invalid algorithm at base fails init.Error string check is precise; good.
593-619: LGTM: invalid algorithm per-subgraph fails init.Good negative coverage with clear error expectation.
624-647: LGTM: invalid base expression fails with actionable error.The asserted message is specific; great.
649-677: LGTM: invalid per-subgraph expression fails with subgraph context.Clear error surface including subgraph name.
682-707: LGTM: valid algorithm/expression happy-path init.Covers the positive path explicitly.
715-785: LGTM: per-subgraph retry counts enforced independently.Accurate expectations for employees vs test1.
796-863: LGTM: mixed All + per-subgraph overrides behave independently.Solid coverage for defaulting employees to All and overriding test1.
873-929: LGTM: All-only defaults applied to multiple subgraphs.Both subgraphs respect the global max attempts.
996-1079: LGTM: upgrade path retries fixed and validated.Good regression test for the websocket upgrade path; the counter expectation (failedTries+1) is spot on.
router/internal/retrytransport/manager.go (2)
134-143: LGTM: IsEnabled implementation is safe and clear.Simple length check under RLock is fine.
145-150: LGTM: Retry() nil-guard.Returning false when retryFunc is nil prevents panics.
router/internal/retrytransport/retry_transport.go (2)
85-91: LGTM: single-attempt fast path before retry evaluation.Early return on clean 2xx is consistent with your stated design.
153-171: LGTM: connection reuse via body drain is correct.The drain+close pattern is appropriate for keep‑alive reuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/expr/retry_expression.go (1)
36-40: Default expression helpers likely not visible to the VM (pointer env + run with &ctx).Your default expression calls IsRetryableStatusCode()/IsConnectionError()/IsTimeout(), but the compile options only expose RetryContext as a value. If those helpers are pointer-receiver methods on RetryContext (common), they won’t be visible. Compile will fail or the functions will be unresolved.
- Use a pointer env in expr.Env.
- Run with &ctx to match the env.
- If helpers aren’t methods, register them via expr.Function(...) during compile.
Apply this diff:
@@ - options := []expr.Option{ - expr.Env(RetryContext{}), - expr.AsKind(reflect.Bool), - } + options := []expr.Option{ + // Pointer env so pointer‑receiver helpers on RetryContext are available. + expr.Env(&RetryContext{}), + expr.AsKind(reflect.Bool), + // If helpers are not methods on RetryContext, register them explicitly here via expr.Function(...). + } @@ - result, err := expr.Run(program, ctx) + // Run with pointer env to match expr.Env(&RetryContext{}). + result, err := expr.Run(program, &ctx)Also applies to: 69-69, 16-16
♻️ Duplicate comments (1)
router/core/transport.go (1)
120-122: Relying on nil-safe IsEnabled() is fine here.Acknowledged team convention that (*retrytransport.Manager).IsEnabled() handles nil receivers by returning false.
🧹 Nitpick comments (2)
router/internal/traceclient/traceclient.go (1)
117-122: Avoid emitting empty subgraph attribute.Skip WgSubgraphName when getActiveSubgraphName(req) returns "". Keeps metrics cardinality cleaner.
Apply:
- serverAttributes = append( - serverAttributes, - rotel.WgClientReusedConnection.Bool(trace.ConnectionAcquired.Reused), - rotel.WgSubgraphName.String(subgraph), - ) + serverAttributes = append( + serverAttributes, + rotel.WgClientReusedConnection.Bool(trace.ConnectionAcquired.Reused), + ) + if subgraph != "" { + serverAttributes = append(serverAttributes, rotel.WgSubgraphName.String(subgraph)) + }router/internal/expr/retry_expression.go (1)
25-31: Normalize keys; comment says “normalized” but no normalization is done.Trim whitespace so logically identical expressions dedupe to the same key.
Apply this diff:
@@ -func (m *RetryExpressionManager) AddExpression(exprString string) error { - expression := exprString +import ( + "fmt" + "reflect" + "strings" + "sync" +@@ +) +@@ +func (m *RetryExpressionManager) AddExpression(exprString string) error { + expression := strings.TrimSpace(exprString) @@ - // Use the normalized expression string as the key for deduplication + // Use the trimmed expression string as the key for deduplication m.expressionMap[expression] = programAlso trim in ShouldRetry:
- expression := expressionString + expression := strings.TrimSpace(expressionString)Also applies to: 47-49
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
router-tests/circuit_breaker_test.go(2 hunks)router-tests/retry_test.go(11 hunks)router/core/transport.go(6 hunks)router/internal/expr/retry_expression.go(2 hunks)router/internal/retrytransport/manager.go(1 hunks)router/internal/retrytransport/retry_transport_test.go(18 hunks)router/internal/traceclient/traceclient.go(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- router-tests/circuit_breaker_test.go
- router/internal/retrytransport/manager.go
- router/internal/retrytransport/retry_transport_test.go
- router-tests/retry_test.go
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.
Applied to files:
router/internal/traceclient/traceclient.go
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.
Applied to files:
router/core/transport.go
📚 Learning: 2025-09-16T16:00:04.078Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/transport.go:0-0
Timestamp: 2025-09-16T16:00:04.078Z
Learning: The IsEnabled() method on *retrytransport.Manager in router/core/transport.go safely handles nil receivers by returning false, so no external nil checks are needed when calling retryManager.IsEnabled().
Applied to files:
router/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.
Applied to files:
router/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().
Applied to files:
router/core/transport.go
🧬 Code graph analysis (3)
router/internal/traceclient/traceclient.go (2)
router/pkg/metric/connection_metric_store.go (1)
ConnectionMetricStore(25-29)router/internal/expr/expr.go (4)
Context(35-39)Request(66-75)Subgraph(141-146)ClientTrace(135-138)
router/core/transport.go (4)
router/internal/retrytransport/manager.go (1)
Manager(33-39)router/internal/traceclient/traceclient.go (1)
NewTraceInjectingRoundTripper(41-53)router/internal/circuit/breaker.go (1)
NewCircuitTripper(17-24)router/internal/retrytransport/retry_transport.go (1)
NewRetryHTTPTransport(76-83)
router/internal/expr/retry_expression.go (1)
router/internal/expr/retry_context.go (1)
RetryContext(13-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_test
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (10)
router/internal/traceclient/traceclient.go (2)
35-38: Good DI split: expr context vs. subgraph name.Separating these callbacks makes the round-tripper reusable and easier to test. Constructor wiring looks consistent.
Also applies to: 41-46, 48-52
109-114: Confirm expr path: Subgraph.Request.ClientTrace.Please verify this field path matches expr.Context schema used elsewhere (some code paths historically used Request.Trace.*). If this struct layout changed in this PR, ensure all evaluators reference the new location.
router/core/transport.go (7)
85-91: No-op logger fallback is sensible.Returning zap.NewNop() when request context is absent avoids panics in downstream wrappers.
106-114: Expr context getter: safe default.Returning &expr.Context{} when req context is nil avoids nil deref in the trace round-tripper while keeping behavior predictable.
116-118: Circuit tripper now subgraph-aware.Passing getActiveSubgraphName aligns breaker metrics/keys with per-subgraph behavior.
361-383: Options/Factory wiring for RetryManager looks consistent.Field propagation is straightforward; no extraneous state carried over from the removed RetryOptions.
403-405: Factory initialization passes RetryManager correctly.No concerns.
449-451: Transport constructed with RetryManager (and subgraph-aware wrappers).End-to-end wiring looks correct.
93-104: Subgraph name resolver — verify Upgrade/WS/SSE pathsAuto-verification failed: router-tests/tests not found. Confirm ActiveSubgraph(req) is populated for Upgrade (WebSocket) and SSE (text/event-stream) requests so circuit-breaker/retry get a meaningful subgraph; otherwise add a fallback (e.g., from engine hooks or a request-context key). File: router/core/transport.go:93-104
router/internal/expr/retry_expression.go (1)
25-31: No change needed — AddExpression only called for enabled retries.Initialize() in router/internal/retrytransport/manager.go filters subgraphs into defaultSgNames/customSgNames and only calls m.exprManager.AddExpression(...) when baseRetryOptions.Enabled or entry.Enabled, so disabled subgraphs won’t trigger compilation of the defaultRetryExpression.
|
@coderabbitai pause |
✅ Actions performedReviews paused. |
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
|
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
|
@StarpTech Could I get a re-review (remaining on my side to fix circuit breaker test) |
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
|
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Fixes ENG-7966
This PR adds the option for users to set retry settings per subgraph. If a subgraph retry is specified for all, it will be used by default, with any subgraph overrides either disabling or changing the settings per subgraph. Example configuration
This PR also adds a fix for upgrade requests for circuit breakers, as the engine hooks context key with the subgraph name is not set on upgrade requests.
Summary by CodeRabbit
New Features
Refactor
Tests
Checklist