Skip to content

Conversation

@SkArchon
Copy link
Contributor

@SkArchon SkArchon commented Aug 27, 2025

Fixes ENG-7966

This PR adds the option for users to set retry settings per subgraph. If a subgraph retry is specified for all, it will be used by default, with any subgraph overrides either disabling or changing the settings per subgraph. Example configuration

traffic_shaping:
  all:
    retry:
      max_attempts: 5
  subgraphs:
    employee:
      retry:
        interval: 5s
        max_attempts: 7

This PR also adds a fix for upgrade requests for circuit breakers, as the engine hooks context key with the subgraph name is not set on upgrade requests.

Summary by CodeRabbit

  • New Features

    • Per-subgraph retry controls with global defaults, expression-driven policies, Retry-After handling, and retry callbacks; WebSocket/upgrade paths now exercise retry behavior.
  • Refactor

    • Centralized retry manager and expression manager; retry logic unified across transport, executor, circuit breaker and tracing with a builder-style configuration API and per-subgraph awareness.
  • Tests

    • Expanded coverage for per-subgraph retries, expression evaluation, Retry-After variants, retry counting, panic/error propagation, and circuit-breaker upgrade flows (one test duplicated).

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.

@coderabbitai
Copy link

coderabbitai bot commented Aug 27, 2025

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Replace primitive multi-argument subgraph-retry configuration with a structured, per-subgraph, expression-driven system: add SubgraphRetryOptions/NewSubgraphRetryOptions, a RetryExpressionManager, and an internal retry Manager; wire them into transports, graph server, router config, circuit breaker, traceclient, and tests.

Changes

Cohort / File(s) Summary of changes
Tests: migrate to typed retry options
router-tests/error_handling_test.go, router-tests/panic_test.go, router-tests/structured_logging_test.go, router-tests/telemetry/telemetry_test.go
Replace multi-arg WithSubgraphRetryOptions(...) calls with core.NewSubgraphRetryOptions(config.TrafficShapingRules{...}) and WithSubgraphRetryOptions(opts); use structured TrafficShapingRules (e.g., BackoffJitterRetry.Enabled=false).
Retry tests: builder usage & per-subgraph coverage
router-tests/retry_test.go
Migrate tests to construct SubgraphRetryOptions via NewSubgraphRetryOptions(...), set OnRetryFunc, pass to WithSubgraphRetryOptions(opts); add TestRetryPerSubgraph exercising global vs per-subgraph rules, invalid-algorithm handling, per-subgraph counters, and websocket subscription cases.
WebSocket & circuit-breaker tests
router-tests/websocket_test.go, router-tests/circuit_breaker_test.go
Inject typed SubgraphRetryOptions into websocket tests; add circuit-breaker WebSocket upgrade-path test (uses github.com/gorilla/websocket) — duplicated test copies present in diff.
Router public API & config
router/core/router.go, router/core/router_config.go, router/core/supervisor_instance.go
Add exported SubgraphRetryOptions (All, SubgraphMap, OnRetryFunc) and IsEnabled(); replace old multi-arg WithSubgraphRetryOptions with WithSubgraphRetryOptions(*SubgraphRetryOptions); add NewSubgraphRetryOptions(cfg config.TrafficShapingRules); Config now stores *SubgraphRetryOptions.
Graph server: manager initialization & injection
router/core/graph_server.go
Create RetryExpressionManager, build retry function with BuildRetryFunction, instantiate and initialize retrytransport.Manager, store on server and pass RetryManager into ExecutorConfigurationBuilder.
Retry builder & tests: expression-manager-driven
router/core/retry_builder.go, router/core/retry_builder_test.go
Add BuildRetryFunction(manager *expr.RetryExpressionManager) returning a retry closure that accepts an expression string; add mutation guard and EOF default-retry logic; update tests to use expression manager and explicit expression parameter.
Transport core: API and wiring changes
router/core/transport.go
Replace retry-options parameters with retryManager *retrytransport.Manager; update NewCustomTransport, TransportFactory, and TransportOptions to accept/store a manager and pass it to NewRetryHTTPTransport.
Expression engine: multi-expression manager
router/internal/expr/retry_expression.go, router/internal/expr/retry_expression_test.go
Replace single compiled program with an expression map; add NewRetryExpressionManager() and AddExpression(exprString) error; change ShouldRetry(ctx, expressionString) signature and update tests to add/evaluate named expressions.
Retry manager: per-subgraph manager implementation
router/internal/retrytransport/manager.go
Add Manager type with per-subgraph RetryOptions, exprManager, ShouldRetry delegate and OnRetry hook; provide NewManager, Initialize, GetSubgraphOptions, IsEnabled, and Retry methods; validate algorithms and register expressions.
Retry HTTP transport & tests: manager-driven runtime logic
router/internal/retrytransport/retry_transport.go, router/internal/retrytransport/retry_transport_test.go
Switch transport to use *Manager for per-subgraph options; add getActiveSubgraph resolution; implement per-subgraph retry loops bounded by MaxRetryCount; add Retry-After parsing/handling and OnRetry hook integration; update constructor to NewRetryHTTPTransport(roundTripper, getRequestLogger, retryManager, getActiveSubgraph) and adapt tests.
Circuit, traceclient, trace-injection callbacks
router/internal/circuit/breaker.go, router/internal/traceclient/traceclient.go
Add getActiveSubgraphName and separate getExprContext callbacks; update constructors and RoundTrip/metrics to use callbacks instead of reading subgraph from request context.
Engine loader & context keys cleanup
router/core/engine_loader_hooks.go, router/internal/context/keys.go
Remove injecting current subgraph into request context in OnLoad; delete CurrentSubgraphContextKey type and remove context-key-based subgraph propagation.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: specify retry configuration overrides per subgraph" is concise and accurately summarizes the PR’s primary change—adding per-subgraph retry override support—using a conventional "feat:" prefix and clear wording that a teammate scanning history will understand.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Aug 27, 2025

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-8189260d47f1d53ff19fd4cd1d17203f1ac72fa5

@SkArchon SkArchon changed the base branch from main to dustin/eng-7974-allow-to-configure-retry-condition August 27, 2025 15:14
Base automatically changed from dustin/eng-7974-allow-to-configure-retry-condition to main August 28, 2025 11:57
@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from f6f2adc to ef2a9a6 Compare September 1, 2025 11:22
@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from 8a7da2f to d86e5cc Compare September 1, 2025 12:53
@SkArchon SkArchon marked this pull request as ready for review September 1, 2025 13:01
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

142-144: Honor context cancellation during backoff sleep.

Abort retries promptly if the request context is canceled.

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration or abort if canceled
+		select {
+		case <-req.Context().Done():
+			return resp, req.Context().Err()
+		case <-time.After(sleepDuration):
+		}
🧹 Nitpick comments (25)
router-tests/structured_logging_test.go (1)

712-712: Consistent migration to object-based retry options in tests

All updated WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(... Enabled: false ...)) calls look correct and keep logs deterministic by avoiding subgraph retries.

If you find yourself disabling retries across many tests, consider a small helper in testenv (e.g., testenv.DisableSubgraphRetriesOption()) to reduce repetition and ease future changes.

Also applies to: 832-832, 964-964, 1100-1100, 2214-2214

router-tests/telemetry/telemetry_test.go (2)

8695-8700: Per-subgraph retry builder usage looks correct; optionally assert “no retries” explicitly

Switching to core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(...)) with Enabled: false matches the new API and aligns with our prior learning that algorithms shouldn’t be validated when retries are disabled. If you want a stronger assertion, consider wiring an OnRetryFunc (if exposed on SubgraphRetryOptions) to increment a counter and assert it remains zero during this test.


9639-9642: Check all datapoints, not just the first, for absence of the subgraph attribute

Guard against future exporter changes that may add more datapoints by asserting across the full set.

-					data2 := httpRequestsMetric.Data.(metricdata.Sum[int64])
-					atts := data2.DataPoints[0].Attributes
-					_, ok := atts.Value(attribute.Key(key))
-					require.False(t, ok)
+					data2 := httpRequestsMetric.Data.(metricdata.Sum[int64])
+					for _, dp := range data2.DataPoints {
+						_, ok := dp.Attributes.Value(attribute.Key(key))
+						require.False(t, ok)
+					}
router-tests/error_handling_test.go (1)

1383-1387: Good migration to manager-based retry config; consider a helper to DRY tests

The switch to core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(...)) with Enabled:false is correct and matches our prior learning that disabled retries shouldn’t validate algorithm/fields. You repeat the same disabled config 4x; a tiny helper keeps tests concise.

Apply this to replace the repeated blocks:

-                core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(config.TrafficShapingRules{
-                    All: config.GlobalSubgraphRequestRule{
-                        BackoffJitterRetry: config.BackoffJitterRetry{Enabled: false},
-                    },
-                })),
+                disabledSubgraphRetries(),

Add this helper near the top of the file (outside the selected ranges):

// test helper
func disabledSubgraphRetries() core.Option {
	return core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(config.TrafficShapingRules{
		All: config.GlobalSubgraphRequestRule{
			BackoffJitterRetry: config.BackoffJitterRetry{Enabled: false},
		},
	}))
}

Also applies to: 1419-1423, 1455-1459, 1491-1495

router/core/graph_server.go (1)

266-287: router/core/graph_server.go: guard nil retryOptions and wrap Initialize error

-    if s.retryOptions.IsEnabled() {
+    if s.retryOptions != nil && s.retryOptions.IsEnabled() {
         retryExprManager := expr.NewRetryExpressionManager()
         // …
         err = retryManager.Initialize(
             s.retryOptions.All,
             s.retryOptions.SubgraphMap,
             routerConfig.GetSubgraphs(),
         )
-        if err != nil {
-            return nil, err
-        }
+        if err != nil {
+            return nil, fmt.Errorf("failed to initialize retry manager: %w", err)
+        }
         s.retryManager = retryManager
     }
router/internal/expr/retry_expression.go (3)

16-23: Constructor is fine; optional: precompile default expression

As a small UX improvement, consider precompiling the default expression during construction so first evaluations don’t depend on prior AddExpression calls.

Example (if you decide to change signature elsewhere to allow error return):

m := &RetryExpressionManager{expressionMap: make(map[string]*vm.Program)}
_ = m.AddExpression("") // ignore error; Initialize can still validate real strings
return m

25-50: Normalize expression key or fix the comment

Comment says “normalized” but no normalization occurs. Either trim/normalize or update the comment.

 import (
     "fmt"
     "reflect"
+    "strings"

     "github.com/expr-lang/expr"
     "github.com/expr-lang/expr/vm"
 )
@@
-func (c *RetryExpressionManager) AddExpression(exprString string) error {
-    expression := exprString
+func (c *RetryExpressionManager) AddExpression(exprString string) error {
+    expression := strings.TrimSpace(exprString)
@@
-    // Use the normalized expression string as the key for deduplication
+    // Use the trimmed expression string as the key for deduplication
     c.expressionMap[expression] = program
     return nil
 }

53-80: Consider lazy compilation on cache miss instead of silently returning false

Fail-closed is safe, but lazily compiling unknown expressions can reduce surprises if Initialize missed one.

 func (m *RetryExpressionManager) ShouldRetry(ctx RetryContext, expressionString string) (bool, error) {
     if m == nil {
         return false, nil
     }

-    expression := expressionString
+    expression := expressionString
     if expression == "" {
         expression = defaultRetryExpression
     }

     program, ok := m.expressionMap[expression]
     if !ok {
-        // If the expression wasn't pre-compiled, do not retry by default
-        return false, nil
+        // Lazily compile the expression once
+        if err := m.AddExpression(expression); err != nil {
+            return false, err
+        }
+        program = m.expressionMap[expression]
     }

If AddExpression can be invoked concurrently in your environment, add an RWMutex around expressionMap accesses.

Would you like me to add a minimal, lock-guarded version?

router/internal/expr/retry_expression_test.go (2)

141-146: Strengthen the empty-expression test

Consider asserting runtime behavior as well (e.g., ShouldRetry returns false) to guard against regressions, not just that AddExpression("") succeeds and the manager is non-nil.

 err := manager.AddExpression("")
 assert.NoError(t, err)
 assert.NotNil(t, manager)
+res, evalErr := manager.ShouldRetry(RetryContext{}, "")
+assert.NoError(t, evalErr)
+assert.False(t, res)

528-551: Parallelize subtests to speed up and isolate failures (optional)

You can mark the top-level test and subtests with t.Parallel() for faster runs; table-driven subtests here are good candidates.

 func TestRetryExpressionManager_WithSyscallErrors(t *testing.T) {
+  t.Parallel()
   t.Run("expression combining status and syscall errors", func(t *testing.T) {
+    t.Parallel()
router/core/router.go (3)

187-206: SubgraphRetryOptions: shape and IsEnabled() look good

Structure and enablement logic are sound; nil map iteration is safe. Minor: consider doc comments on fields for clarity (global “All” vs per-subgraph overrides).


1908-1923: Comment mismatch: “circuit breakers” → “retry options”

The comment under NewSubgraphRetryOptions mentions circuit breakers; it should refer to retry options.

- // Subgraph specific circuit breakers
+ // Subgraph-specific retry options

1924-1937: Default algorithm only when retries are enabled (aligns with team preference)

Per team learning, when retries are disabled we shouldn’t “enforce” algorithm defaults. Guard the defaulting by Enabled to better reflect that stance.

-func newRetryConfig(config config.BackoffJitterRetry) retrytransport.RetryOptions {
-    algorithm := config.Algorithm
-    if algorithm == "" {
-        algorithm = retrytransport.BackoffJitter
-    }
+func newRetryConfig(config config.BackoffJitterRetry) retrytransport.RetryOptions {
+    algorithm := config.Algorithm
+    if config.Enabled && algorithm == "" {
+        algorithm = retrytransport.BackoffJitter
+    }
     return retrytransport.RetryOptions{
         Enabled:       config.Enabled,
         Algorithm:     algorithm,
         MaxRetryCount: config.MaxAttempts,
         MaxDuration:   config.MaxDuration,
         Interval:      config.Interval,
         Expression:    config.Expression,
     }
 }
router/core/retry_builder.go (1)

15-23: Defensive nil check for req (optional)

If ShouldRetry ever gets a nil req from the transport layer, req.Context() will panic. Cheap guard:

-reqContext := getRequestContext(req.Context())
+if req == nil {
+  return false
+}
+reqContext := getRequestContext(req.Context())
router-tests/retry_test.go (1)

76-78: Avoid brittle full-JSON equality in assertions.

These assertions will break on harmless message changes. Prefer checking essential fields (e.g., statusCode) or comparing decoded JSON structures.
Example:

- require.Equal(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)
+ require.JSONEq(t, `{"data":{"employees":null}}`, res.Body)
+ require.Contains(t, res.Body, `"statusCode":502`)

Also applies to: 127-129, 180-182, 296-298, 360-372

router/internal/retrytransport/retry_transport_test.go (2)

268-298: Add a zero-retries guard test.

Add a case with MaxRetryCount=0 to assert no retries occur and OnRetry isn’t called.

Sketch:

t.Run("no retries when MaxRetryCount=0", func(t *testing.T) {
  retries, attempts := 0, 0
  mgr := newTestManager(simpleShouldRetry, func(int, *http.Request, *http.Response, time.Duration, error) { retries++ },
    RetryOptions{Enabled: true, MaxRetryCount: 0, Interval: time.Millisecond, MaxDuration: time.Second}, "sg")
  tr := NewRetryHTTPTransport(&MockTransport{handler: func(*http.Request) (*http.Response, error) {
    attempts++; return nil, errors.New("fail")
  }}, func(*http.Request) *zap.Logger { return zap.NewNop() }, mgr)
  _, err := tr.RoundTrip(withSubgraph(httptest.NewRequest("GET", "http://x", nil), "sg"))
  assert.Error(t, err)
  assert.Equal(t, 0, retries)
  assert.Equal(t, 1, attempts)
})

383-414: Log-string assertions are inherently brittle.

If wording changes, tests fail. Consider matching on a stable key/field or use zap observer to assert fields instead of formatted strings.

router/core/retry_builder_test.go (4)

52-57: Test name doesn’t match behavior.

No “disabled” path is exercised here; the test only builds the function. Rename to “build function returns non-nil with empty manager” or add a real disabled-case.

-t.Run("build function when retry is disabled", func(t *testing.T) {
+t.Run("build function returns non-nil with empty manager", func(t *testing.T) {

176-189: Clarify EOF comment.

Comment says “EOF is now handled at transport layer,” but this test expects true here via default retryable error handling. Tweak the comment to avoid confusion.

- // EOF is now handled at transport layer, not expression
+ // EOF can be retried via default transport-level detection; expression is false here

442-485: Remove stale, commented-out tests.

Dead code makes maintenance harder. Either port to the new manager-based API or delete.

-// TODO: Fix
-//func TestProcessRetryOptions(t *testing.T) {
-//  ... (entire commented block)
-//}
+// (removed legacy ProcessRetryOptions tests; superseded by expression-manager and manager-based flows)

51-57: Add a “no request context” guard test.

BuildRetryFunction returns false when request context is missing. Add a unit test to lock this in.

Sketch:

t.Run("returns false when request context missing", func(t *testing.T) {
  manager := expr.NewRetryExpressionManager()
  _ = manager.AddExpression("true")
  fn, _ := BuildRetryFunction(manager)
  req, _ := http.NewRequest("POST", "http://example.com/graphql", nil) // no ctx attached
  assert.False(t, fn(nil, req, &http.Response{StatusCode: 500}, "true"))
})

Want me to open a follow-up PR adding this?

Also applies to: 139-146

router/internal/retrytransport/manager.go (2)

88-91: Remove unreachable config-not-found branch.

customSgNames is only populated when a subgraph has an entry in subgraphRetryOptions; this branch cannot trigger.

-		entry, ok := subgraphRetryOptions[sgName]
-		if !ok {
-			joinErr = errors.Join(joinErr, errors.New("retry config not found for subgraph "+sgName))
-			continue
-		}
+		entry := subgraphRetryOptions[sgName]

118-121: Rename local variable for clarity.

Avoid “circuitBreaker” in retry code.

-	if circuitBreaker, ok := c.retries[name]; ok {
-		return circuitBreaker
+	if opts, ok := c.retries[name]; ok {
+		return opts
router/internal/retrytransport/retry_transport.go (2)

145-151: Option: drain before sleeping to maximize connection reuse.

Draining immediately frees the connection for reuse while we wait.

-		// drain the previous response before retrying
-		rt.drainBody(resp, requestLogger)
-
-		// Retry the request
-		resp, err = rt.roundTripper.RoundTrip(req)
+		// drain the previous response before waiting/retrying
+		rt.drainBody(resp, requestLogger)
+		// Retry the request
+		resp, err = rt.roundTripper.RoundTrip(req)

50-56: Log level nit: malformed Retry-After.

Consider Warn instead of Error to reduce noise for client-driven header issues.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 67c417f and d86e5cc.

📒 Files selected for processing (17)
  • router-tests/error_handling_test.go (4 hunks)
  • router-tests/panic_test.go (2 hunks)
  • router-tests/retry_test.go (10 hunks)
  • router-tests/structured_logging_test.go (5 hunks)
  • router-tests/telemetry/telemetry_test.go (2 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/core/retry_builder.go (2 hunks)
  • router/core/retry_builder_test.go (15 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (2 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/core/transport.go (6 hunks)
  • router/internal/expr/retry_expression.go (2 hunks)
  • router/internal/expr/retry_expression_test.go (2 hunks)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
  • router/internal/retrytransport/retry_transport_test.go (33 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/core/supervisor_instance.go
  • router/core/router.go
  • router-tests/telemetry/telemetry_test.go
  • router/core/router_config.go
  • router-tests/retry_test.go
  • router-tests/error_handling_test.go
  • router-tests/structured_logging_test.go
  • router-tests/panic_test.go
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router/core/supervisor_instance.go
  • router/core/router.go
  • router-tests/retry_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
  • router-tests/error_handling_test.go
  • router-tests/structured_logging_test.go
  • router-tests/panic_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/internal/expr/retry_expression_test.go
  • router/core/retry_builder_test.go
  • router/core/retry_builder.go
📚 Learning: 2025-08-28T09:59:19.956Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.

Applied to files:

  • router/core/retry_builder.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/internal/retrytransport/retry_transport_test.go
🧬 Code graph analysis (17)
router/core/supervisor_instance.go (1)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/core/router.go (2)
router/internal/retrytransport/manager.go (3)
  • RetryOptions (25-32)
  • OnRetryFunc (17-17)
  • BackoffJitter (22-22)
router/pkg/config/config.go (2)
  • TrafficShapingRules (164-171)
  • BackoffJitterRetry (231-238)
router-tests/telemetry/telemetry_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/internal/expr/retry_expression_test.go (2)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (132-146)
router/core/router_config.go (1)
router/core/router.go (1)
  • SubgraphRetryOptions (187-191)
router/internal/expr/retry_expression.go (1)
router/internal/expr/retry_context.go (1)
  • RetryContext (13-19)
router-tests/retry_test.go (4)
router/core/router.go (3)
  • NewSubgraphRetryOptions (1908-1922)
  • WithSubgraphRetryOptions (1776-1780)
  • Option (172-172)
router/pkg/config/config.go (4)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
  • Config (985-1059)
router/internal/retrytransport/manager.go (1)
  • OnRetryFunc (17-17)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • RunWithError (135-150)
  • Environment (1727-1763)
  • SubgraphsConfig (365-378)
router-tests/error_handling_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/core/graph_server.go (4)
router/internal/retrytransport/manager.go (3)
  • Manager (34-40)
  • NewManager (42-49)
  • OnRetryFunc (17-17)
router/internal/expr/expr_manager.go (1)
  • Manager (15-17)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/core/retry_builder.go (1)
  • BuildRetryFunction (13-46)
router-tests/structured_logging_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/internal/retrytransport/manager.go (2)
router/internal/expr/expr.go (1)
  • Request (65-74)
router/internal/expr/retry_expression.go (1)
  • RetryExpressionManager (12-14)
router/core/retry_builder_test.go (3)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/core/retry_builder.go (1)
  • BuildRetryFunction (13-46)
router/core/context.go (2)
  • OperationTypeQuery (488-488)
  • OperationTypeMutation (489-489)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
router/core/retry_builder.go (2)
router/internal/expr/retry_expression.go (1)
  • RetryExpressionManager (12-14)
router/internal/retrytransport/manager.go (1)
  • ShouldRetryFunc (16-16)
router/internal/retrytransport/retry_transport_test.go (5)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
router/internal/retrytransport/manager.go (4)
  • OnRetryFunc (17-17)
  • RetryOptions (25-32)
  • Manager (34-40)
  • NewManager (42-49)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (77-87)
router-tests/panic_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/core/transport.go (2)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (77-87)
🔇 Additional comments (23)
router/core/router_config.go (1)

79-79: Switch to pointer: ensure initialization path or guard against nil

Changing retryOptions to *SubgraphRetryOptions is fine, but it increases the chance of nil deref at use sites unless always initialized.

Please confirm that core.Config.retryOptions is assigned during router setup. Currently, WithSubgraphRetryOptions sets r.retryOptions (see router/core/router.go), but I don’t see where c.retryOptions (this struct) is populated. If not populated, prefer guarding in Usage() (see below) or setting it alongside the router field.

router/core/supervisor_instance.go (1)

196-197: Guard disabled subgraph retries before defaulting algorithm
In NewSubgraphRetryOptions (router/core/router.go ¶1909–1921), you always call newRetryConfig for each subgraph, which unconditionally defaults an empty algorithm—even when Enabled==false. Per our retry-disabled rule, skip both the call and algorithm default for subgraphs with v.BackoffJitterRetry.Enabled==false. Wrap the loop entry in an if v.BackoffJitterRetry.Enabled check so disabled rules retain an empty algorithm.

⛔ Skipped due to learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.
router-tests/panic_test.go (1)

51-55: Good: retries explicitly disabled for panic-path stability

Switching to core.NewSubgraphRetryOptions(... Enabled: false ...) ensures the router won’t add subgraph retries that could mask panic timing in these tests. Matches the disabled-validation expectation.

Also applies to: 87-91

router/core/graph_server.go (3)

43-43: Import looks correct

retrytransport is used below; import is appropriate.


104-105: Field addition LGTM

Storing the retry manager on graphServer is fine and consistent with how other managers (circuit) are handled.


1208-1209: Wiring RetryManager into transport options is correct

ExecutorConfigurationBuilder receives the manager; assuming transport handles nil, this is safe.

If not already covered in tests, add one integration that omits WithSubgraphRetryOptions entirely to ensure nil RetryManager is handled gracefully.

router/internal/expr/retry_expression_test.go (2)

125-135: Manager-based API usage looks correct

Instantiating the manager, pre-registering expressions with AddExpression, and evaluating via ShouldRetry(ctx, expr) matches the new API. Good assertions on compile-time errors for invalid expressions.


496-525: Good syscall coverage with the manager path

Nice coverage for ECONNREFUSED/ECONNRESET/ETIMEDOUT under the manager-evaluated expression. This ensures helper wiring remains intact after the API shift.

router/core/router.go (1)

1776-1779: Confirmed retryOptions field exists on Router.Config
Router embeds Config, which declares retryOptions *SubgraphRetryOptions, so the WithSubgraphRetryOptions setter compiles as expected.

router/core/retry_builder.go (1)

55-58: EOF-as-default-retryable matches stated team preference

Keeping this simple with string matching for “unexpected eof” is consistent with prior guidance to wait for customer feedback before broadening detection.

router-tests/retry_test.go (4)

38-51: Good adoption of builder-style retry options.

Switching to NewSubgraphRetryOptions + WithSubgraphRetryOptions makes tests clearer and mirrors the new API.


247-306: Exact sleep equality on Retry-After is OK here.

Asserting sleepDuration equals header seconds is deterministic since jitter shouldn’t apply when Retry-After is honored.


649-674: Great coverage of mixed global + per-subgraph overrides.

Validates default-from-All and targeted override behavior cleanly.

Also applies to: 676-719


363-371: Unable to run flakiness check in sandbox
The go test invocation failed due to missing Go module context. Please run locally and confirm stability:

go test ./router-tests -run 'TestFlakyRetry/verify max duration is not exceeded on intervals$' -count=50 -v

If you observe intermittent failures, remove t.Parallel() from this subtest or widen the timing tolerances around the expected duration.

router/internal/retrytransport/retry_transport_test.go (3)

58-61: withSubgraph helper looks correct.

Uses the typed context key; safe and explicit.


204-231: Solid body-draining tests.

Covers read and close failure paths with log assertions. Good observability checks.

Also applies to: 300-357


483-538: Thorough 429 coverage (seconds, HTTP-date, invalid, negative, mixed).

Great breadth; confirms shouldUseRetryAfter behavior and fallbacks.

Also applies to: 540-596, 598-636, 638-697, 699-741, 742-782, 784-863

router/core/retry_builder_test.go (3)

60-85: Default expression behavior assertions look good.

Covers 5xx and default network errors.


241-270: Mutation short-circuit behavior well covered.

Ensures expressions can’t override the “no retry on mutations” rule.


390-439: Comprehensive positive matrix + mutation negative matrix.

Good end-to-end validation of the expression path.

router/internal/retrytransport/manager.go (1)

61-69: LGTM on skipping algorithm validation when retries are disabled.

Disabled per-subgraph entries are excluded from validation, aligning with the prior decision captured in learnings.

router/core/transport.go (2)

105-107: Wiring to Manager-based retry looks correct.

Conditional wrapping with NewRetryHTTPTransport is clean and preserves the base round tripper path when disabled.


73-75: All call sites updated. Verified NewCustomTransport/NewRetryHTTPTransport and RetryManager usage in router/core/transport.go (lines 106, 367, 388, 428) and router/core/graph_server.go (line 1208).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

145-151: Rewind request body before retrying; otherwise subsequent attempts may send an empty/partial body.

GraphQL subgraph calls are typically POST with bodies. Without resetting Body via GetBody, retries can be broken.

 		// drain the previous response before retrying
 		rt.drainBody(resp, requestLogger)
 
 		// Retry the request
-		resp, err = rt.roundTripper.RoundTrip(req)
+		// Reset request body if necessary
+		if req.Body != nil && req.Body != http.NoBody {
+			if req.GetBody == nil {
+				requestLogger.Warn("Cannot retry request with non-rewindable body")
+				break
+			}
+			newBody, gbErr := req.GetBody()
+			if gbErr != nil {
+				requestLogger.Error("Failed to reset request body for retry", zap.Error(gbErr))
+				return resp, gbErr
+			}
+			req.Body = newBody
+		}
+		resp, err = rt.roundTripper.RoundTrip(req)
♻️ Duplicate comments (3)
router-tests/retry_test.go (3)

503-531: Great: base-level invalid algorithm is correctly rejected at init.
Matches expected behavior and error text.


533-558: Nice: invalid algorithm is ignored when retries are disabled (base).
This addresses the previously requested coverage.


560-587: Nice: per-subgraph “disabled ignores algorithm” case covered.
Prevents regressions in validation gating.

🧹 Nitpick comments (14)
router-tests/retry_test.go (8)

38-51: DRY the options setup with a small helper to reduce duplication.
Multiple tests repeat the same NewSubgraphRetryOptions + WithSubgraphRetryOptions pattern; a local builder will simplify maintenance.

Apply this helper once near the top and replace repeated blocks:

+func withRetryOpts(cfg config.TrafficShapingRules, onRetry func(int, *http.Request, *http.Response, time.Duration, error)) core.Option {
+  opts := core.NewSubgraphRetryOptions(cfg)
+  opts.OnRetryFunc = onRetry
+  return core.WithSubgraphRetryOptions(opts)
+}

Also applies to: 92-105, 145-158, 199-212, 260-273, 324-337, 391-404, 451-464


224-231: Decouple success trigger from OnRetry callback semantics.
Using onRetryCounter to flip success couples the test to callback timing. Drive success via request attempt count to avoid false positives if OnRetry invocation timing changes.

- if onRetryCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) {
+ if serviceCallsCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) {
    w.WriteHeader(http.StatusOK)
    _, _ = w.Write([]byte(`{"data":{"employees":[{"id":1},{"id":2}]}}`))
  } else {
    w.WriteHeader(http.StatusBadGateway)
  }
  serviceCallsCounter.Add(1)

256-257: Speed up the 429 + Retry-After test.
This test currently sleeps ~3s (3 retries × 1s). One retry is sufficient to validate header precedence.

- maxRetryCount := 3
+ maxRetryCount := 1

433-434: Avoid flakiness: assert a range instead of NotEqual.
Exact equality with retryInterval can sporadically occur depending on jitter/rounding.

- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+ d := time.Duration(sleepDuration.Load())
+ require.Greater(t, d, time.Duration(0))
+ require.Less(t, d, 2*retryInterval)

494-495: Same here: use bounded checks to reduce flakiness.
Zero Retry-After should fall back to jitter/backoff; prefer range assertions.

- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+ d := time.Duration(sleepDuration.Load())
+ require.Greater(t, d, time.Duration(0))
+ require.Less(t, d, 2*retryInterval)

360-362: Prefer JSON structural equality over string equality for response bodies.
String comparisons are brittle to formatting/ordering changes.

- require.Equal(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)
+ require.JSONEq(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)

(Apply similarly to the other exact body checks.)

Also applies to: 681-685, 761-763


366-372: Widen timing slack to reduce CI flakes.
A 20ms/100ms margin can be tight under load; slightly larger buffers help.

- shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (20 * time.Millisecond)
+ shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (50 * time.Millisecond)
...
- expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond)
+ expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (150 * time.Millisecond)

621-697: Per-subgraph retry counts validated correctly.
Isolated counters for employees vs test1 look good.

If useful, I can add a combined query hitting both subgraphs in a single operation to verify independent retry accounting within one request.

router/internal/retrytransport/retry_transport.go (6)

142-144: Honor context cancellation/deadlines during backoff sleep.

Avoid sleeping past ctx deadlines; exit early if canceled.

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration or context cancellation
+		if sleepDuration > 0 {
+			select {
+			case <-req.Context().Done():
+				return resp, req.Context().Err()
+			case <-time.After(sleepDuration):
+			}
+		}

59-66: Also honor Retry-After for 503 Service Unavailable.

RFC 7231 allows Retry-After on 503 responses as well as 429. This improves interoperability with upstreams.

-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}

124-128: Generic log for Retry-After usage (it may be 429 or 503).

-			requestLogger.Debug("Using Retry-After header for 429 response",
+			requestLogger.Debug("Using Retry-After header",
 				zap.Int("retry", retries),
 				zap.String("url", req.URL.String()),
 				zap.Duration("retry-after", sleepDuration),
 			)

51-53: Lower log level for malformed Retry-After headers.

Header parsing errors are client-observable but non-fatal; debug (or warn) is less noisy in production.

-		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+		logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))

173-176: Tighten the comment wording.

-	// When we close the body only will go internally marks the persisted connection as true
-	// which is important so that it can reuse the connection internally for retrying
+	// Draining then closing marks the connection as reusable, allowing the client to reuse it for retries.

105-111: Optional: short-circuit when retries are effectively disabled.

If MaxRetryCount == 0 (or options are disabled), return immediately to avoid backoff setup and logging overhead. Note: per team learning, IsEnabled() handles nil receivers safely if available.

 	retryOptions := rt.retryManager.GetSubgraphOptions(subgraph)
-	if retryOptions == nil {
-		return resp, nil
-	}
+	if retryOptions == nil {
+		return resp, err
+	}
+	// Optional fast-exit:
+	// if retryOptions.MaxRetryCount == 0 { return resp, err }
+	// or if available:
+	// if !retryOptions.IsEnabled() { return resp, err }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d86e5cc and 79cdb75.

📒 Files selected for processing (3)
  • router-tests/retry_test.go (10 hunks)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/internal/retrytransport/manager.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/retry_test.go
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router-tests/retry_test.go
🧬 Code graph analysis (2)
router-tests/retry_test.go (4)
router/core/router.go (3)
  • NewSubgraphRetryOptions (1908-1922)
  • WithSubgraphRetryOptions (1776-1780)
  • Option (172-172)
router/pkg/config/config.go (4)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
  • Config (985-1059)
router/internal/retrytransport/manager.go (1)
  • OnRetryFunc (17-17)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • RunWithError (135-150)
  • Environment (1727-1763)
  • SubgraphsConfig (365-378)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
router-tests/retry_test.go (4)

10-13: Good import adjustments for new API usage.
The added imports align with the builder-style options and testify assertions.


16-23: OnRetry counter helper is concise and thread-safe.
Atomic counters and nil-guard for duration are correct.


589-619: Per-subgraph invalid algorithm correctly fails fast.
Solid negative test to mirror the base-level case.


699-775: Mixed global + per-subgraph overrides behave as intended.
Employees inherit global, test1 overrides; counters confirm isolation.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

147-152: Critical: retries will resend a consumed request body

On retry, req.Body is already consumed. If req.GetBody is unset, retries are unsafe; if set, reset Body before each retry. Otherwise you’ll send an empty body or fail.

Apply:

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
-
-		// drain the previous response before retrying
-		rt.drainBody(resp, requestLogger)
+		// Drain previous response ASAP to free the connection for reuse
+		rt.drainBody(resp, requestLogger)
+
+		// Respect request cancellation while waiting
+		if sleepDuration > 0 {
+			timer := time.NewTimer(sleepDuration)
+			select {
+			case <-req.Context().Done():
+				timer.Stop()
+				return resp, req.Context().Err()
+			case <-timer.C:
+			}
+		}
+
+		// Rewind request body if present
+		if req.Body != nil {
+			if req.GetBody == nil {
+				requestLogger.Debug("cannot retry: non-rewindable request body")
+				break
+			}
+			if newBody, gErr := req.GetBody(); gErr != nil {
+				return resp, gErr
+			} else {
+				req.Body = newBody
+			}
+		}
 
 		// Retry the request
 		resp, err = rt.roundTripper.RoundTrip(req)
♻️ Duplicate comments (1)
router/internal/retrytransport/retry_transport.go (1)

104-108: Nice fix: avoids a second request when no retry options

Returning the outcome of the first attempt preserves semantics and errors.

🧹 Nitpick comments (5)
router/internal/retrytransport/retry_transport.go (5)

58-75: Honor Retry-After for 503 as well and accept zero-second delays

RFC 7231 defines Retry-After for 503 and 429. Also treat “0” as a valid immediate retry (no sleep), not a fallback to random backoff.

Apply:

-func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) {
-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}
@@
-	return duration, duration > 0
+	// Accept immediate retry (0s) as valid
+	return duration, duration >= 0

22-56: Minor: trim header value before parsing; lower log level on parse failures

Real-world proxies sometimes add whitespace; parsing “ 120 ” should work. Also consider debug level to reduce noise.

Apply:

 import (
 	"errors"
 	"io"
 	"net/http"
+	"strings"
 	"strconv"
 	"time"
@@
-func parseRetryAfterHeader(logger *zap.Logger, retryAfter string) time.Duration {
+func parseRetryAfterHeader(logger *zap.Logger, retryAfter string) time.Duration {
 	if retryAfter == "" {
 		return 0
 	}
 
 	var errJoin error
 
-	seconds, err := strconv.Atoi(retryAfter)
+	retryAfter = strings.TrimSpace(retryAfter)
+	seconds, err := strconv.Atoi(retryAfter)
@@
-	if errJoin != nil {
-		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
-	}
+	if errJoin != nil {
+		logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+	}

144-149: Drain before sleeping to free connections earlier

Draining before the wait reduces connection pinning during long sleeps.

(Already included in the diff above.)


183-187: Guard against nil resp in helpers (defensive)

Not strictly necessary given current call sites, but adding a nil check makes the helper safer.

Apply:

 func isResponseOK(resp *http.Response) bool {
+	if resp == nil {
+		return false
+	}
 	// Ensure we don't wait for no reason when subgraphs don't behave
 	// spec-compliant and returns a different status code than 200.
 	return resp.StatusCode >= 200 && resp.StatusCode < 300
 }

175-181: Polish the comment grammar

Improve clarity of the connection reuse note.

Apply:

-	// When we close the body only will go internally marks the persisted connection as true
-	// which is important so that it can reuse the connection internally for retrying
+	// Fully consuming and closing the body marks the connection reusable
+	// so the Transport can keep-alive and reuse it for retries.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 79cdb75 and 3344631.

📒 Files selected for processing (2)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/internal/retrytransport/manager.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:26:16.245Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.245Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
🧬 Code graph analysis (1)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: Analyze (go)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
🔇 Additional comments (5)
router/internal/retrytransport/retry_transport.go (5)

16-20: Good: transport now composes roundTripper and retryManager cleanly

Field injection improves testability and decoupling.


77-87: Constructor wiring looks solid

Signature change and assignments are consistent.


139-142: OnRetry hook usage is correct

API matches Manager.OnRetry(retries, req, resp, sleepDuration, err).


110-111: Usage correct: backoff.New expects (max, interval) signature
The arguments retryOptions.MaxDuration (max) and retryOptions.Interval (interval) align with func New(max time.Duration, interval time.Duration) *Backoff; no changes needed.


96-103: Verify CurrentSubgraphContextKey value type
Ensure that any value stored under rcontext.CurrentSubgraphContextKey is always a string; adjust the type assertion if other types are ever used.

@SkArchon SkArchon changed the title feat: retry per subgraphs feat: specify retry configuration overrides per subgraph Sep 1, 2025
@SkArchon SkArchon requested a review from endigma as a code owner September 2, 2025 11:23
@SkArchon SkArchon requested a review from StarpTech September 3, 2025 13:10
Copy link
Contributor

@StarpTech StarpTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
router/core/graph_server.go (1)

281-283: Wrap retry manager init errors with context for faster triage.

Propagate a contextualized error to distinguish init failures in logs.

Apply this diff:

-        if err != nil {
-            return nil, err
-        }
+        if err != nil {
+            return nil, fmt.Errorf("retry manager initialization failure: %w", err)
+        }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ddb674 and 24fbca2.

📒 Files selected for processing (4)
  • router/core/graph_server.go (4 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (3 hunks)
  • router/core/transport.go (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/core/router_config.go
  • router/core/router.go
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/core/graph_server.go
🧬 Code graph analysis (2)
router/core/transport.go (2)
router/internal/retrytransport/manager.go (1)
  • Manager (33-39)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (77-87)
router/core/graph_server.go (3)
router/internal/retrytransport/manager.go (3)
  • Manager (33-39)
  • NewManager (41-48)
  • OnRetryFunc (16-16)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/core/retry_builder.go (1)
  • BuildRetryFunction (13-46)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
🔇 Additional comments (3)
router/core/graph_server.go (1)

269-286: Centralized retry manager wiring looks good; no nil defensive needed for BuildRetryFunction.

Creating the RetryExpressionManager, building ShouldRetryFunc, and passing OnRetry is the right flow. Skipping a nil‑check on the expr manager aligns with the team’s preference noted in learnings.

router/core/transport.go (2)

351-352: Retry manager plumbing through factory/options/call site looks consistent.

Also applies to: 372-373, 393-395, 439-441


73-75: Constructor signature changed — verify all call sites updated

The constructor now replaces retryOptions with retryManager and the parameter order was changed; confirm every call site was updated to the new order/type to avoid misordered args. Search for usages (e.g. git grep -n "NewCustomTransport(" || rg -n --hidden -uu "NewCustomTransport\s*\(") and fix any mismatches.

@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from d428949 to de80950 Compare September 16, 2025 16:04
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (6)
router/internal/retrytransport/retry_transport_test.go (1)

64-71: Enable retries in test manager to mirror prod behavior.

RetryOptions.Enabled defaults to false; set it explicitly so tests don’t rely on downstream behavior. (Repeats a prior suggestion.)

 func newTestManager(shouldRetry func(error, *http.Request, *http.Response) bool, onRetry OnRetryFunc, opts RetryOptions, subgraphName string) *Manager {
   mgr := NewManager(expr.NewRetryExpressionManager(), func(err error, req *http.Request, resp *http.Response, exprString string) bool {
     return shouldRetry(err, req, resp)
   }, onRetry)
   // attach options for the subgraph
+  opts.Enabled = true
   mgr.retries[subgraphName] = &opts
   return mgr
 }
router/core/transport.go (1)

120-124: Nil-pointer concern addressed; keep as-is.

retryManager.IsEnabled() is nil-receiver safe per codebase convention, so no guard is needed. This supersedes the earlier bot warning about a potential panic.

Consider a small test ensuring: with retries disabled (nil manager), NewCustomTransport picks baseRoundTripper and does not panic.

router-tests/retry_test.go (4)

495-496: Ditto: assert non-zero backoff for zero Retry-After case.

-                require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+                require.Greater(t, sleepDuration.Load(), int64(0))
+                require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))

534-559: “Disabled ignores algorithm” base test — nice.

Matches team guidance; prevents regressions.


561-589: Per‑subgraph “disabled ignores algorithm” test — nice.

Good mirror of the base case.


649-677: Same: relax error matching for per‑subgraph invalid expression.

-        require.ErrorContains(t, err, "failed to add retry expression for subgraph employees: failed to compile retry expression: line 1, column 0: unknown name truethere")
+        require.ErrorContains(t, err, "failed to add retry expression for subgraph employees")
+        require.ErrorContains(t, err, "failed to compile retry expression")
+        require.ErrorContains(t, err, "unknown name")
🧹 Nitpick comments (17)
router-tests/websocket_test.go (1)

709-719: Set algorithm explicitly when retries are enabled; consider a narrower expression.

When Enabled=true, leaving Algorithm empty can trip validation depending on newRetryConfig. Also, "true" will retry on every failure (incl. 4xx), which can mask client/config errors in this test. Suggest setting Algorithm and using the default expression.

 opts := core.NewSubgraphRetryOptions(config.TrafficShapingRules{
   All: config.GlobalSubgraphRequestRule{
     BackoffJitterRetry: config.BackoffJitterRetry{
       Enabled:     true,
+      Algorithm:   "backoff_jitter",
       MaxAttempts: 5,
       MaxDuration: 10 * time.Second,
       Interval:    200 * time.Millisecond,
-      Expression:  "true",
+      Expression:  "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()",
     },
   },
 })

If unconditional retry is intentional here, keep Expression: "true" but still set Algorithm.

router-tests/circuit_breaker_test.go (2)

627-758: WebSocket conn leak on success path — close the client connection.

After the successful subscription flow, the WebSocket client connection isn’t closed. Close it to avoid leaking fds in CI and to keep tests hermetic.

Apply:

@@
 			xEnv.WaitForSubscriptionCount(0, time.Second*5)
+			require.NoError(t, conn.Close())
 		})
 	})

709-709: Stabilize assertion: compare JSON payloads semantically, not byte-for-byte.

Use JSONEq to avoid brittleness from spacing/ordering differences in error frames.

-				require.Equal(t, defaultErrorMessage, string(message))
+				require.JSONEq(t, defaultErrorMessage, string(message))
router/internal/circuit/breaker.go (2)

30-39: Don’t unconditionally override context-derived subgraph; fall back only when override is non-empty.

Right now the context value is computed and immediately discarded. Preserve it when the override returns an empty string to avoid routing “no subgraph” and silently bypassing breakers.

-	subgraph = rt.getActiveSubgraphName(req)
+	if fn := rt.getActiveSubgraphName; fn != nil {
+		if sg := fn(req); sg != "" {
+			subgraph = sg
+		}
+	}

18-24: Defensive default for nil getActiveSubgraphName.

A nil callback will panic on RoundTrip. Guard in the constructor with a safe default to the context-derived subgraph.

-func NewCircuitTripper(roundTripper http.RoundTripper, breaker *Manager, logger func(req *http.Request) *zap.Logger, getActiveSubgraphName func(req *http.Request) string) *Breaker {
+func NewCircuitTripper(roundTripper http.RoundTripper, breaker *Manager, logger func(req *http.Request) *zap.Logger, getActiveSubgraphName func(req *http.Request) string) *Breaker {
+	if getActiveSubgraphName == nil {
+		getActiveSubgraphName = func(req *http.Request) string {
+			if sg, ok := req.Context().Value(rcontext.CurrentSubgraphContextKey{}).(string); ok {
+				return sg
+			}
+			return ""
+		}
+	}
 	return &Breaker{
 		circuitBreaker:        breaker,
 		loggerFunc:            logger,
 		roundTripper:          roundTripper,
 		getActiveSubgraphName: getActiveSubgraphName,
 	}
}
router/internal/retrytransport/retry_transport.go (1)

135-143: Respect request cancellation during backoff and reset request body before retries.

  • Sleep should be ctx-aware to avoid hanging on canceled requests.
  • For requests with bodies, reinitialize req.Body via GetBody before re-sending; otherwise retries after a consumed body can fail.
-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration, but respect cancellation
+		select {
+		case <-time.After(sleepDuration):
+		case <-req.Context().Done():
+			return resp, req.Context().Err()
+		}
 
 		// drain the previous response before retrying
 		rt.drainBody(resp, requestLogger)
 
 		// Retry the request
-		resp, err = rt.roundTripper.RoundTrip(req)
+		// Rewind body if present
+		if req.Body != nil {
+			if req.GetBody == nil {
+				requestLogger.Warn("Request has non-rewindable body; aborting retries")
+				return resp, err
+			}
+			if nb, gErr := req.GetBody(); gErr != nil {
+				requestLogger.Warn("Failed to reset request body", zap.Error(gErr))
+				return resp, gErr
+			} else {
+				req.Body = nb
+			}
+		}
+		resp, err = rt.roundTripper.RoundTrip(req)
router/core/transport.go (2)

87-89: Nil-safe logger fallback: good; tiny alloc nit.

Returning zap.NewNop() per call is fine; you can cache it to avoid repeated allocs.

Apply:

-	getRequestContextLogger := func(req *http.Request) *zap.Logger {
+	nopLogger := zap.NewNop()
+	getRequestContextLogger := func(req *http.Request) *zap.Logger {
 		reqContext := getRequestContext(req.Context())
 		if reqContext == nil {
-			return zap.NewNop()
+			return nopLogger
 		}
 		return reqContext.Logger()
 	}

382-383: TransportOptions adds RetryManager: LGTM; clarify nil is allowed.

Optional: add a doc comment that nil means “retries disabled/default only.”

router-tests/retry_test.go (9)

223-233: Avoid coupling test behavior to OnRetryFunc invocation semantics.

Using onRetryCounter to flip success implicitly relies on when OnRetryFunc fires. Prefer basing success on serviceCallsCounter (observable I/O), which is less brittle.

Apply:

-                            if onRetryCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) {
+                            if serviceCallsCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) {
                               w.WriteHeader(http.StatusOK)
                               _, _ = w.Write([]byte(`{"data":{"employees":[{"id":1},{"id":2}]}}`))
                             } else {
                               w.WriteHeader(http.StatusBadGateway)
                             }
                             serviceCallsCounter.Add(1)

297-306: Respecting Retry-After is correct; consider also covering HTTP-date format.

Spec allows Retry-After as delta-seconds or an HTTP-date. Add a sibling subtest exercising the HTTP-date form to lock parser behavior.

Would you like me to draft that subtest?


367-373: Deflake timing bounds.

The 20/100 ms margins may be tight under CI noise. Widen slightly to reduce flakes while preserving the assertion intent.

-            shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (20 * time.Millisecond)
-            require.Less(t, requestDuration, shouldBeLessThanDuration)
-
-            // We reduce by 100 for any jitter
-            expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond)
-            require.GreaterOrEqual(t, requestDuration, expectedMinDuration)
+            upperBound := (time.Duration(maxRetryCount-1) * retryInterval) - (50 * time.Millisecond)
+            require.Less(t, requestDuration, upperBound)
+
+            // Allow more slack for jitter and scheduler variability
+            lowerBound := (time.Duration(maxRetryCount-1) * maxDuration) - (150 * time.Millisecond)
+            require.GreaterOrEqual(t, requestDuration, lowerBound)

430-435: Strengthen assertion: also assert non-zero backoff.

Jitter may occasionally equal the nominal interval; NotEqual is weak. At minimum, assert it’s > 0 to ensure a backoff occurred.

-                require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+                require.Greater(t, sleepDuration.Load(), int64(0))
+                require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))

621-647: Relax overly specific error substring.

Error details like “line 1, column 0” can change. Assert the stable portions to avoid brittle tests.

-        require.ErrorContains(t, err, "failed to add base retry expression: failed to compile retry expression: line 1, column 0: unknown name truethere")
+        require.ErrorContains(t, err, "failed to add base retry expression")
+        require.ErrorContains(t, err, "failed to compile retry expression")
+        require.ErrorContains(t, err, "unknown name")

757-763: Tidy: underscore unused handler param in middleware.

Avoids linter noise and clarifies intent.

-                    Middleware: func(handler http.Handler) http.Handler {
+                    Middleware: func(_ http.Handler) http.Handler {

Also applies to: 835-841


931-959: Close the WS client connection with defer.

Ensures cleanup on early failures.

-            conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil)
+            conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil)
+            defer func() { _ = conn.Close() }()
@@
-            require.NoError(t, conn.Close())
+            // Closed by defer

Also applies to: 997-1004, 1027-1029


865-929: Cover “All enabled + specific subgraph disabled” override.

Add a subtest where All.Enabled=true but Subgraphs["test1"].Enabled=false, and assert test1 has zero retries while employees inherits All.

I can draft this subtest if you want it in this PR.


845-852: Prefer JSONEq for response assertions to reduce brittleness.
Replace raw string equality checks of JSON responses with require.JSONEq(t, expectedJSON, res.Body) to avoid failures from insignificant formatting/ordering changes.
Affected: router-tests/retry_test.go — lines 78, 129, 182, 241, 298, 362, 428, 489, 771, 781, 849, 859, 915, 925.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24fbca2 and d428949.

📒 Files selected for processing (9)
  • router-tests/circuit_breaker_test.go (2 hunks)
  • router-tests/retry_test.go (11 hunks)
  • router-tests/websocket_test.go (2 hunks)
  • router/core/router_config.go (2 hunks)
  • router/core/transport.go (6 hunks)
  • router/internal/circuit/breaker.go (2 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
  • router/internal/retrytransport/retry_transport_test.go (17 hunks)
  • router/internal/traceclient/traceclient.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/core/router_config.go
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.

Applied to files:

  • router/internal/traceclient/traceclient.go
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/internal/retrytransport/retry_transport_test.go
  • router/internal/retrytransport/retry_transport.go
  • router/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router/internal/retrytransport/retry_transport_test.go
  • router-tests/websocket_test.go
  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
  • router/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router/internal/retrytransport/retry_transport_test.go
  • router-tests/websocket_test.go
  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
  • router/core/transport.go
📚 Learning: 2025-09-16T16:00:04.078Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/transport.go:0-0
Timestamp: 2025-09-16T16:00:04.078Z
Learning: The IsEnabled() method on *retrytransport.Manager in router/core/transport.go safely handles nil receivers by returning false, so no external nil checks are needed when calling retryManager.IsEnabled().

Applied to files:

  • router/internal/retrytransport/retry_transport_test.go
  • router/internal/retrytransport/retry_transport.go
  • router/core/transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/websocket_test.go
  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.

Applied to files:

  • router-tests/circuit_breaker_test.go
  • router-tests/retry_test.go
📚 Learning: 2025-08-28T10:07:27.650Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router-tests/retry_test.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router-tests/retry_test.go
🧬 Code graph analysis (8)
router/internal/circuit/breaker.go (1)
router/internal/circuit/manager.go (1)
  • Manager (29-35)
router/internal/traceclient/traceclient.go (2)
router/pkg/metric/connection_metric_store.go (1)
  • ConnectionMetricStore (25-29)
router/internal/expr/expr.go (4)
  • Context (35-39)
  • Request (66-75)
  • Subgraph (141-146)
  • ClientTrace (135-138)
router/internal/retrytransport/retry_transport_test.go (3)
router/internal/retrytransport/manager.go (4)
  • OnRetryFunc (16-16)
  • RetryOptions (24-31)
  • Manager (33-39)
  • NewManager (41-48)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (76-83)
router-tests/websocket_test.go (2)
router/core/router.go (2)
  • NewSubgraphRetryOptions (1933-1947)
  • WithSubgraphRetryOptions (1801-1805)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router-tests/circuit_breaker_test.go (2)
router-tests/testenv/testenv.go (10)
  • Run (107-124)
  • Config (286-342)
  • LogObservationConfig (388-391)
  • SubgraphsConfig (367-380)
  • SubgraphConfig (382-386)
  • WSReadMessage (2656-2697)
  • WSWriteMessage (2742-2783)
  • Environment (1729-1765)
  • WSWriteJSON (2785-2826)
  • WebSocketMessage (2281-2285)
router/pkg/config/config.go (2)
  • Config (987-1061)
  • EngineExecutionConfiguration (371-400)
router-tests/retry_test.go (4)
router/core/router.go (3)
  • NewSubgraphRetryOptions (1933-1947)
  • WithSubgraphRetryOptions (1801-1805)
  • Option (172-172)
router/pkg/config/config.go (4)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
  • Config (987-1061)
router/internal/retrytransport/manager.go (1)
  • OnRetryFunc (16-16)
router-tests/testenv/testenv.go (6)
  • Run (107-124)
  • RunWithError (137-152)
  • Environment (1729-1765)
  • WSReadMessage (2656-2697)
  • WSWriteMessage (2742-2783)
  • WSWriteJSON (2785-2826)
router/internal/retrytransport/retry_transport.go (2)
router/internal/retrytransport/manager.go (1)
  • Manager (33-39)
router/internal/expr/expr.go (1)
  • Request (66-75)
router/core/transport.go (4)
router/internal/retrytransport/manager.go (1)
  • Manager (33-39)
router/internal/traceclient/traceclient.go (1)
  • NewTraceInjectingRoundTripper (41-53)
router/internal/circuit/breaker.go (1)
  • NewCircuitTripper (18-25)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (76-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (13)
router-tests/websocket_test.go (1)

749-749: LGTM: exercising retry options in WS header/params forwarding test.

Good to include retry wiring here to catch regressions in the WS upgrade path. Please confirm option ordering with WithHeaderRules is order‑insensitive in RouterOptions.

router/internal/traceclient/traceclient.go (1)

109-116: Confirmed — constructor is always passed non-nil callbacks.
Only call site is router/core/transport.go:113 — it passes the locally defined getExprContext and getActiveSubgraphName (defined immediately above), so nil checks are unnecessary.

router/core/transport.go (7)

93-104: getActiveSubgraphName helper: LGTM.

Gracefully handles missing context and subgraph; consistent with downstream uses.


105-114: Trace-injecting RT wiring: LGTM.

Nil req-context returns empty expr.Context; signature matches traceclient.NewTraceInjectingRoundTripper.


116-118: Circuit breaker enriched with subgraph name: LGTM.

Parameter order matches internal API; no behavior concerns.


361-366: TransportFactory gains retryManager field: LGTM.

Plumbs manager through the factory cleanly.


403-405: NewTransport copies RetryManager: LGTM.


449-451: RoundTripper passes retryManager into NewCustomTransport: LGTM.


73-75: Approve: signature change to accept retryManager.
Only occurrences are the constructor and the single call at router/core/transport.go:443 — call site updated to pass retryManager.

router-tests/retry_test.go (4)

17-24: Helper is concurrency-safe and minimal. LGTM.

Captures retry count and duration without dereferencing resp/err; safe for nils.


39-52: Good adoption of NewSubgraphRetryOptions + WithSubgraphRetryOptions.

The builder-style opts + OnRetryFunc wiring is clear and consistent.


709-741: Per‑subgraph attempts are validated cleanly. LGTM.

Isolates employees/test1 and checks attempts = retries + 1.


787-818: Mixed All + per‑subgraph inheritance is covered. LGTM.

Confirms global defaults apply to employees and per‑subgraph override for test1.

Also applies to: 820-863

@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from de80950 to f4090fa Compare September 16, 2025 16:08
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

134-136: Honor context cancellation during backoff sleep.

Uninterruptible time.Sleep blocks cancellation. Use a timer and select on req.Context().Done().

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration or until the request is canceled
+		if sleepDuration > 0 {
+			timer := time.NewTimer(sleepDuration)
+			select {
+			case <-req.Context().Done():
+				timer.Stop()
+				return resp, req.Context().Err()
+			case <-timer.C:
+			}
+		}
♻️ Duplicate comments (3)
router-tests/retry_test.go (2)

534-559: LGTM and consistent with team guidance: disabled retries ignore algorithm.

Matches prior guidance that algorithm is not validated when Enabled=false; good to lock with a test.


561-588: LGTM: per-subgraph “disabled ignores algorithm” case covered.

Mirrors the base case; prevents regressions.

router/internal/retrytransport/manager.go (1)

50-118: Make Initialize thread-safe and avoid stale/partial state (atomic map swap).

Initialize builds/assigns m.retries without any write lock, while readers use RLock. Also, repeated Initialize calls will leak stale entries (e.g., a subgraph disabled/removed later remains present). Build a new map, validate/fill it, then swap under a write lock.

Apply:

 func (m *Manager) Initialize(baseRetryOptions RetryOptions, subgraphRetryOptions map[string]RetryOptions, routerConfig *nodev1.RouterConfig) error {
+	// Build into a fresh map to avoid partial/dirty reads and stale entries.
+	nextRetries := make(map[string]*RetryOptions)

 	// Get the list of all subgraph AND feature subgraphs
 	subgraphNameSet := make(map[string]bool, len(routerConfig.Subgraphs))
@@
-				// Only assign default options if validation succeeds
-				for _, sgName := range defaultSgNames {
-					opts := baseRetryOptions
-					m.retries[sgName] = &opts
-				}
+				// Only assign default options if validation succeeds
+				for _, sgName := range defaultSgNames {
+					opts := baseRetryOptions // copy
+					nextRetries[sgName] = &opts
+				}
 			}
 		}
 	}
@@
-	for _, sgName := range customSgNames {
-		entry, ok := subgraphRetryOptions[sgName]
-		if !ok {
-			return fmt.Errorf("failed to add base retry expression: %s", sgName)
-		}
+	for _, sgName := range customSgNames {
+		entry := subgraphRetryOptions[sgName]
@@
-		opts := entry
-		m.retries[sgName] = &opts
+		opts := entry // copy
+		nextRetries[sgName] = &opts
 	}
 
-	return nil
+	// Atomic publish
+	m.lock.Lock()
+	m.retries = nextRetries
+	m.lock.Unlock()
+	return nil
 }
🧹 Nitpick comments (14)
router-tests/retry_test.go (5)

325-337: Fix flaky lower-bound assertion in “max duration is not exceeded” test.

The lower-bound check assumes near‑max jitter per attempt; with full jitter it can be much smaller, causing flakes. Assert an upper bound against (retries-1)*MaxDuration instead.

Apply this diff:

-      // We reduce by 100 for any jitter
-      expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond)
-      require.GreaterOrEqual(t, requestDuration, expectedMinDuration)
+      // Each sleep must not exceed MaxDuration; total must be <= (retries-1)*MaxDuration (+slack for processing)
+      maxAllowed := (time.Duration(maxRetryCount-1) * maxDuration) + (150 * time.Millisecond)
+      require.LessOrEqual(t, requestDuration, maxAllowed)

Also applies to: 367-373


392-405: Strengthen jitter assertions for 429 without Retry-After.

NotEqual to the base interval is fragile. Assert the jittered sleep is >0 and <= interval.

Apply this diff:

-        require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+        require.Greater(t, time.Duration(sleepDuration.Load()), time.Duration(0))
+        require.LessOrEqual(t, time.Duration(sleepDuration.Load()), retryInterval)

Also applies to: 434-435


452-465: Mirror stronger jitter assertions for zero Retry-After case.

Same rationale as above; assert 0 < sleep <= interval.

Apply this diff:

-        require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+        require.Greater(t, time.Duration(sleepDuration.Load()), time.Duration(0))
+        require.LessOrEqual(t, time.Duration(sleepDuration.Load()), retryInterval)

Also applies to: 495-496


931-979: Rename duplicate test title to avoid confusion.

Two subtests share the exact name “verify retries are applied when only all and no subgraph specific overrides are present”. Rename this one to reflect the FF scenario (e.g., “... under feature‑flagged subgraph (products_fg)”).

Apply this diff:

-t.Run("verify retries are applied when only all and no subgraph specific overrides are present", func(t *testing.T) {
+t.Run("verify retries are applied with only 'all' config under feature-flagged subgraph (products_fg)", func(t *testing.T) {

865-929: Add an inheritance test: partial per-subgraph override should merge with All.

We need to assert that unspecified per-subgraph fields inherit from All (e.g., Expression/Interval), as requested in review. Add a focused test that overrides only MaxAttempts for employees while inheriting Expression from All.

Apply this addition near this block:

+ t.Run("per-subgraph override inherits unspecified fields from all (expression + interval)", func(t *testing.T) {
+   t.Parallel()
+
+   employeesCalls := atomic.Int32{}
+   test1Calls := atomic.Int32{}
+
+   // All enables retries with expression targeting 502; employees overrides only MaxAttempts.
+   allMax := 4
+   empMax := 2
+   opts := core.NewSubgraphRetryOptions(config.TrafficShapingRules{
+     All: config.GlobalSubgraphRequestRule{
+       BackoffJitterRetry: config.BackoffJitterRetry{
+         Enabled:     true,
+         MaxAttempts: allMax,
+         MaxDuration: 2 * time.Second,
+         Interval:    10 * time.Millisecond,
+         Expression:  "statusCode == 502",
+       },
+     },
+     Subgraphs: map[string]config.GlobalSubgraphRequestRule{
+       "employees": {
+         BackoffJitterRetry: config.BackoffJitterRetry{
+           Enabled:     true,
+           MaxAttempts: empMax, // inherit Expression and Interval from All
+         },
+       },
+     },
+   })
+   options := core.WithSubgraphRetryOptions(opts)
+
+   testenv.Run(t, &testenv.Config{
+     NoRetryClient: true,
+     RouterOptions: []core.Option{options},
+     Subgraphs: testenv.SubgraphsConfig{
+       Employees: testenv.SubgraphConfig{
+         Middleware: func(_ http.Handler) http.Handler {
+           return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
+             w.WriteHeader(http.StatusBadGateway) // 502 matches All.Expression
+             employeesCalls.Add(1)
+           })
+         },
+       },
+       Test1: testenv.SubgraphConfig{
+         Middleware: func(_ http.Handler) http.Handler {
+           return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
+             w.WriteHeader(http.StatusBadGateway)
+             test1Calls.Add(1)
+           })
+         },
+       },
+     },
+   }, func(t *testing.T, xEnv *testenv.Environment) {
+     // employees inherits Expression/Interval, but uses empMax attempts
+     resEmp, err := xEnv.MakeGraphQLRequest(testenv.GraphQLRequest{
+       Query: `query employees { employees { id } }`,
+     })
+     require.NoError(t, err)
+     require.Contains(t, resEmp.Body, `"statusCode":502`)
+     require.Equal(t, int32(empMax+1), employeesCalls.Load())
+
+     // test1 uses All config entirely
+     resTest1, err := xEnv.MakeGraphQLRequest(testenv.GraphQLRequest{
+       Query: `query { floatField(arg: 1.5) }`,
+     })
+     require.NoError(t, err)
+     require.Contains(t, resTest1.Body, `"statusCode":502`)
+     require.Equal(t, int32(allMax+1), test1Calls.Load())
+   })
+ })
router/internal/retrytransport/manager.go (4)

102-104: Fix wrong algorithm value in error message.

The check uses entry.Algorithm, but the error prints baseRetryOptions.Algorithm.

-		return fmt.Errorf("unsupported retry algorithm for subgraph %s: %s", sgName, baseRetryOptions.Algorithm)
+		return fmt.Errorf("unsupported retry algorithm for subgraph %s: %s", sgName, entry.Algorithm)

97-101: Remove redundant map existence check and incorrect error text.

customSgNames is derived from subgraphRetryOptions, so ok is guaranteed true. The current error text is also misleading.

-	entry, ok := subgraphRetryOptions[sgName]
-	if !ok {
-		return fmt.Errorf("failed to add base retry expression: %s", sgName)
-	}
+	entry := subgraphRetryOptions[sgName]

50-62: Precondition: routerConfig must be non-nil.

Function dereferences routerConfig; if it's ever nil (e.g., during hot‑reload failure), this panics. If callers guarantee non‑nil, document it; otherwise guard early.

Would you like me to add a fast precondition check and unit test?


120-132: Avoid exposing internal pointers or document immutability.

GetSubgraphOptions returns a pointer from internal state. Either return a copy or clearly document that callers must treat the returned options as read‑only.

I can switch this to return a value (copy) with negligible overhead.

router/internal/retrytransport/retry_transport.go (5)

58-61: Support Retry-After on 503 (Service Unavailable) as well.

RFC allows Retry‑After with 503 in addition to 429.

-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}

49-52: Lower log level for malformed Retry-After.

Parsing failures are client‑visible only when we intend to honor it; Debug is more appropriate to avoid noisy Error logs.

-		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+		logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))

92-98: Guard nil getActiveSubgraph to avoid panic; default to empty name.

-	activeSubgraph := rt.getActiveSubgraph(req)
+	activeSubgraph := ""
+	if rt.getActiveSubgraph != nil {
+		activeSubgraph = rt.getActiveSubgraph(req)
+	}
 	retryOptions := rt.retryManager.GetSubgraphOptions(activeSubgraph)

103-104: Default to a Nop logger when getRequestLogger is nil or returns nil.

Prevents panics and keeps logs safe.

-	requestLogger := rt.getRequestLogger(req)
+	requestLogger := zap.NewNop()
+	if rt.getRequestLogger != nil {
+		if l := rt.getRequestLogger(req); l != nil {
+			requestLogger = l
+		}
+	}

76-82: Provide a default RoundTripper when nil is passed.

Avoid nil deref if callers forget to set it.

 func NewRetryHTTPTransport(roundTripper http.RoundTripper, getRequestLogger requestLoggerGetter, retryManager *Manager, getActiveSubgraph func(req *http.Request) string) *RetryHTTPTransport {
+	if roundTripper == nil {
+		roundTripper = http.DefaultTransport
+	}
 	return &RetryHTTPTransport{
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4090fa and bceee7e.

📒 Files selected for processing (4)
  • router-tests/retry_test.go (11 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/core/graph_server.go
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/internal/retrytransport/manager.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-16T16:00:04.078Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/transport.go:0-0
Timestamp: 2025-09-16T16:00:04.078Z
Learning: The IsEnabled() method on *retrytransport.Manager in router/core/transport.go safely handles nil receivers by returning false, so no external nil checks are needed when calling retryManager.IsEnabled().

Applied to files:

  • router/internal/retrytransport/manager.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/internal/retrytransport/manager.go
  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router/internal/retrytransport/manager.go
  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T10:07:27.650Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router/internal/retrytransport/manager.go
  • router-tests/retry_test.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router/internal/retrytransport/manager.go
  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router-tests/retry_test.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.

Applied to files:

  • router-tests/retry_test.go
🧬 Code graph analysis (3)
router/internal/retrytransport/manager.go (2)
router/internal/expr/expr.go (1)
  • Request (66-75)
router/internal/expr/retry_expression.go (1)
  • RetryExpressionManager (12-14)
router-tests/retry_test.go (4)
router/core/router.go (3)
  • NewSubgraphRetryOptions (1933-1947)
  • WithSubgraphRetryOptions (1801-1805)
  • Option (172-172)
router/pkg/config/config.go (4)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
  • Config (987-1061)
router/internal/retrytransport/manager.go (1)
  • OnRetryFunc (16-16)
router-tests/testenv/testenv.go (3)
  • Run (107-124)
  • RunWithError (137-152)
  • Environment (1729-1765)
router/internal/retrytransport/retry_transport.go (1)
router/internal/retrytransport/manager.go (1)
  • Manager (33-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (18)
router-tests/retry_test.go (14)

39-51: LGTM: base retry config wiring is correct.

Using NewSubgraphRetryOptions + WithSubgraphRetryOptions and OnRetryFunc in the mutation test is correct and aligns with the new API.


93-106: LGTM: expression=false disables retries as expected.

Solid check that no retries occur when the expression evaluates to false.


146-158: LGTM: “all-fail” path counts are correct.

Assertions for retry count and service calls (retries+1) are accurate.


200-212: LGTM: partial-fail then success scenario.

Good use of OnRetry counter to flip behavior at N and verify final success.


261-273: LGTM: Retry-After (seconds) honored.

Asserting sleepDuration equals header seconds is a good correctness check.


507-532: LGTM: invalid algorithm at base fails init.

Error string check is precise; good.


593-619: LGTM: invalid algorithm per-subgraph fails init.

Good negative coverage with clear error expectation.


624-647: LGTM: invalid base expression fails with actionable error.

The asserted message is specific; great.


649-677: LGTM: invalid per-subgraph expression fails with subgraph context.

Clear error surface including subgraph name.


682-707: LGTM: valid algorithm/expression happy-path init.

Covers the positive path explicitly.


715-785: LGTM: per-subgraph retry counts enforced independently.

Accurate expectations for employees vs test1.


796-863: LGTM: mixed All + per-subgraph overrides behave independently.

Solid coverage for defaulting employees to All and overriding test1.


873-929: LGTM: All-only defaults applied to multiple subgraphs.

Both subgraphs respect the global max attempts.


996-1079: LGTM: upgrade path retries fixed and validated.

Good regression test for the websocket upgrade path; the counter expectation (failedTries+1) is spot on.

router/internal/retrytransport/manager.go (2)

134-143: LGTM: IsEnabled implementation is safe and clear.

Simple length check under RLock is fine.


145-150: LGTM: Retry() nil-guard.

Returning false when retryFunc is nil prevents panics.

router/internal/retrytransport/retry_transport.go (2)

85-91: LGTM: single-attempt fast path before retry evaluation.

Early return on clean 2xx is consistent with your stated design.


153-171: LGTM: connection reuse via body drain is correct.

The drain+close pattern is appropriate for keep‑alive reuse.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/expr/retry_expression.go (1)

36-40: Default expression helpers likely not visible to the VM (pointer env + run with &ctx).

Your default expression calls IsRetryableStatusCode()/IsConnectionError()/IsTimeout(), but the compile options only expose RetryContext as a value. If those helpers are pointer-receiver methods on RetryContext (common), they won’t be visible. Compile will fail or the functions will be unresolved.

  • Use a pointer env in expr.Env.
  • Run with &ctx to match the env.
  • If helpers aren’t methods, register them via expr.Function(...) during compile.

Apply this diff:

@@
-	options := []expr.Option{
-		expr.Env(RetryContext{}),
-		expr.AsKind(reflect.Bool),
-	}
+	options := []expr.Option{
+		// Pointer env so pointer‑receiver helpers on RetryContext are available.
+		expr.Env(&RetryContext{}),
+		expr.AsKind(reflect.Bool),
+		// If helpers are not methods on RetryContext, register them explicitly here via expr.Function(...).
+	}
@@
-	result, err := expr.Run(program, ctx)
+	// Run with pointer env to match expr.Env(&RetryContext{}).
+	result, err := expr.Run(program, &ctx)

Also applies to: 69-69, 16-16

♻️ Duplicate comments (1)
router/core/transport.go (1)

120-122: Relying on nil-safe IsEnabled() is fine here.

Acknowledged team convention that (*retrytransport.Manager).IsEnabled() handles nil receivers by returning false.

🧹 Nitpick comments (2)
router/internal/traceclient/traceclient.go (1)

117-122: Avoid emitting empty subgraph attribute.

Skip WgSubgraphName when getActiveSubgraphName(req) returns "". Keeps metrics cardinality cleaner.

Apply:

-        serverAttributes = append(
-            serverAttributes,
-            rotel.WgClientReusedConnection.Bool(trace.ConnectionAcquired.Reused),
-            rotel.WgSubgraphName.String(subgraph),
-        )
+        serverAttributes = append(
+            serverAttributes,
+            rotel.WgClientReusedConnection.Bool(trace.ConnectionAcquired.Reused),
+        )
+        if subgraph != "" {
+            serverAttributes = append(serverAttributes, rotel.WgSubgraphName.String(subgraph))
+        }
router/internal/expr/retry_expression.go (1)

25-31: Normalize keys; comment says “normalized” but no normalization is done.

Trim whitespace so logically identical expressions dedupe to the same key.

Apply this diff:

@@
-func (m *RetryExpressionManager) AddExpression(exprString string) error {
-	expression := exprString
+import (
+	"fmt"
+	"reflect"
+	"strings"
+	"sync"
+@@
+)
+@@
+func (m *RetryExpressionManager) AddExpression(exprString string) error {
+	expression := strings.TrimSpace(exprString)
@@
-	// Use the normalized expression string as the key for deduplication
+	// Use the trimmed expression string as the key for deduplication
 	m.expressionMap[expression] = program

Also trim in ShouldRetry:

-	expression := expressionString
+	expression := strings.TrimSpace(expressionString)

Also applies to: 47-49

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bceee7e and f353169.

📒 Files selected for processing (7)
  • router-tests/circuit_breaker_test.go (2 hunks)
  • router-tests/retry_test.go (11 hunks)
  • router/core/transport.go (6 hunks)
  • router/internal/expr/retry_expression.go (2 hunks)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport_test.go (18 hunks)
  • router/internal/traceclient/traceclient.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router-tests/circuit_breaker_test.go
  • router/internal/retrytransport/manager.go
  • router/internal/retrytransport/retry_transport_test.go
  • router-tests/retry_test.go
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.650Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.

Applied to files:

  • router/internal/traceclient/traceclient.go
📚 Learning: 2025-09-01T13:26:16.284Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.284Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/core/transport.go
📚 Learning: 2025-09-16T16:00:04.078Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/transport.go:0-0
Timestamp: 2025-09-16T16:00:04.078Z
Learning: The IsEnabled() method on *retrytransport.Manager in router/core/transport.go safely handles nil receivers by returning false, so no external nil checks are needed when calling retryManager.IsEnabled().

Applied to files:

  • router/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router/core/transport.go
📚 Learning: 2025-09-01T13:44:13.045Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.045Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router/core/transport.go
🧬 Code graph analysis (3)
router/internal/traceclient/traceclient.go (2)
router/pkg/metric/connection_metric_store.go (1)
  • ConnectionMetricStore (25-29)
router/internal/expr/expr.go (4)
  • Context (35-39)
  • Request (66-75)
  • Subgraph (141-146)
  • ClientTrace (135-138)
router/core/transport.go (4)
router/internal/retrytransport/manager.go (1)
  • Manager (33-39)
router/internal/traceclient/traceclient.go (1)
  • NewTraceInjectingRoundTripper (41-53)
router/internal/circuit/breaker.go (1)
  • NewCircuitTripper (17-24)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (76-83)
router/internal/expr/retry_expression.go (1)
router/internal/expr/retry_context.go (1)
  • RetryContext (13-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (10)
router/internal/traceclient/traceclient.go (2)

35-38: Good DI split: expr context vs. subgraph name.

Separating these callbacks makes the round-tripper reusable and easier to test. Constructor wiring looks consistent.

Also applies to: 41-46, 48-52


109-114: Confirm expr path: Subgraph.Request.ClientTrace.

Please verify this field path matches expr.Context schema used elsewhere (some code paths historically used Request.Trace.*). If this struct layout changed in this PR, ensure all evaluators reference the new location.

router/core/transport.go (7)

85-91: No-op logger fallback is sensible.

Returning zap.NewNop() when request context is absent avoids panics in downstream wrappers.


106-114: Expr context getter: safe default.

Returning &expr.Context{} when req context is nil avoids nil deref in the trace round-tripper while keeping behavior predictable.


116-118: Circuit tripper now subgraph-aware.

Passing getActiveSubgraphName aligns breaker metrics/keys with per-subgraph behavior.


361-383: Options/Factory wiring for RetryManager looks consistent.

Field propagation is straightforward; no extraneous state carried over from the removed RetryOptions.


403-405: Factory initialization passes RetryManager correctly.

No concerns.


449-451: Transport constructed with RetryManager (and subgraph-aware wrappers).

End-to-end wiring looks correct.


93-104: Subgraph name resolver — verify Upgrade/WS/SSE paths

Auto-verification failed: router-tests/tests not found. Confirm ActiveSubgraph(req) is populated for Upgrade (WebSocket) and SSE (text/event-stream) requests so circuit-breaker/retry get a meaningful subgraph; otherwise add a fallback (e.g., from engine hooks or a request-context key). File: router/core/transport.go:93-104

router/internal/expr/retry_expression.go (1)

25-31: No change needed — AddExpression only called for enabled retries.

Initialize() in router/internal/retrytransport/manager.go filters subgraphs into defaultSgNames/customSgNames and only calls m.exprManager.AddExpression(...) when baseRetryOptions.Enabled or entry.Enabled, so disabled subgraphs won’t trigger compilation of the defaultRetryExpression.

@SkArchon
Copy link
Contributor Author

@coderabbitai pause

@coderabbitai
Copy link

coderabbitai bot commented Sep 17, 2025

✅ Actions performed

Reviews paused.

@github-actions
Copy link

github-actions bot commented Oct 2, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 2, 2025
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Oct 16, 2025
@SkArchon SkArchon reopened this Oct 16, 2025
@SkArchon
Copy link
Contributor Author

@StarpTech Could I get a re-review (remaining on my side to fix circuit breaker test)

@github-actions github-actions bot removed the Stale label Oct 17, 2025
@github-actions
Copy link

github-actions bot commented Nov 1, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 1, 2025
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants