
Conversation

@segfault-magnet (Contributor) commented Apr 6, 2025

closes: #55

How the Failover Logic Works

The FailoverClient checks provider health whenever we're about to send a request. If the active provider is marked unhealthy, it automatically connects to the next one. If no providers are healthy, it returns an error. Here's the general flow:

  1. Initialization:

    • The client starts with a list of eth provider endpoints and picks the first available one as "active."
  2. When a Request Arrives:

    • The committer calls a method like get_block_number or submit on the FailoverClient.
  3. Health Check:

    • The active provider's status is examined based on:
      • Permanent Failures (fatal errors) -- things we cannot recover from,
        such as alloy permanently dropping the background task that keeps the
        websocket connection alive.
      • Transient Errors (recoverable network issues) -- e.g. network hiccups.
      • Mempool Drops (lost transactions) -- behavior we experienced with
        Alchemy, where transactions were continually squeezed out of the mempool.
    • If any of these counts exceed their configured thresholds, the provider is considered unhealthy.
  4. Forward or Switch:

    • If healthy: Forward the request, reset transient error counters on success, or increment them on failure.
    • If unhealthy: Find a new provider in the list and retry. If none are healthy, return an "all providers unhealthy" error.
  5. Mempool Drops:

    • If the system detects a lost transaction, note_mempool_drop is called. This bumps the failure counter for the active provider, potentially making it unhealthy for future requests.
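
To make step 4 ("Forward or Switch") concrete, here is a minimal, self-contained sketch of the selection logic. The Provider type, its is_healthy() predicate, and the String error are illustrative stand-ins for this write-up, not the actual FailoverClient API.

    // Illustrative sketch only -- not the real FailoverClient types.
    struct Provider {
        name: String,
        healthy: bool,
    }

    impl Provider {
        fn is_healthy(&self) -> bool {
            self.healthy
        }
    }

    /// Scan forward from the currently active endpoint and return the first
    /// healthy provider. We never cycle back to earlier endpoints; once the
    /// list is exhausted the caller surfaces an "all providers unhealthy"
    /// error (and, in production, Kubernetes restarts the service).
    fn get_healthy_provider(providers: &[Provider], active: usize) -> Result<&Provider, String> {
        providers[active..]
            .iter()
            .find(|p| p.is_healthy())
            .ok_or_else(|| "all providers unhealthy".to_string())
    }

    fn main() {
        let providers = vec![
            Provider { name: "main".into(), healthy: false },
            Provider { name: "backup".into(), healthy: true },
        ];

        // The active provider (index 0) is unhealthy, so the request fails
        // over to the standby.
        let chosen = get_healthy_provider(&providers, 0).expect("one healthy provider remains");
        assert_eq!(chosen.name, "backup");
    }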

Below are two diagrams to illustrate this:

  • High-Level:

    graph LR
        Committer --> FailoverClient;
        FailoverClient --> ProviderA["Provider A (Active)"];
        FailoverClient -.-> ProviderB["Provider B (Standby)"];
        FailoverClient -.-> ProviderC["Provider C (Standby)"];
        ProviderA --> L1NodeA["L1 Node A"];
        ProviderB --> L1NodeB["L1 Node B"];
        ProviderC --> L1NodeC["L1 Node C"];
    
        subgraph "L1 Providers"
            ProviderA
            ProviderB
            ProviderC
        end
    
  • On-Demand Failover Sequence:

    sequenceDiagram
        participant C as Committer
        participant FC as FailoverClient
        participant PA as Provider A
        participant PB as Provider B
    
        C->>FC: Send L1 Request (e.g., submit)
        activate FC
    
        FC->>FC: Check Health of Provider A
        alt Provider A is Healthy
            FC->>PA: Forward Request
            activate PA
            alt Request Succeeds
                PA-->>FC: Success
                FC->>FC: Reset transient error count for PA
            else Request Fails (Transient Error)
                PA-->>FC: Error
                FC->>FC: Increment transient error count for PA
            end
            deactivate PA
            FC-->>C: Response (Success or Error)
        else Provider A is Unhealthy
            Note right of FC: Provider A marked Unhealthy
            FC->>FC: Try connecting to next provider (Provider B)
            alt Connection to Provider B Successful
                FC->>PB: Set Provider B as Active
                FC->>PB: Forward Original Request
                activate PB
                PB-->>FC: Success
                deactivate PB
                FC-->>C: Response (Success)
            else Connection to Provider B Fails
                 Note right of FC: No healthy providers left!
                 FC-->>C: Error (All providers unhealthy)
            end
        end
        deactivate FC
    

Note
We don't cycle back through the available providers; instead we rely on Kubernetes to restart the service once all endpoints are exhausted. I opted for this because of the background task alloy runs to keep websocket connections alive -- it can sometimes be dropped, leaving the connection unable to recover until the app restarts.

Notable Changes

  1. Failover Client (FailoverClient)

    • Found in packages/adapters/eth/src/failover.rs (main logic in failover/client.rs).
    • Manages a list of providers and tracks which one is active.
    • When a request fails or a provider is marked unhealthy, it looks for the next healthy one.
  2. Refactored WebSocket Adapter

    • The WebSocket logic is now in packages/adapters/eth/src/websocket/.
    • The old websocket.rs has been merged with the old connection.rs.
  3. Health Tracking

    • We removed the old HealthTrackingMiddleware.
    • All health checks now happen inside the FailoverClient (execute_operation and get_healthy_provider).
    • Health is tracked via (see the sketch after this list):
      • Permanent failures (fatal errors)
      • Consecutive transient errors (reset on success)
      • Mempool drop counts (tracked by note_mempool_drop)
  4. Mempool Drop Reporting

    • StateListener in packages/services/src/state_listener.rs calls l1_adapter.note_mempool_drop() when it suspects a transaction is lost.
    • This increments a counter for the currently active provider, possibly marking it unhealthy.
  5. Async Health Checks

    • The HealthCheck trait in packages/metrics uses async fn healthy().
  6. Network Error Metric

    • eth_network_errors is now a counter labeled by the provider name.
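
As a rough illustration of items 3 and 4, the sketch below models the per-provider counters with plain Rust types. The field names, thresholds, and the windowed mempool-drop bookkeeping are assumptions for illustration only; the real values come from the ETH_FAILOVER_* settings described under Deployment Considerations.

    // Illustrative sketch of the health-tracking state; not the real types.
    use std::time::{Duration, Instant};

    struct HealthTracker {
        permanently_failed: bool,
        consecutive_transient_errors: u32,
        transient_error_threshold: u32,
        mempool_drops: Vec<Instant>,
        mempool_drop_threshold: usize,
        mempool_drop_window: Duration,
    }

    impl HealthTracker {
        fn note_success(&mut self) {
            // Transient errors are counted consecutively, so any success resets them.
            self.consecutive_transient_errors = 0;
        }

        fn note_transient_error(&mut self) {
            self.consecutive_transient_errors += 1;
        }

        fn note_mempool_drop(&mut self) {
            // Only drops inside the configured window count towards the threshold.
            let now = Instant::now();
            let window = self.mempool_drop_window;
            self.mempool_drops.push(now);
            self.mempool_drops.retain(|t| now.duration_since(*t) <= window);
        }

        fn is_healthy(&self) -> bool {
            !self.permanently_failed
                && self.consecutive_transient_errors < self.transient_error_threshold
                && self.mempool_drops.len() < self.mempool_drop_threshold
        }
    }

    fn main() {
        let mut tracker = HealthTracker {
            permanently_failed: false,
            consecutive_transient_errors: 0,
            transient_error_threshold: 3,
            mempool_drops: Vec::new(),
            mempool_drop_threshold: 2,
            mempool_drop_window: Duration::from_secs(600),
        };

        tracker.note_mempool_drop();
        assert!(tracker.is_healthy());
        tracker.note_mempool_drop();
        // The second drop inside the window trips the threshold.
        assert!(!tracker.is_healthy());
    }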

Deployment Considerations

  1. Configuration:

    • You'll need multiple WebSocket URLs (e.g., via ETH_ENDPOINTS set to [{"name":"main","url":"wss://ethereum.example.com"}, {"name":"backup","url":"wss://backup.example.com"}]); a parsing sketch follows after this list.
    • New environment variables for thresholds:
      • ETH_FAILOVER_TRANSIENT_ERROR_THRESHOLD
      • ETH_FAILOVER_MEMPOOL_DROP_THRESHOLD
      • ETH_FAILOVER_MEMPOOL_DROP_WINDOW
  2. Metrics Updates:

    • Added:
      • eth_mempool_drops (counter, labeled by provider)
    • Changed:
      • eth_network_errors now increments inside the FailoverClient, labeled by the provider name.
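
For reference, here is a small sketch of parsing the ETH_ENDPOINTS value shown in item 1. The Endpoint struct only mirrors the documented JSON shape; the committer's actual parse_endpoints deserializer into NonEmpty<Endpoint> is not reproduced here.

    // Sketch only: assumes the serde and serde_json crates are available.
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct Endpoint {
        name: String,
        url: String,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Fall back to the documented example when the variable is unset.
        let raw = std::env::var("ETH_ENDPOINTS").unwrap_or_else(|_| {
            r#"[{"name":"main","url":"wss://ethereum.example.com"},
                {"name":"backup","url":"wss://backup.example.com"}]"#
                .to_string()
        });

        let endpoints: Vec<Endpoint> = serde_json::from_str(&raw)?;
        // NonEmpty<Endpoint> enforces this at the type level in the real config.
        assert!(!endpoints.is_empty(), "at least one endpoint is required");

        for endpoint in &endpoints {
            println!("{} -> {}", endpoint.name, endpoint.url);
        }
        Ok(())
    }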

Action Required

  • Deployment Configuration:
    • Configure multiple eth endpoints and set the new environment variables for thresholds.
  • Monitoring Dashboards/Alerts:
    • Update dashboards and alert rules to account for the new provider_name labels and the new eth_mempool_drops metric (see the counter sketch below).
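
A small sketch (assuming the prometheus crate) of what the per-provider labeling looks like from the dashboard side: each provider name becomes its own time series under the same metric. The registration details in the committer may differ.

    use prometheus::{IntCounterVec, Opts, Registry};

    fn main() -> Result<(), prometheus::Error> {
        let registry = Registry::new();

        // Metric and label names match the ones listed above.
        let mempool_drops = IntCounterVec::new(
            Opts::new("eth_mempool_drops", "Transactions dropped from the mempool"),
            &["provider_name"],
        )?;
        registry.register(Box::new(mempool_drops.clone()))?;

        // Each provider gets its own labeled series, which is what dashboards
        // and alerts need to account for.
        mempool_drops.with_label_values(&["main"]).inc();
        mempool_drops.with_label_values(&["backup"]).inc();

        Ok(())
    }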

@segfault-magnet self-assigned this Apr 8, 2025
@segfault-magnet marked this pull request as ready for review April 8, 2025 13:34
@segfault-magnet requested a review from a team as a code owner April 8, 2025 13:34
@segfault-magnet added the big and enhancement (New feature or request) labels Apr 8, 2025
@AurelienFT left a comment

It's my first look at this repository, so I don't have the full picture, but making everything use sync primitives only for the HealthChecker and the send operation seems like a lot of overhead -- maybe there's a part I'm missing or constraints I don't understand. Overall, I enjoyed reviewing this code: it is well organized and provides a good customized solution to failovers.

@AurelienFT left a comment

Nice improvement to the reliability of the code; I really enjoyed reviewing this. Nice one!

@netrome left a comment

Some opinions and suggestions from my side; otherwise it looks good to me. The only strong opinion I have is about the naming of operation_factory. Apologies for taking so long with the review.

self.app.tx_fees.min_reward_perc,
self.app.tx_fees.max_reward_perc,
)
.expect("already validated via `validate` in main"),

Opinion: Ideally we'd leverage the type system to guarantee this property, as validation logic is prone to change over time. While I think this is fine for now, in similar situations I'd consider defining a ValidatedConfig struct that holds the converted types directly and have validate convert to this struct.
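
A sketch of the ValidatedConfig idea from the comment above, using illustrative names (RawTxFees / ValidatedTxFees are not types in this repo): validate converts the raw config into a struct that can only hold values that passed the checks, so downstream code no longer needs an .expect.

    struct RawTxFees {
        min_reward_perc: f64,
        max_reward_perc: f64,
    }

    /// Holds only values that have already passed validation.
    struct ValidatedTxFees {
        min_reward_perc: f64,
        max_reward_perc: f64,
    }

    impl RawTxFees {
        fn validate(self) -> Result<ValidatedTxFees, String> {
            if self.min_reward_perc > self.max_reward_perc {
                return Err("min_reward_perc must not exceed max_reward_perc".into());
            }
            Ok(ValidatedTxFees {
                min_reward_perc: self.min_reward_perc,
                max_reward_perc: self.max_reward_perc,
            })
        }
    }

    fn main() {
        // Validation happens once, at startup; everything downstream receives
        // a ValidatedTxFees and can rely on the invariant without re-checking.
        let fees = RawTxFees { min_reward_perc: 1.0, max_reward_perc: 5.0 }
            .validate()
            .expect("validated once, at startup");
        assert!(fees.min_reward_perc <= fees.max_reward_perc);
    }
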

/// Multiple Ethereum RPC endpoints as a JSON array.
/// Format: '[{"name":"main","url":"wss://ethereum.example.com"}, {"name":"backup","url":"wss://backup.example.com"}]'
#[serde(deserialize_with = "parse_endpoints")]
pub endpoints: NonEmpty<Endpoint>,

Nice use of NonEmpty 🤩

Comment on lines +177 to +191
    async fn execute_operation<F, Fut, T>(&self, operation_factory: F) -> EthResult<T>
    where
        F: FnOnce(Arc<I::Provider>) -> Fut + Send + Sync,
        Fut: std::future::Future<Output = EthResult<T>> + Send,
        T: Send,
    {
        let provider = self.get_healthy_provider().await?;
        let provider_name = provider.name.as_str();

        debug!("Executing operation on provider '{}'", provider_name);

        let result = operation_factory(Arc::clone(&provider.provider)).await;

        match result {
            Ok(value) => {

If operation_factory was a factory, I'd expect the output of the function to be an operation, not the result of calling an operation. It seems like this is just the operation, and hence I think we should rename it.

Suggested change (rename operation_factory to operation):

Before:

    async fn execute_operation<F, Fut, T>(&self, operation_factory: F) -> EthResult<T>
    where
        F: FnOnce(Arc<I::Provider>) -> Fut + Send + Sync,
        Fut: std::future::Future<Output = EthResult<T>> + Send,
        T: Send,
    {
        let provider = self.get_healthy_provider().await?;
        let provider_name = provider.name.as_str();
        debug!("Executing operation on provider '{}'", provider_name);
        let result = operation_factory(Arc::clone(&provider.provider)).await;
        match result {
            Ok(value) => {

After:

    async fn execute_operation<F, Fut, T>(&self, operation: F) -> EthResult<T>
    where
        F: FnOnce(Arc<I::Provider>) -> Fut + Send + Sync,
        Fut: std::future::Future<Output = EthResult<T>> + Send,
        T: Send,
    {
        let provider = self.get_healthy_provider().await?;
        let provider_name = provider.name.as_str();
        debug!("Executing operation on provider '{}'", provider_name);
        let result = operation(Arc::clone(&provider.provider)).await;
        match result {
            Ok(value) => {

Comment on lines +37 to +57
    pub type WsProvider = alloy::providers::fillers::FillProvider<
        alloy::providers::fillers::JoinFill<
            alloy::providers::fillers::JoinFill<
                alloy::providers::Identity,
                alloy::providers::fillers::JoinFill<
                    alloy::providers::fillers::GasFiller,
                    alloy::providers::fillers::JoinFill<
                        alloy::providers::fillers::BlobGasFiller,
                        alloy::providers::fillers::JoinFill<
                            alloy::providers::fillers::NonceFiller,
                            alloy::providers::fillers::ChainIdFiller,
                        >,
                    >,
                >,
            >,
            alloy::providers::fillers::WalletFiller<EthereumWallet>,
        >,
        alloy::providers::RootProvider<alloy::pubsub::PubSubFrontend>,
        alloy::pubsub::PubSubFrontend,
        Ethereum,
    >;

Ugh... This will render horribly in error messages and any place that isn't aware of the type alias. Perhaps we should define a newtype over it to make it more maintainable? Or any other ideas on how to manage this behemoth of a type signature?
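
One way to address this, sketched below, is a thin newtype over the alias so only one module ever spells out the full type. The Ws name is hypothetical, and WsProvider here is a stand-in unit type because the real alias is too long to repeat; only the wrapping pattern is the point.

    use std::ops::Deref;

    // Stand-in for the real alias quoted above.
    type WsProvider = ();

    // Newtype wrapper: error messages and call sites only ever see `Ws`.
    pub struct Ws(WsProvider);

    impl Deref for Ws {
        type Target = WsProvider;

        fn deref(&self) -> &Self::Target {
            &self.0
        }
    }

    fn main() {
        let ws = Ws(());
        // Deref coercion keeps call sites terse while hiding the full type.
        let _inner: &WsProvider = &ws;
    }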
