
Conversation

@segfault-magnet (Contributor) commented Apr 6, 2025

closes: #55

How the Failover Logic Works

The FailoverClient checks provider health whenever we're about to send a request. If the active provider is marked unhealthy, it automatically connects to the next one. If no providers are healthy, it returns an error. Here's the general flow:

  1. Initialization:

    • The client starts with a list of eth provider endpoints and picks the first available one as "active."
  2. When a Request Arrives:

    • The committer calls a method like get_block_number or submit on the FailoverClient.
  3. Health Check:

    • The active provider's status is examined based on:
      • Permanent Failures (fatal errors) -- things we cannot recover from,
        such as alloy permanently dropping the background task that keeps the
        websocket connection alive.
      • Transient Errors (recoverable network issues) -- e.g. network hiccups.
      • Mempool Drops (lost transactions) -- behavior we experienced with
        Alchemy, where transactions were continually squeezed out of the mempool.
    • If any of these counts exceed their configured thresholds, the provider is considered unhealthy.
  4. Forward or Switch:

    • If healthy: Forward the request, reset transient error counters on success, or increment them on failure.
    • If unhealthy: Find a new provider in the list and retry. If none are healthy, return an "all providers unhealthy" error.
  5. Mempool Drops:

    • If the system detects a lost transaction, note_mempool_drop is called. This bumps the failure counter for the active provider, potentially making it unhealthy for future requests.
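
To make step 4 ("Forward or Switch") concrete, here is a minimal, self-contained sketch of the selection logic. The Provider type, its is_healthy() predicate, and the String error are illustrative stand-ins for this write-up, not the actual FailoverClient API.

    // Illustrative sketch only -- not the real FailoverClient types.
    struct Provider {
        name: String,
        healthy: bool,
    }

    impl Provider {
        fn is_healthy(&self) -> bool {
            self.healthy
        }
    }

    /// Scan forward from the currently active endpoint and return the first
    /// healthy provider. We never cycle back to earlier endpoints; once the
    /// list is exhausted the caller surfaces an "all providers unhealthy"
    /// error (and, in production, Kubernetes restarts the service).
    fn get_healthy_provider(providers: &[Provider], active: usize) -> Result<&Provider, String> {
        providers[active..]
            .iter()
            .find(|p| p.is_healthy())
            .ok_or_else(|| "all providers unhealthy".to_string())
    }

    fn main() {
        let providers = vec![
            Provider { name: "main".into(), healthy: false },
            Provider { name: "backup".into(), healthy: true },
        ];

        // The active provider (index 0) is unhealthy, so the request fails
        // over to the standby.
        let chosen = get_healthy_provider(&providers, 0).expect("one healthy provider remains");
        assert_eq!(chosen.name, "backup");
    }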

Below are two diagrams to illustrate this:

  • High-Level:

    graph LR
        Committer --> FailoverClient;
        FailoverClient --> ProviderA["Provider A (Active)"];
        FailoverClient -.-> ProviderB["Provider B (Standby)"];
        FailoverClient -.-> ProviderC["Provider C (Standby)"];
        ProviderA --> L1NodeA["L1 Node A"];
        ProviderB --> L1NodeB["L1 Node B"];
        ProviderC --> L1NodeC["L1 Node C"];
    
        subgraph "L1 Providers"
            ProviderA
            ProviderB
            ProviderC
        end
    
  • On-Demand Failover Sequence:

    sequenceDiagram
        participant C as Committer
        participant FC as FailoverClient
        participant PA as Provider A
        participant PB as Provider B
    
        C->>FC: Send L1 Request (e.g., submit)
        activate FC
    
        FC->>FC: Check Health of Provider A
        alt Provider A is Healthy
            FC->>PA: Forward Request
            activate PA
            alt Request Succeeds
                PA-->>FC: Success
                FC->>FC: Reset transient error count for PA
            else Request Fails (Transient Error)
                PA-->>FC: Error
                FC->>FC: Increment transient error count for PA
            end
            deactivate PA
            FC-->>C: Response (Success or Error)
        else Provider A is Unhealthy
            Note right of FC: Provider A marked Unhealthy
            FC->>FC: Try connecting to next provider (Provider B)
            alt Connection to Provider B Successful
                FC->>PB: Set Provider B as Active
                FC->>PB: Forward Original Request
                activate PB
                PB-->>FC: Success
                deactivate PB
                FC-->>C: Response (Success)
            else Connection to Provider B Fails
                 Note right of FC: No healthy providers left!
                 FC-->>C: Error (All providers unhealthy)
            end
        end
        deactivate FC
    

Note
We don't cycle back through the available providers; instead we rely on Kubernetes to restart the service once all endpoints are exhausted. I opted for this because of the background task alloy runs to keep websocket connections alive -- it can sometimes be dropped, leaving the connection unable to recover until the app restarts.

Notable Changes

  1. Failover Client (FailoverClient)

    • Found in packages/adapters/eth/src/failover.rs (main logic in failover/client.rs).
    • Manages a list of providers and tracks which one is active.
    • When a request fails or a provider is marked unhealthy, it looks for the next healthy one.
  2. Refactored WebSocket Adapter

    • The WebSocket logic is now in packages/adapters/eth/src/websocket/.
    • The old websocket.rs has been merged with the old connection.rs.
  3. Health Tracking

    • We removed the old HealthTrackingMiddleware.
    • All health checks now happen inside the FailoverClient (execute_operation and get_healthy_provider).
    • Health is tracked via (see the sketch after this list):
      • Permanent failures (fatal errors)
      • Consecutive transient errors (reset on success)
      • Mempool drop counts (tracked by note_mempool_drop)
  4. Mempool Drop Reporting

    • StateListener in packages/services/src/state_listener.rs calls l1_adapter.note_mempool_drop() when it suspects a transaction is lost.
    • This increments a counter for the currently active provider, possibly marking it unhealthy.
  5. Async Health Checks

    • The HealthCheck trait in packages/metrics uses async fn healthy().
  6. Network Error Metric

    • eth_network_errors is now a counter labeled by the provider name.
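
As a rough illustration of items 3 and 4, the sketch below models the per-provider counters with plain Rust types. The field names, thresholds, and the windowed mempool-drop bookkeeping are assumptions for illustration only; the real values come from the ETH_FAILOVER_* settings described under Deployment Considerations.

    // Illustrative sketch of the health-tracking state; not the real types.
    use std::time::{Duration, Instant};

    struct HealthTracker {
        permanently_failed: bool,
        consecutive_transient_errors: u32,
        transient_error_threshold: u32,
        mempool_drops: Vec<Instant>,
        mempool_drop_threshold: usize,
        mempool_drop_window: Duration,
    }

    impl HealthTracker {
        fn note_success(&mut self) {
            // Transient errors are counted consecutively, so any success resets them.
            self.consecutive_transient_errors = 0;
        }

        fn note_transient_error(&mut self) {
            self.consecutive_transient_errors += 1;
        }

        fn note_mempool_drop(&mut self) {
            // Only drops inside the configured window count towards the threshold.
            let now = Instant::now();
            let window = self.mempool_drop_window;
            self.mempool_drops.push(now);
            self.mempool_drops.retain(|t| now.duration_since(*t) <= window);
        }

        fn is_healthy(&self) -> bool {
            !self.permanently_failed
                && self.consecutive_transient_errors < self.transient_error_threshold
                && self.mempool_drops.len() < self.mempool_drop_threshold
        }
    }

    fn main() {
        let mut tracker = HealthTracker {
            permanently_failed: false,
            consecutive_transient_errors: 0,
            transient_error_threshold: 3,
            mempool_drops: Vec::new(),
            mempool_drop_threshold: 2,
            mempool_drop_window: Duration::from_secs(600),
        };

        tracker.note_mempool_drop();
        assert!(tracker.is_healthy());
        tracker.note_mempool_drop();
        // The second drop inside the window trips the threshold.
        assert!(!tracker.is_healthy());
    }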

Deployment Considerations

  1. Configuration:

    • You'll need multiple WebSocket URLs (e.g., via ETH_ENDPOINTS set to [{"name":"main","url":"wss://ethereum.example.com"}, {"name":"backup","url":"wss://backup.example.com"}]); a parsing sketch follows after this list.
    • New environment variables for thresholds:
      • ETH_FAILOVER_TRANSIENT_ERROR_THRESHOLD
      • ETH_FAILOVER_MEMPOOL_DROP_THRESHOLD
      • ETH_FAILOVER_MEMPOOL_DROP_WINDOW
  2. Metrics Updates:

    • Added:
      • eth_mempool_drops (counter, labeled by provider)
    • Changed:
      • eth_network_errors now increments inside the FailoverClient, labeled by the provider name.
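
For reference, here is a small sketch of parsing the ETH_ENDPOINTS value shown in item 1. The Endpoint struct only mirrors the documented JSON shape; the committer's actual parse_endpoints deserializer into NonEmpty<Endpoint> is not reproduced here.

    // Sketch only: assumes the serde and serde_json crates are available.
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct Endpoint {
        name: String,
        url: String,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Fall back to the documented example when the variable is unset.
        let raw = std::env::var("ETH_ENDPOINTS").unwrap_or_else(|_| {
            r#"[{"name":"main","url":"wss://ethereum.example.com"},
                {"name":"backup","url":"wss://backup.example.com"}]"#
                .to_string()
        });

        let endpoints: Vec<Endpoint> = serde_json::from_str(&raw)?;
        // NonEmpty<Endpoint> enforces this at the type level in the real config.
        assert!(!endpoints.is_empty(), "at least one endpoint is required");

        for endpoint in &endpoints {
            println!("{} -> {}", endpoint.name, endpoint.url);
        }
        Ok(())
    }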

Action Required

  • Deployment Configuration:
    • Configure multiple eth endpoints and set the new environment variables for thresholds.
  • Monitoring Dashboards/Alerts:
    • Update dashboards and alert rules to account for the new provider_name labels and the new eth_mempool_drops metric (see the counter sketch below).
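
A small sketch (assuming the prometheus crate) of what the per-provider labeling looks like from the dashboard side: each provider name becomes its own time series under the same metric. The registration details in the committer may differ.

    use prometheus::{IntCounterVec, Opts, Registry};

    fn main() -> Result<(), prometheus::Error> {
        let registry = Registry::new();

        // Metric and label names match the ones listed above.
        let mempool_drops = IntCounterVec::new(
            Opts::new("eth_mempool_drops", "Transactions dropped from the mempool"),
            &["provider_name"],
        )?;
        registry.register(Box::new(mempool_drops.clone()))?;

        // Each provider gets its own labeled series, which is what dashboards
        // and alerts need to account for.
        mempool_drops.with_label_values(&["main"]).inc();
        mempool_drops.with_label_values(&["backup"]).inc();

        Ok(())
    }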

@segfault-magnet self-assigned this Apr 8, 2025
@segfault-magnet marked this pull request as ready for review April 8, 2025 13:34
@segfault-magnet requested a review from a team as a code owner April 8, 2025 13:34
@segfault-magnet added the big and enhancement (New feature or request) labels Apr 8, 2025
@AurelienFT left a comment

It's my first look at this repository, so I don't have the full picture, but making everything use sync primitives only for the HealthChecker and the send operation seems like a lot of overhead -- maybe there's a part I'm missing or constraints I don't understand. Overall, I enjoyed reviewing this code: it is well organized and provides a good customized solution to failovers.

@AurelienFT left a comment

Nice improvement to the reliability of the code; I really enjoyed reviewing this. Nice one!

@netrome left a comment

Some opinions and suggestions from my side; otherwise it looks good to me. The only strong opinion I have is about the naming of operation_factory. Apologies for taking so long with the review.

self.app.tx_fees.min_reward_perc,
self.app.tx_fees.max_reward_perc,
)
.expect("already validated via `validate` in main"),

Opinion: Ideally we'd leverage the type system to guarantee this property, as validation logic is prone to change over time. While I think this is fine for now, in similar situations I'd consider defining a ValidatedConfig struct that holds the converted types directly and have validate convert to this struct.
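
A sketch of the ValidatedConfig idea from the comment above, using illustrative names (RawTxFees / ValidatedTxFees are not types in this repo): validate converts the raw config into a struct that can only hold values that passed the checks, so downstream code no longer needs an .expect.

    struct RawTxFees {
        min_reward_perc: f64,
        max_reward_perc: f64,
    }

    /// Holds only values that have already passed validation.
    struct ValidatedTxFees {
        min_reward_perc: f64,
        max_reward_perc: f64,
    }

    impl RawTxFees {
        fn validate(self) -> Result<ValidatedTxFees, String> {
            if self.min_reward_perc > self.max_reward_perc {
                return Err("min_reward_perc must not exceed max_reward_perc".into());
            }
            Ok(ValidatedTxFees {
                min_reward_perc: self.min_reward_perc,
                max_reward_perc: self.max_reward_perc,
            })
        }
    }

    fn main() {
        // Validation happens once, at startup; everything downstream receives
        // a ValidatedTxFees and can rely on the invariant without re-checking.
        let fees = RawTxFees { min_reward_perc: 1.0, max_reward_perc: 5.0 }
            .validate()
            .expect("validated once, at startup");
        assert!(fees.min_reward_perc <= fees.max_reward_perc);
    }
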

/// Multiple Ethereum RPC endpoints as a JSON array.
/// Format: '[{"name":"main","url":"wss://ethereum.example.com"}, {"name":"backup","url":"wss://backup.example.com"}]'
#[serde(deserialize_with = "parse_endpoints")]
pub endpoints: NonEmpty<Endpoint>,

Nice use of NonEmpty 🤩

Comment on lines +177 to +191
    async fn execute_operation<F, Fut, T>(&self, operation_factory: F) -> EthResult<T>
    where
        F: FnOnce(Arc<I::Provider>) -> Fut + Send + Sync,
        Fut: std::future::Future<Output = EthResult<T>> + Send,
        T: Send,
    {
        let provider = self.get_healthy_provider().await?;
        let provider_name = provider.name.as_str();

        debug!("Executing operation on provider '{}'", provider_name);

        let result = operation_factory(Arc::clone(&provider.provider)).await;

        match result {
            Ok(value) => {

If operation_factory was a factory, I'd expect the output of the function to be an operation, not the result of calling an operation. It seems like this is just the operation, and hence I think we should rename it.

Suggested change (rename operation_factory to operation):

Before:

    async fn execute_operation<F, Fut, T>(&self, operation_factory: F) -> EthResult<T>
    where
        F: FnOnce(Arc<I::Provider>) -> Fut + Send + Sync,
        Fut: std::future::Future<Output = EthResult<T>> + Send,
        T: Send,
    {
        let provider = self.get_healthy_provider().await?;
        let provider_name = provider.name.as_str();
        debug!("Executing operation on provider '{}'", provider_name);
        let result = operation_factory(Arc::clone(&provider.provider)).await;
        match result {
            Ok(value) => {

After:

    async fn execute_operation<F, Fut, T>(&self, operation: F) -> EthResult<T>
    where
        F: FnOnce(Arc<I::Provider>) -> Fut + Send + Sync,
        Fut: std::future::Future<Output = EthResult<T>> + Send,
        T: Send,
    {
        let provider = self.get_healthy_provider().await?;
        let provider_name = provider.name.as_str();
        debug!("Executing operation on provider '{}'", provider_name);
        let result = operation(Arc::clone(&provider.provider)).await;
        match result {
            Ok(value) => {

Comment on lines +37 to +57
    pub type WsProvider = alloy::providers::fillers::FillProvider<
        alloy::providers::fillers::JoinFill<
            alloy::providers::fillers::JoinFill<
                alloy::providers::Identity,
                alloy::providers::fillers::JoinFill<
                    alloy::providers::fillers::GasFiller,
                    alloy::providers::fillers::JoinFill<
                        alloy::providers::fillers::BlobGasFiller,
                        alloy::providers::fillers::JoinFill<
                            alloy::providers::fillers::NonceFiller,
                            alloy::providers::fillers::ChainIdFiller,
                        >,
                    >,
                >,
            >,
            alloy::providers::fillers::WalletFiller<EthereumWallet>,
        >,
        alloy::providers::RootProvider<alloy::pubsub::PubSubFrontend>,
        alloy::pubsub::PubSubFrontend,
        Ethereum,
    >;

Ugh... This will render horribly in error messages and any place that isn't aware of the type alias. Perhaps we should define a newtype over it to make it more maintainable? Or any other ideas on how to manage this behemoth of a type signature?
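
One way to address this, sketched below, is a thin newtype over the alias so only one module ever spells out the full type. The Ws name is hypothetical, and WsProvider here is a stand-in unit type because the real alias is too long to repeat; only the wrapping pattern is the point.

    use std::ops::Deref;

    // Stand-in for the real alias quoted above.
    type WsProvider = ();

    // Newtype wrapper: error messages and call sites only ever see `Ws`.
    pub struct Ws(WsProvider);

    impl Deref for Ws {
        type Target = WsProvider;

        fn deref(&self) -> &Self::Target {
            &self.0
        }
    }

    fn main() {
        let ws = Ws(());
        // Deref coercion keeps call sites terse while hiding the full type.
        let _inner: &WsProvider = &ws;
    }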
