Feature Request - Batch Deployment for Historical Nodes #237

@aruraghuwanshi

Problem Statement
In many Apache Druid deployments, replicas are distributed across Availability Zones (AZs), with each AZ containing its own historical tier. In large Druid clusters, each tier therefore holds many historicals. Because every upgrade is applied as a rolling restart, a deployment takes hours to complete.
Since replicas exist across different tiers, more than one historical can be taken down and rolled out at the same time within a single historical tier.

Solution Overview
We have implemented a custom batch deployment feature for the Druid Operator that allows users to specify how many historicals can be taken down simultaneously during rolling updates, significantly reducing deployment time while maintaining data availability.

Key Features

  1. Configurable Batch Sizes
  • Specify percentage of pods to delete in parallel per historical tier
  • Percentage-based calculation scales automatically with cluster size
  2. Health-Aware Operations
  • Validates other historical tiers are healthy before proceeding, preventing datasource unavailability
  • Optional health check bypass for urgent deployments
  3. Persistent State Management
  • Tracks operations across reconciliation cycles
  • Prevents conflicting batch operations on the same StatefulSet
  4. Safe Deletion Strategy
  • Deletes highest ordinal pods first (StatefulSet best practice)
  • Waits for pod recreation and readiness before continuing
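
To make the batch-size and deletion-order ideas concrete, here is a minimal Go sketch. It is not the code from our implementation: the function names `batchSize`, `ordinal`, and `selectBatch`, the round-up behavior, and the pod names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// batchSize converts a percentage into a pod count for one tier.
// Assumption: round up, so any positive percentage of a non-empty
// tier yields at least one pod per batch.
func batchSize(total, percent int) int {
	if total <= 0 || percent <= 0 {
		return 0
	}
	if percent > 100 {
		percent = 100
	}
	return (total*percent + 99) / 100 // integer ceiling
}

// ordinal extracts the StatefulSet ordinal from a pod name such as
// "druid-historical-7". It returns -1 if the suffix is not a number.
func ordinal(podName string) int {
	i := strings.LastIndex(podName, "-")
	if i < 0 {
		return -1
	}
	n, err := strconv.Atoi(podName[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// selectBatch returns up to n pod names, highest ordinal first.
// It sorts a copy so the caller's slice is left untouched.
func selectBatch(pods []string, n int) []string {
	sorted := append([]string(nil), pods...)
	sort.Slice(sorted, func(a, b int) bool {
		return ordinal(sorted[a]) > ordinal(sorted[b])
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	// Hypothetical pod names for a single historical tier.
	pods := []string{
		"druid-historical-0",
		"druid-historical-2",
		"druid-historical-10",
		"druid-historical-1",
	}
	n := batchSize(len(pods), 50) // 50% of 4 pods
	fmt.Println(selectBatch(pods, n))
	// prints: [druid-historical-10 druid-historical-2]
}
```

Ordinals are compared numerically rather than as strings, so `druid-historical-10` sorts ahead of `druid-historical-2`. The health check against other tiers and the wait for pod readiness are not shown here.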

If the community is interested in this feature, we can open a PR to start the conversation.
