Lake Table Schema Change Job Permanently Stuck Due to Version Discontinuity in Shared-Data Mode #64757

@nancodex

StarRocks version (Required)

  • StarRocks 3.5.3

Steps to reproduce the behavior (Required)

  1. CREATE TABLE speechai.ods_server_log_test003 (Lake table with date partitioning in shared-data mode)
  2. INSERT INTO speechai.ods_server_log_test003 (heavy concurrent writes, producing version 10828)
  3. ALTER TABLE speechai.ods_server_log_test003 (schema change job starts, JobId: 509692)
  4. Continue heavy INSERT operations while the ALTER job reaches the FINISHED_REWRITING state
  5. Observe that the ALTER job stays in the FINISHED_REWRITING state indefinitely
  6. Attempt a new ALTER operation: ALTER TABLE speechai.ods_server_log_test003 COMPACT;

Expected behavior (Required)

  • ALTER TABLE job should complete successfully or fail gracefully with retry capability
  • Version sequence should be continuous without gaps
  • System should handle concurrent transactions and ALTER operations without conflicts
  • New ALTER operations should be allowed after previous job completes

Real behavior (Required)

  • ALTER job stuck permanently in FINISHED_REWRITING state
  • Partition version discontinuity: VisibleVersion: 10828, NextVersion: 12830 (missing version 10829)
  • Subsequent transactions fail with: partition.getVisibleVersion() + 1 != version.get(0) 395376 10828 10830
  • New ALTER operations blocked: ERROR 1064 (HY000): A schema change operation is in progress
  • Table state locked as SCHEMA_CHANGE, rendering table completely unusable
  • Issue persists across FE restarts

Error Logs

2025-10-29 15:03:41.795+08:00 ERROR (lake-publish-task-117|357) [PublishVersionDaemon.publishPartitionBatch():530]
partition.getVisibleVersion() + 1 != version.get(0) 395376 10828 10830
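
For triage, the three trailing numbers in this message can be pulled out and turned into a gap size. The sketch below is a hypothetical helper, not StarRocks code: the class name `PublishGapParser` and its methods are invented for illustration, and it assumes the numbers are, in order, the partition ID, the visible version, and the first version of the batch being published, which matches the PartitionId and VisibleVersion reported in this issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper; not part of StarRocks.
final class PublishGapParser {
    // Assumption: the message ends with "<partitionId> <visibleVersion> <firstBatchVersion>".
    private static final Pattern TAIL = Pattern.compile("(\\d+) (\\d+) (\\d+)\\s*$");

    /** Returns {partitionId, visibleVersion, firstBatchVersion}. */
    static long[] parse(String message) {
        Matcher m = TAIL.matcher(message);
        if (!m.find()) {
            throw new IllegalArgumentException("unexpected message format: " + message);
        }
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }

    /** Number of versions between the visible version and the first batch version. */
    static long missingVersions(long visibleVersion, long firstBatchVersion) {
        return firstBatchVersion - visibleVersion - 1;
    }

    public static void main(String[] args) {
        String msg = "partition.getVisibleVersion() + 1 != version.get(0) 395376 10828 10830";
        long[] parts = parse(msg);
        System.out.println("partition=" + parts[0]
                + " visible=" + parts[1]
                + " firstInBatch=" + parts[2]
                + " missing=" + missingVersions(parts[1], parts[2]));
    }
}
```

Applied to the log line above, this reading gives exactly one unpublished version between 10828 and 10830, which is the 10829 described in this report.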

Root Cause

  • Version management race condition in shared-data mode: ALTER job and concurrent transactions compete for database read locks
  • Strict version continuity check: LakeTableSchemaChangeJob.readyToPublishVersion() requires commitVersion == partition.getVisibleVersion() + 1
  • Lock contention: ALTER job cannot acquire lock in time to publish version 10829
  • Version gap: Concurrent transactions are assigned version 10830 and try to publish it while version 10829 is still unpublished, so the continuity check fails and the version sequence stays broken
  • Table state lock: Table state set to SCHEMA_CHANGE, blocking all subsequent ALTER operations
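
The interaction described in the bullets above can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the StarRocks implementation: `Partition`, `reserveVersion`, `publish`, and `simulate` are hypothetical names, and the only rule taken from this report is that a version can be published only when it equals the visible version plus one.

```java
// Hypothetical model of the strict continuity rule; not StarRocks code.
final class VersionGapSketch {
    /** Stand-in for a partition's version bookkeeping. */
    static final class Partition {
        private long visibleVersion;
        private long nextVersion;

        Partition(long visibleVersion) {
            this.visibleVersion = visibleVersion;
            this.nextVersion = visibleVersion + 1;
        }

        /** Hands out the next commit version to a job or transaction. */
        long reserveVersion() {
            return nextVersion++;
        }

        /** Publishes only if commitVersion == visibleVersion + 1. */
        boolean publish(long commitVersion) {
            if (commitVersion != visibleVersion + 1) {
                return false; // continuity check fails, nothing changes
            }
            visibleVersion = commitVersion;
            return true;
        }

        long getVisibleVersion() {
            return visibleVersion;
        }
    }

    /** Returns the visible version after the ALTER job and one load try to publish. */
    static long simulate(boolean alterJobPublishes) {
        Partition partition = new Partition(10828L);
        long alterVersion = partition.reserveVersion(); // 10829, held by the ALTER job
        long loadVersion = partition.reserveVersion();  // 10830, held by a concurrent load
        if (alterJobPublishes) {
            partition.publish(alterVersion);
        }
        partition.publish(loadVersion); // rejected while 10829 is unpublished
        return partition.getVisibleVersion();
    }

    public static void main(String[] args) {
        System.out.println("ALTER job publishes 10829: visible = " + simulate(true));
        System.out.println("ALTER job never publishes: visible = " + simulate(false));
    }
}
```

In this model, a single reserved-but-unpublished version is enough to hold the visible version at 10828 no matter how many later versions are ready, which is consistent with the stall reported here.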

Impact

  • Critical: Table completely unusable, all ALTER operations blocked
  • High: Data ingestion failures, version publish errors
  • Medium: Query performance degradation

Additional Information

  • Architecture: Shared-data mode (compute-storage separation)
  • Partition: PartitionId 395376, Size 65.7GB, Rows 177,415,796
  • Version Gap: Missing version 10829 (sequence: 10828 → 12830)
  • Persistence: Issue survives FE restarts due to serialized metadata

Labels: type/bug
