Open
Labels
type/bug: Something isn't working
Description
Lake Table Schema Change Job Permanently Stuck Due to Version Discontinuity in Shared-Data Mode
StarRocks version (Required)
- StarRocks 3.5.3
Steps to reproduce the behavior (Required)
- CREATE TABLE speechai.ods_server_log_test003 (Lake table with date partitioning in shared-data mode)
- INSERT INTO speechai.ods_server_log_test003 (heavy concurrent writes, producing version 10828)
- ALTER TABLE speechai.ods_server_log_test003 (schema change job starts, JobId: 509692)
- Continue heavy INSERT operations while the ALTER job reaches the FINISHED_REWRITING state
- Observe the ALTER job stuck in the FINISHED_REWRITING state indefinitely
- Attempt a new ALTER operation: ALTER TABLE speechai.ods_server_log_test003 COMPACT;
Expected behavior (Required)
- ALTER TABLE job should complete successfully or fail gracefully with retry capability
- Version sequence should be continuous without gaps
- System should handle concurrent transactions and ALTER operations without conflicts
- New ALTER operations should be allowed after previous job completes
Real behavior (Required)
- ALTER job stuck permanently in the FINISHED_REWRITING state
- Partition version discontinuity: VisibleVersion: 10828, NextVersion: 12830 (missing version 10829)
- Subsequent transactions fail with: partition.getVisibleVersion() + 1 != version.get(0) 395376 10828 10830
- New ALTER operations blocked: ERROR 1064 (HY000): A schema change operation is in progress
- Table state locked as SCHEMA_CHANGE, rendering the table completely unusable
- Issue persists across FE restarts
Error Logs
2025-10-29 15:03:41.795+08:00 ERROR (lake-publish-task-117|357) [PublishVersionDaemon.publishPartitionBatch():530] partition.getVisibleVersion() + 1 != version.get(0) 395376 10828 10830
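To triage many such log lines, the three trailing numbers can be pulled out and the missing versions computed. The sketch below is a hypothetical helper, not part of StarRocks. It assumes the numbers are, in order, the partition ID, the partition's visible version, and the first version in the publish batch. That reading is inferred from the values reported in this issue (PartitionId 395376, VisibleVersion 10828), not confirmed against the FE source.

```python
import re

# Matches the message body of the error shown above. The meaning assigned to
# the three trailing numbers is an assumption based on this report.
GAP_PATTERN = re.compile(
    r"partition\.getVisibleVersion\(\) \+ 1 != version\.get\(0\)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def parse_version_gap(message):
    """Extract the partition ID and versions from a log line.

    Returns None if the line does not contain the error message.
    """
    match = GAP_PATTERN.search(message)
    if match is None:
        return None
    partition_id, visible, first = (int(group) for group in match.groups())
    return {
        "partition_id": partition_id,
        "visible_version": visible,
        "first_batch_version": first,
        # Versions strictly between the visible version and the batch start.
        "missing_versions": list(range(visible + 1, first)),
    }


if __name__ == "__main__":
    line = (
        "partition.getVisibleVersion() + 1 != version.get(0) "
        "395376 10828 10830"
    )
    print(parse_version_gap(line))
```

For the log line in this report, the helper reports version 10829 as the only missing version for partition 395376.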
Root Cause
- Version management race condition in shared-data mode: ALTER job and concurrent transactions compete for database read locks
- Strict version continuity check: LakeTableSchemaChangeJob.readyToPublishVersion() requires commitVersion == partition.getVisibleVersion() + 1
- Lock contention: ALTER job cannot acquire lock in time to publish version 10829
- Version gap: Concurrent transactions skip version 10829 and publish version 10830, breaking version sequence
- Table state lock: Table state set to SCHEMA_CHANGE, blocking all subsequent ALTER operations
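The effect of the continuity rule can be illustrated with a toy model. This is a minimal sketch under stated assumptions: `Partition`, `try_publish`, and `drain` are hypothetical names for illustration only, and the model captures just the `commitVersion == visibleVersion + 1` condition quoted above, not the locking or job state machine in StarRocks.

```python
class Partition:
    """Toy stand-in for a partition's version state (not a StarRocks class)."""

    def __init__(self, visible_version):
        self.visible_version = visible_version


def try_publish(partition, commit_version):
    """Publish only if commit_version directly follows the visible version."""
    if commit_version != partition.visible_version + 1:
        return False
    partition.visible_version = commit_version
    return True


def drain(partition, pending_versions):
    """Try to publish pending versions in ascending order.

    Returns the versions that could not be published.
    """
    blocked = []
    for version in sorted(pending_versions):
        if not try_publish(partition, version):
            blocked.append(version)
    return blocked


if __name__ == "__main__":
    partition = Partition(visible_version=10828)

    # Version 10829 is never published, so every later version is rejected.
    print(drain(partition, [10830, 10831]), partition.visible_version)

    # Once 10829 is published, the queued versions go through.
    try_publish(partition, 10829)
    print(drain(partition, [10830, 10831]), partition.visible_version)
```

In this model a single unpublished version is enough to stall everything queued behind it, which matches the symptom reported here: the visible version stays at 10828 while publishing 10830 keeps failing.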
Impact
- Critical: Table completely unusable, all ALTER operations blocked
- High: Data ingestion failures, version publish errors
- Medium: Query performance degradation
Additional Information
- Architecture: Shared-data mode (compute-storage separation)
- Partition: PartitionId 395376, Size 65.7GB, Rows 177,415,796
- Version Gap: Missing version 10829 (sequence: 10828 → 12830)
- Persistence: Issue survives FE restarts due to serialized metadata