Error when resuming Megatron LoRA training of Qwen3-Next-80B-A3B-Instruct #6471

@xiaotianns

Description

Describe the bug

could not find arguments in the checkpoint ...
checkpoint version 3.0
WARNING:megatron.core.rerun_state_machine:RerunStateMachine disabled via CLI, ignoring machine state saved in checkpoint
successfully loaded checkpoint from /ssd4/nietianyu/workspace/ms-swift/model/Qwen3-Next-80B-A3B-Instruct-mcore [ t 1/1, p 1/1 ] at iteration 0
>> '--exit-on-missing-checkpoint' set ... exiting. <<

Your hardware and system info
The command used to resume training is:

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load ms-swift/model/Qwen3-Next-80B-A3B-Instruct-mcore \
    --dataset 'json_data/high_quality_data_0101_1022_265w_converted.json' \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 4 \
    --global_batch_size 32 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 2 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct \
    --save_interval 41488 \
    --max_length 1024 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --model_author swift \
    --model_name swift-robot \
    --finetune true \
    --adapter_load ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct/v34-20251103-115154/iter_0020700

Additional context
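I followed the documentation but could not work out how to set up the command for resuming training. Setting finetune to false produces the same error. How can I resolve this?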
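
One way to narrow this down, offered as a diagnostic sketch and not a confirmed fix: Megatron-LM normally locates a checkpoint through a tracker file named `latest_checkpointed_iteration.txt` in the directory passed as the load path, and the `'--exit-on-missing-checkpoint' set ... exiting.` message is printed when no checkpoint is found there. The helper below (`describe_checkpoint_dir` is a hypothetical name, not part of ms-swift or Megatron-LM) reports whether a candidate directory contains that tracker file and which `iter_*` subdirectories it holds. Whether `--adapter_load` expects the run directory or the `iter_*` directory is an assumption to verify against the ms-swift documentation.

```python
import os

# Assumption: Megatron-LM writes this tracker file into the directory that is
# later passed as the load path (the parent of the iter_XXXXXXX folders).
TRACKER_NAME = "latest_checkpointed_iteration.txt"


def describe_checkpoint_dir(path):
    """Return (has_tracker, iter_dirs) for a candidate load directory.

    has_tracker: True if the tracker file exists directly inside `path`.
    iter_dirs:   sorted names of `iter_*` subdirectories directly inside `path`.
    """
    has_tracker = os.path.isfile(os.path.join(path, TRACKER_NAME))
    iter_dirs = sorted(
        name
        for name in os.listdir(path)
        if name.startswith("iter_") and os.path.isdir(os.path.join(path, name))
    )
    return has_tracker, iter_dirs
```

Running it on both `ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct/v34-20251103-115154` and its `iter_0020700` subdirectory would show which of the two actually contains the tracker file, and therefore which one the loader is likely able to resolve.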
