Description
Describe the bug
could not find arguments in the checkpoint ...
checkpoint version 3.0
WARNING:megatron.core.rerun_state_machine:RerunStateMachine disabled via CLI, ignoring machine state saved in checkpoint
successfully loaded checkpoint from /ssd4/nietianyu/workspace/ms-swift/model/Qwen3-Next-80B-A3B-Instruct-mcore [ t 1/1, p 1/1 ] at iteration 0
>> '--exit-on-missing-checkpoint' set ... exiting. <<
Your hardware and system info
The command used to resume training is:
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load ms-swift/model/Qwen3-Next-80B-A3B-Instruct-mcore \
    --dataset 'json_data/high_quality_data_0101_1022_265w_converted.json' \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 4 \
    --global_batch_size 32 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 2 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct \
    --save_interval 41488 \
    --max_length 1024 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --model_author swift \
    --model_name swift-robot \
    --finetune true \
    --adapter_load ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct/v34-20251103-115154/iter_0020700
Additional context
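I read the reference documentation but could not work out how to set up the command for resuming training. Setting finetune to false produces the same error. How can I resolve this?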
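
A possible check, offered as a sketch and not as a confirmed diagnosis: Megatron-LM looks for a tracker file named `latest_checkpointed_iteration.txt` in the directory it is asked to load, and prints the `--exit-on-missing-checkpoint` message when that file is absent. The path passed to `--adapter_load` above ends in `iter_0020700`, so the snippet below reports whether a path is a run directory (contains the tracker file) or an iteration subdirectory (its parent contains the tracker file). The helper name `describe_checkpoint_dir` is hypothetical, and whether ms-swift expects the run directory here is an assumption that should be verified against its documentation.

```python
import os
import re

# Tracker file name used by Megatron-LM checkpoints.
TRACKER = "latest_checkpointed_iteration.txt"


def describe_checkpoint_dir(path):
    """Hypothetical helper: classify `path` as 'run_dir', 'iter_dir', or 'unknown'."""
    path = os.path.normpath(path)
    # A run directory holds the tracker file directly.
    if os.path.isfile(os.path.join(path, TRACKER)):
        return "run_dir"
    # An iteration directory is named iter_<digits> and sits next to the tracker file.
    parent = os.path.dirname(path)
    is_iter_name = re.fullmatch(r"iter_\d+", os.path.basename(path)) is not None
    if is_iter_name and os.path.isfile(os.path.join(parent, TRACKER)):
        return "iter_dir"
    return "unknown"


if __name__ == "__main__":
    adapter_path = (
        "ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct/"
        "v34-20251103-115154/iter_0020700"
    )
    print(describe_checkpoint_dir(adapter_path))
```

If this reports `iter_dir`, one thing to try (again, an assumption) is passing the parent directory `v34-20251103-115154` to `--adapter_load` instead of the `iter_0020700` subdirectory.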