
Issue in inference_s2s_batch.sh #218

@Lalaramarya

Description


Thank you for your help in resolving the earlier issues! However, I'm now facing a new problem during inference:

Generating: 0%| | 0/3000 [00:00<?, ?it/s]We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Generating: 16%|████████████████████████▊ | 469/3000 [00:24<02:12, 19.07it/s]
[2025-03-31 20:48:37][root][INFO] - LLM Inference Time: 25.14s
Error executing job with overrides: ['++model_config.llm_name=qwen2-0.5b', '++model_config.llm_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5B', '++model_config.llm_dim=896', '++model_config.encoder_name=whisper', '++model_config.encoder_projector_ds_rate=5', '++model_config.encoder_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/small.pt', '++model_config.encoder_dim=768', '++model_config.encoder_projector=linear', '++model_config.codec_decoder_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/pretrained_models/CosyVoice-300M-SFT', '++model_config.codec_decode=true', '++model_config.vocab_config.code_layer=3', '++model_config.vocab_config.total_audio_vocabsize=4160', '++model_config.vocab_config.total_vocabsize=156160', '++model_config.code_type=CosyVoice', '++model_config.codec_decoder_type=CosyVoice', '++model_config.group_decode=true', '++model_config.group_decode_adapter_type=linear', '++dataset_config.dataset=speech_dataset_s2s', '++dataset_config.val_data_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl', '++dataset_config.train_data_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl', '++dataset_config.input_type=mel', '++dataset_config.mel_size=80', '++dataset_config.inference_mode=true', '++dataset_config.manifest_format=jsonl', '++dataset_config.split_size=0.002', '++dataset_config.load_from_cache_file=false', '++dataset_config.task_type=s2s', '++dataset_config.seed=777', '++dataset_config.vocab_config.code_layer=3', '++dataset_config.vocab_config.total_audio_vocabsize=4160', '++dataset_config.vocab_config.total_vocabsize=156160', '++dataset_config.code_type=CosyVoice', '++dataset_config.num_latency_tokens=0', '++dataset_config.do_layershift=false', '++train_config.model_name=s2s', '++train_config.freeze_encoder=true', '++train_config.freeze_llm=true', '++train_config.freeze_encoder_projector=true', '++train_config.freeze_group_decode_adapter=true', '++train_config.batching_strategy=custom', '++train_config.num_epochs=1', '++train_config.val_batch_size=1', '++train_config.num_workers_dataloader=2', '++train_config.task_type=s2s', '++decode_config.text_repetition_penalty=1.2', '++decode_config.audio_repetition_penalty=1.2', '++decode_config.max_new_tokens=3000', '++decode_config.task_type=s2s', '++decode_config.do_sample=false', '++decode_config.top_p=1.0', '++decode_config.top_k=0', '++decode_config.temperature=1.0', '++decode_config.decode_text_only=false', '++decode_config.do_layershift=false', '++decode_log=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English-20250201T121121Z-002/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English/s2s_decode__trp1.2_arp1.2_seed777_greedy', '++decode_config.num_latency_tokens=0', '++ckpt_path=/DATA/Lalaram/SLAM_omni/SLAM-LLM/models/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English-20250201T121121Z-002/Qwen2-0.5b-whisper_small-latency0-group3-single-round-English/model.pt', '++output_text_only=false', '++inference_online=false', '++speech_sample_rate=22050', '++audio_prompt_path=/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/audio_prompt/en/prompt_3.wav']
Traceback (most recent call last):
File "/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/inference_s2s.py", line 102, in main_hydra
batch_inference(cfg)
File "/DATA/Lalaram/SLAM_omni_Jsn/SLAM-LLM/examples/s2s/generate/generate_s2s_batch.py", line 176, in main
q.write(key + "\t" + source_text + "\n")
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
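
The traceback points at generate_s2s_batch.py, line 176, where q.write(key + "\t" + source_text + "\n") fails because source_text is None for at least one sample in batch mode. A minimal defensive sketch of that write is shown below; the variable names come from the traceback, and treating a missing transcript as an empty string is an assumption for illustration, not the repository's intended fix:

    # Guard against samples whose source text is missing in the manifest.
    # Writing an empty field keeps the decode log aligned instead of crashing.
    text = source_text if source_text is not None else ""
    q.write(key + "\t" + text + "\n")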

I hit this error when running inference_s2s_batch.sh with both the pre-trained and the fine-tuned model. However, when I load the pre-trained model with inference_s2s_online.sh, it generates both the target text and the audio successfully. Please look into this.
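
Since the online script works but the batch script does not, one likely difference is the jsonl manifest that batch inference reads (dataset_config.val_data_path points at dev_manifest.jsonl). A quick sanity check is to scan the manifest for entries whose text field is missing or null. The field names source_text and key below are assumptions based on the traceback; adjust them to whatever keys the manifest actually uses:

    import json

    manifest = "/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/dev_manifest.jsonl"
    with open(manifest) as f:
        for i, line in enumerate(f):
            entry = json.loads(line)
            # Flag entries with no usable source text; these would make
            # source_text None during batch decoding.
            if not entry.get("source_text"):
                print(f"line {i}: missing or empty source_text -> {entry.get('key', '<no key>')}")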
