SDPA, FA2 vs. Eager attention implementations leading to different losses #41944

@jiosephlee

Description

System Info

Hi, this is using TRL, but it seems like a lower-level issue.

I'm training a variant of Qwen3 (Intern-S1-mini), but I'm not using the vision tower, so it's effectively Qwen3-8B. I've been fine-tuning and comparing different attention implementations, i.e. SDPA vs. Flash Attention 2. However, I've been getting strange results where the downstream test accuracy differs (FA2 is worse). Furthermore, the issue seems to be accentuated by gradient accumulation. I'm not sure of the best way to share a reproduction, as my current code wraps the HF Trainer for my own convenience.
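Roughly, the comparison I have in mind looks like the sketch below: load the same checkpoint once per attention implementation, push the same batch through a forward pass, and compare the losses. This is only a minimal illustration; the checkpoint id and batch text are placeholders, not my actual data, and FA2 requires flash-attn to be installed.

```python
# Minimal sketch: compare the forward loss of one batch under different
# attention implementations. Checkpoint id and text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # assumption: stand-in for the text tower of Intern-S1-mini
tok = AutoTokenizer.from_pretrained(model_id)
batch = tok(["Example training text."], return_tensors="pt").to("cuda")
labels = batch["input_ids"].clone()

losses = {}
for impl in ("sdpa", "flash_attention_2", "eager"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, attn_implementation=impl, torch_dtype=torch.bfloat16
    ).to("cuda").eval()
    with torch.no_grad():
        losses[impl] = model(**batch, labels=labels).loss.item()
    del model
    torch.cuda.empty_cache()

print(losses)  # tiny bf16-level differences are expected; large gaps are not
```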

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here are the current values of my config:

```json
"per_device_train_batch_size": 16,
"gradient_accumulation_steps": 2,
"optim": "paged_adamw_8bit",
"evaluation_strategy": "epoch",
"weight_decay": 0.1,
"gradient_checkpointing": true,
"use_liger_kernel": true,
"num_train_epochs": 1,
"learning_rate": 8e-05,
"lr_scheduler_type": "cosine",
"warmup_steps": 0,
"warmup_ratio": 0.1,
"report_to": "wandb",
"run_name": "finetune_Tox_internlm_Intern-S1-mini",
"logging_steps": 1,
"logging_strategy": "steps",
"save_strategy": "no",
"remove_unused_columns": false,
"seed": 42,
"completion_only_loss": false,
"dataset_text_field": "text",
"packing": false,
"padding_free": false,
"loss_type": "nll"
```

### Expected behavior

SDPA, FA2, and eager attention should yield comparable training losses and equal downstream test accuracy.
