SDPA, FA2 vs. Eager attention implementations leading to different losses #41944

@jiosephlee

Description

System Info

Hi, this is using TRL, but it seems like a lower-level issue.

I'm training a variant of Qwen3 (Intern-S1-mini), but I'm not using the vision tower, so it's effectively Qwen3-8B. I've been fine-tuning and comparing different attention implementations, i.e. SDPA vs. Flash Attention 2. However, I've been getting strange results where the downstream test accuracy differs (FA2 is worse). Furthermore, the issue seems to be accentuated by gradient accumulation. I'm not sure of the best way to share a reproduction, as my current code wraps the HF Trainer for my own convenience.
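Roughly, the comparison I have in mind looks like the sketch below: load the same checkpoint once per attention implementation, push the same batch through a forward pass, and compare the losses. This is only a minimal illustration; the checkpoint id and batch text are placeholders, not my actual data, and FA2 requires flash-attn to be installed.

```python
# Minimal sketch: compare the forward loss of one batch under different
# attention implementations. Checkpoint id and text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # assumption: stand-in for the text tower of Intern-S1-mini
tok = AutoTokenizer.from_pretrained(model_id)
batch = tok(["Example training text."], return_tensors="pt").to("cuda")
labels = batch["input_ids"].clone()

losses = {}
for impl in ("sdpa", "flash_attention_2", "eager"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, attn_implementation=impl, torch_dtype=torch.bfloat16
    ).to("cuda").eval()
    with torch.no_grad():
        losses[impl] = model(**batch, labels=labels).loss.item()
    del model
    torch.cuda.empty_cache()

print(losses)  # tiny bf16-level differences are expected; large gaps are not
```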

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here are the current values of my config:

```json
"per_device_train_batch_size": 16,
"gradient_accumulation_steps": 2,
"optim": "paged_adamw_8bit",
"evaluation_strategy": "epoch",
"weight_decay": 0.1,
"gradient_checkpointing": true,
"use_liger_kernel": true,
"num_train_epochs": 1,
"learning_rate": 8e-05,
"lr_scheduler_type": "cosine",
"warmup_steps": 0,
"warmup_ratio": 0.1,
"report_to": "wandb",
"run_name": "finetune_Tox_internlm_Intern-S1-mini",
"logging_steps": 1,
"logging_strategy": "steps",
"save_strategy": "no",
"remove_unused_columns": false,
"seed": 42,
"completion_only_loss": false,
"dataset_text_field": "text",
"packing": false,
"padding_free": false,
"loss_type": "nll"
```

### Expected behavior

SDPA, FA2, and eager attention should yield comparable training losses and equal downstream test accuracy.
