Describe the bug
Hello! I set device_map='balanced' and image generation takes about 2.5 minutes (I expected 12-20 seconds). pipe.hf_device_map shows the modules distributed like this:
{
"transformer": "cuda:0",
"text_encoder_2": "cuda:2",
"text_encoder": "cuda:0",
"vae": "cuda:1"
}
I have three RTX 3090 Ti 24 GB cards and I cannot get it to run properly on them.
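For reference, the 2.5-minute figure comes from a simple test along these lines (the prompt, step count, and timing code are illustrative placeholders, not my exact script):

import time
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    path_chkpt,                  # local Flux checkpoint path
    torch_dtype=torch.bfloat16,
    device_map='balanced',
)
print(pipe.hf_device_map)        # prints the mapping shown above

start = time.time()
image = pipe(
    'a placeholder prompt',      # illustrative prompt
    num_inference_steps=28,      # illustrative step count
).images[0]
print(f'generation took {time.time() - start:.1f} s')   # ~150 s instead of the expected 12-20 s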
I also tried assigning the modules to devices manually (a fuller sketch of this attempt follows these lines):
pipe.transformer.to('cuda:2')
pipe.text_encoder.to('cuda:2')
pipe.text_encoder_2.to('cuda:1')
pipe.vae.to('cuda:0')
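Spelled out, that manual attempt looks roughly like this (the final generation call is only to show how the pipeline is invoked; it is not my exact script):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(path_chkpt, torch_dtype=torch.bfloat16)

# manually spread the modules across the three cards
pipe.transformer.to('cuda:2')
pipe.text_encoder.to('cuda:2')
pipe.text_encoder_2.to('cuda:1')
pipe.vae.to('cuda:0')

# illustrative call; with the modules on different devices I am not sure
# this layout is even supported, which is part of my question
image = pipe('a placeholder prompt', num_inference_steps=28).images[0]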
What is the best way to launch the pipeline so that generation runs on the GPUs and is fast?
Reproduction
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    path_chkpt,
    torch_dtype=torch.bfloat16,
    device_map='balanced',
)
Logs
No response
System Info
Ubuntu 22.04, 3 GPUs: NVIDIA RTX 3090 Ti 24 GB
accelerate==0.30.1
addict==2.4.0
apscheduler==3.9.1
autocorrect==2.5.0
chardet==4.0.0
cryptography==37.0.2
curl_cffi
diffusers==0.30.0
beautifulsoup4==4.11.2
einops
facexlib>=0.2.5
fastapi==0.92.0
hidiffusion==0.1.6
invisible-watermark>=0.2.0
numpy==1.24.3
opencv-python==4.8.0.74
pandas==2.0.3
pycocotools==2.0.6
pymystem3==0.2.0
pyyaml==6.0
pyjwt==2.6.0
python-multipart==0.0.5
pytrends==4.9.1
psycopg2-binary
realesrgan==0.3.0
redis==4.5.1
sacremoses==0.0.53
selenium==4.2.0
sentencepiece==0.1.97
scipy==1.10.1
scikit-learn==0.24.1
supervision==0.16.0
tb-nightly==2.14.0a20230629
tensorboard>=2.13.0
tomesd
transformers==4.40.1
timm==0.9.16
yapf==0.32.0
uvicorn==0.20.0
spacy==3.7.2
nest_asyncio==1.5.8
httpx==0.25.0
torchvision==0.15.2
insightface==0.7.3
psutil==5.9.6
tk==0.1.0
customtkinter==5.2.1
tensorflow==2.13.0
opennsfw2==0.10.2
protobuf==4.24.4
gfpgan==1.3.8
Who can help?
No response