Mixed-precision training (--amp) and multi-GPU training do not work when training PPYOLOE_Plus_S on the COCO dataset #268

@zjykzj

Search before asking

  • I have searched the issues and found no similar bug report.

Bug Component

Training

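Describe the Bug

@nemonameless @jerrywgz This is strange: why does PaddleYOLO not support mixed-precision training or multi-GPU training? Is something wrong with my configuration, or with my training environment (a Docker container)?

Environment setup

I followed the official tutorial to train on the COCO dataset. First, I set up the Docker environment: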

docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:2.6.2-gpu-cuda11.2-cudnn8.2-trt8.0

Then I cloned the PaddleYOLO repository and installed its dependencies:

git clone https://github.com/PaddlePaddle/PaddleYOLO.git
cd PaddleYOLO
pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Next, I set up the COCO dataset:

(base) user@user-X11DAi-N:/data/zj/paddle/PaddleYOLO/dataset/coco$ ls
annotations  download_coco.py  train2017  val2017
(base) user@user-X11DAi-N:/data/zj/paddle/PaddleYOLO/dataset/coco$ pwd
/data/zj/paddle/PaddleYOLO/dataset/coco
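
Before looking at the training runs, it can help to rule out a dataset layout problem. The sketch below checks that the directories shown in the listing above exist under the dataset root. The annotation file names are an assumption (the standard COCO 2017 detection names); the listing only shows the `annotations` directory, not its contents. `missing_coco_entries` is a hypothetical helper, not part of PaddleYOLO.

```python
from pathlib import Path

# Expected entries relative to the dataset root. The directory names come from
# the listing above; the two JSON file names are assumed (standard COCO 2017).
EXPECTED = (
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
)


def missing_coco_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_coco_entries("dataset/coco")` from the repository root returns an empty list when every expected entry is present.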

I tried several ways of training PPYOLOE_PLUS_S on COCO.

Single GPU + mixed-precision training

# In the config file PaddleYOLO/configs/ppyoloe/_base_/ppyoloe_plus_reader.yml I set batch_size: 32 and left everything else unchanged
CUDA_VISIBLE_DEVICES=1 python3 tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml --eval --amp -o pretrain_weights=weights/ppyoloe_crn_s_obj365_pretrained.pdparams save_dir=output/ppyoloe_plus_crn_s_80e_coco_single_gpu

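After 50 epochs of training, the validation accuracy on the dataset is still 0. The screenshot below shows that the loss has not converged.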

Image
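
One thing worth checking in a run like this is whether the logged loss ever becomes NaN or Inf, which is a common symptom of float16 overflow under mixed precision. Whether that is what happens here is not established by the report. The sketch below scans saved log lines for a `loss:` field and summarizes what it finds. The log line shape (`... loss: 3.21 loss_cls: 1.02 ...`) is an assumption, and `summarize_losses` is a hypothetical helper.

```python
import math
import re

# Assumed log shape: "... loss: 3.21 loss_cls: 1.02 ...". Only the total
# "loss:" field is read; fields such as "loss_cls:" are ignored.
LOSS_RE = re.compile(r"\bloss:\s*([^\s,]+)")


def summarize_losses(lines):
    """Count logged loss values and how many of them are NaN or Inf."""
    values = []
    for line in lines:
        match = LOSS_RE.search(line)
        if match is None:
            continue
        try:
            values.append(float(match.group(1)))  # float() accepts "nan"/"inf"
        except ValueError:
            continue  # not a number; skip this line
    finite = [v for v in values if math.isfinite(v)]
    return {
        "count": len(values),
        "non_finite": len(values) - len(finite),
        "first": finite[0] if finite else None,
        "last": finite[-1] if finite else None,
    }
```

A non-zero `non_finite` count, or a `last` value no lower than `first`, would point toward a numerical problem rather than a data problem.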

Multi-GPU + mixed-precision training

I then tried enabling multi-GPU training and found that the process hangs right after the pretrained weights are loaded:

python -m paddle.distributed.launch --log_dir=./log_dir --gpus 0,1 tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml --eval --amp -o pretrain_weights=weights/ppyoloe_crn_s_obj365_pretrained.pdparams save_dir=output/ppyoloe_plus_crn_s_80e_coco_two_gpu

Image
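
This run changes two things at once (multi-GPU and `--amp`), so a hang here does not by itself say which one is responsible. A possible way to separate them is to repeat the launch without `--amp` and with `NCCL_DEBUG=INFO`, an NCCL environment variable that makes NCCL log its initialization. The sketch below only builds the environment and command; `build_debug_launch` is a hypothetical helper, and the idea that NCCL communication is involved in the hang is an assumption, not something the report confirms.

```python
import os
import sys


def build_debug_launch(gpus, train_args, log_dir="./log_dir"):
    """Build (env, cmd) for a multi-GPU launch with NCCL logging enabled."""
    env = dict(os.environ)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log its setup steps
    cmd = [
        sys.executable, "-m", "paddle.distributed.launch",
        f"--log_dir={log_dir}",
        "--gpus", ",".join(str(g) for g in gpus),
        "tools/train.py",
        *train_args,  # pass the same arguments as before, minus --amp
    ]
    return env, cmd


# Usage (not executed here):
#   import subprocess
#   env, cmd = build_debug_launch(
#       [0, 1],
#       ["-c", "configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml", "--eval"],
#   )
#   subprocess.run(cmd, env=env, check=True)
```

The per-process logs that `paddle.distributed.launch` writes under `--log_dir` are the place to look for the last message printed before the hang.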

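Single GPU + AMP disabled

Finally, I tried training without the --amp option. So far training looks normal: the loss is converging and the validation accuracy on the dataset is improving.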

CUDA_VISIBLE_DEVICES=1 python3 tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml --eval -o pretrain_weights=weights/ppyoloe_crn_s_obj365_pretrained.pdparams

Image

Environment

  • Docker: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:2.6.2-gpu-cuda11.2-cudnn8.2-trt8.0
λ user-X11DAi-N /workdir/paddle/PaddleYOLO/configs/ppyoloe/_base_ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

λ user-X11DAi-N /workdir/paddle/PaddleYOLO/configs/ppyoloe/_base_ export CUDA_VISIBLE_DEVICES=0
λ user-X11DAi-N /workdir/paddle/PaddleYOLO/configs/ppyoloe/_base_ python -c "import paddle; paddle.utils.run_check()"
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0703 05:48:20.698544 31995 program_interpreter.cc:212] New Executor is Running.
W0703 05:48:20.699072 31995 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.3, Runtime API Version: 11.2
W0703 05:48:20.703392 31995 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
I0703 05:48:24.718087 31995 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
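
The version numbers in the `run_check` output above can be pulled out programmatically, which makes them easier to compare with the Docker image tag. The sketch below parses the two warning lines shown in the log; `parse_run_check` is a hypothetical helper, and the regular expressions assume the exact wording of those two lines.

```python
import re

GPU_RE = re.compile(
    r"GPU Compute Capability: ([\d.]+), "
    r"Driver API Version: ([\d.]+), "
    r"Runtime API Version: ([\d.]+)"
)
CUDNN_RE = re.compile(r"cuDNN Version: (\d+(?:\.\d+)*)")


def parse_run_check(log_text):
    """Extract GPU and library versions from run_check log output."""
    info = {}
    gpu = GPU_RE.search(log_text)
    if gpu:
        info["compute_capability"] = gpu.group(1)
        info["driver_api"] = gpu.group(2)
        info["runtime_api"] = gpu.group(3)
    cudnn = CUDNN_RE.search(log_text)
    if cudnn:
        info["cudnn"] = cudnn.group(1)
    return info
```

On the log above this yields compute capability 8.6, driver API 12.3, runtime API 11.2, and cuDNN 8.1. The image tag says `cudnn8.2` while the log reports 8.1; whether that difference matters for this bug is unknown.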

Bug description confirmation

  • I confirm that the bug reproduction steps, a description of the code changes, and the environment information have been provided, and that the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!
