Open
Description
Search before asking
Bug Component
Training
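Describe the Bug
@nemonameless @jerrywgz Strangely, PaddleYOLO does not seem to support mixed-precision training or multi-GPU training. Is something wrong with my configuration, or with my training environment (a Docker container)?
Environment setup
Following the official tutorial to train on the COCO dataset, I first set up the Docker environment:
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:2.6.2-gpu-cuda11.2-cudnn8.2-trt8.0
Then I cloned the PaddleYOLO repository and installed its dependencies: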
git clone https://github.com/PaddlePaddle/PaddleYOLO.git
pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Next, I set up the COCO dataset:
(base) user@user-X11DAi-N:/data/zj/paddle/PaddleYOLO/dataset/coco$ ls
annotations download_coco.py train2017 val2017
(base) user@user-X11DAi-N:/data/zj/paddle/PaddleYOLO/dataset/coco$ pwd
/data/zj/paddle/PaddleYOLO/dataset/coco
I tried several ways of training PPYOLOE_PLUS_S on COCO:
Single GPU + mixed-precision training
# I set batch_size: 32 in the config file PaddleYOLO/configs/ppyoloe/_base_/ppyoloe_plus_reader.yml and left everything else unchanged
CUDA_VISIBLE_DEVICES=1 python3 tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml --eval --amp -o pretrain_weights=weights/ppyoloe_crn_s_obj365_pretrained.pdparams save_dir=output/ppyoloe_plus_crn_s_80e_coco_single_gpu
After 50 epochs of training, the validation accuracy was still 0. The training-log screenshot shows that the loss did not converge.
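One possible explanation, not confirmed for this run, is the limited numeric range of float16: a loss that fails to converge only when --amp is enabled is sometimes caused by gradients underflowing to zero or activations overflowing to infinity. The standard-library sketch below only illustrates that effect and how a loss scale counteracts it. The helper name to_fp16, the gradient value 1e-8 and the scale 1024 are illustrative assumptions and are not taken from PaddleYOLO.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision.

    struct raises OverflowError for values beyond the float16 range,
    which is mapped to +/-inf here to mimic hardware behaviour.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


tiny_grad = 1e-8      # hypothetical gradient magnitude
loss_scale = 1024.0   # hypothetical loss-scaling factor

# Stored directly in float16, the gradient is lost.
unscaled = to_fp16(tiny_grad)  # 0.0

# Scaled before the cast and unscaled afterwards, it survives.
rescued = to_fp16(tiny_grad * loss_scale) / loss_scale

print("unscaled:", unscaled)
print("rescued: ", rescued)
print("overflow:", to_fp16(70000.0))  # inf, above the float16 maximum of 65504
```

If this is what is happening, the training log would typically show the loss turning into nan or inf, or staying flat from the first iterations, so that is worth checking before changing any configuration.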
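Multi-GPU + mixed-precision training
Next, I tried training on multiple GPUs and found that the run stalled right after the pretrained weights were loaded.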
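A stall like this is usually easier to localize from the per-process logs than from the console. The sketch below assumes that the launcher writes one file per process, named workerlog.N, under the directory given by --log_dir (./log_dir in the launch command that follows). The function name tail_worker_logs and the glob pattern are illustrative assumptions and are not part of PaddleYOLO.

```python
from collections import deque
from pathlib import Path


def tail_worker_logs(log_dir, pattern="workerlog.*", n=5):
    """Return the last n lines of every per-process log found in log_dir.

    The default pattern is an assumption about how the launcher names
    its log files; adjust it if the files are named differently.
    """
    tails = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as f:
            # deque with maxlen keeps only the final n lines in memory.
            tails[path.name] = [line.rstrip("\n") for line in deque(f, maxlen=n)]
    return tails


if __name__ == "__main__":
    for name, lines in tail_worker_logs("./log_dir").items():
        print(f"== {name} ==")
        print("\n".join(lines))
```

Comparing the last lines of each process shows whether all of them stop at the same point or whether one is waiting for another.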
python -m paddle.distributed.launch --log_dir=./log_dir --gpus 0,1 tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml --eval --amp -o pretrain_weights=weights/ppyoloe_crn_s_obj365_pretrained.pdparams save_dir=output/ppyoloe_plus_crn_s_80e_coco_two_gpu
Single GPU + AMP disabled
Finally, I trained without the --amp option. So far training looks normal: the loss converges and the validation accuracy improves.
CUDA_VISIBLE_DEVICES=1 python3 tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco_single_gpu.yml --eval -o pretrain_weights=weights/ppyoloe_crn_s_obj365_pretrained.pdparams
Environment
- Docker: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:2.6.2-gpu-cuda11.2-cudnn8.2-trt8.0
λ user-X11DAi-N /workdir/paddle/PaddleYOLO/configs/ppyoloe/_base_ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
λ user-X11DAi-N /workdir/paddle/PaddleYOLO/configs/ppyoloe/_base_ export CUDA_VISIBLE_DEVICES=0
λ user-X11DAi-N /workdir/paddle/PaddleYOLO/configs/ppyoloe/_base_ python -c "import paddle; paddle.utils.run_check()"
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0703 05:48:20.698544 31995 program_interpreter.cc:212] New Executor is Running.
W0703 05:48:20.699072 31995 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.3, Runtime API Version: 11.2
W0703 05:48:20.703392 31995 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
I0703 05:48:24.718087 31995 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
Bug description confirmation
- I confirm that the bug reproduction steps, code change instructions, and environment information have been provided, and the problem can be reproduced.
Are you willing to submit a PR?
- I'd like to help by submitting a PR!
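As a cross-check of the environment output above, the image tag advertises cudnn8.2 while run_check reports cuDNN Version: 8.1. Whether that difference has anything to do with the AMP or multi-GPU behaviour is unknown. The sketch below extracts the version fields from both so they can be compared side by side. The log lines are copied from this report, and the function names parse_run_check and parse_image_tag are illustrative assumptions.

```python
import re

# Two lines copied verbatim from the run_check output in this report.
RUN_CHECK_LOG = """\
W0703 05:48:20.699072 31995 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.3, Runtime API Version: 11.2
W0703 05:48:20.703392 31995 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
"""

IMAGE_TAG = "2.6.2-gpu-cuda11.2-cudnn8.2-trt8.0"

FIELDS = {
    "compute_capability": r"GPU Compute Capability: (\d+\.\d+)",
    "driver_api": r"Driver API Version: (\d+\.\d+)",
    "runtime_api": r"Runtime API Version: (\d+\.\d+)",
    "cudnn": r"cuDNN Version: (\d+\.\d+)",
}


def parse_run_check(text):
    """Collect the version fields that appear in run_check log text."""
    found = {}
    for key, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            found[key] = match.group(1)
    return found


def parse_image_tag(tag):
    """Extract the cuda, cudnn and trt versions encoded in the image tag."""
    return dict(re.findall(r"(cuda|cudnn|trt)(\d+(?:\.\d+)*)", tag))


reported = parse_run_check(RUN_CHECK_LOG)
advertised = parse_image_tag(IMAGE_TAG)
print("reported by run_check:", reported)
print("advertised by tag:    ", advertised)
print("cuDNN matches:", reported.get("cudnn") == advertised.get("cudnn"))
```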