
Error when generating restored images from degraded images during training #96

@jachinzhang1

Description

Hello author! Recently, while trying to train the model (universal-image-restoration/config/daclip-sde/train.py), I keep running into the following error:

(The output above this point is the normal log from training and from generating the restored images.)
100it [00:12,  8.26it/s]
15it [00:01,  8.22it/s][rank1]:[E122 13:30:37.927897011 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=79999, OpType=ALLREDUCE, NumelIn=268291, NumelOut=268291, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank1]:[E122 13:30:37.928802041 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 79999, last enqueued NCCL work: 80006, last completed NCCL work: 79998.
[rank2]:[E122 13:30:37.952670698 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=79999, OpType=ALLREDUCE, NumelIn=268291, NumelOut=268291, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank2]:[E122 13:30:37.954097610 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 79999, last enqueued NCCL work: 80006, last completed NCCL work: 79998.
[rank3]:[E122 13:30:37.963740104 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=79999, OpType=ALLREDUCE, NumelIn=268291, NumelOut=268291, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank3]:[E122 13:30:37.964974112 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 79999, last enqueued NCCL work: 80006, last completed NCCL work: 79998.
75it [00:09,  8.20it/s][rank3]:[E122 13:30:45.301246375 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 79999, last enqueued NCCL work: 80006, last completed NCCL work: 79998.
[rank3]:[E122 13:30:45.301273081 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E122 13:30:45.301281165 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E122 13:30:45.302693304 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=79999, OpType=ALLREDUCE, NumelIn=268291, NumelOut=268291, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538455419/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7457bd776f86 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x745769deea42 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x745769df5483 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x745769df786c in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7457be0dbbf4 in /home/ippl-2080/anaconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7457d7e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7457d7f26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E122 13:30:45.303081292 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 79999, last enqueued NCCL work: 80006, last completed NCCL work: 79998.
[rank1]:[E122 13:30:45.303109726 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E122 13:30:45.303117859 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E122 13:30:45.304396047 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=79999, OpType=ALLREDUCE, NumelIn=268291, NumelOut=268291, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538455419/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x789ca4b76f86 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x789c511eea42 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x789c511f5483 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x789c511f786c in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x789ca54dbbf4 in /home/ippl-2080/anaconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x789cbf294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x789cbf326850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E122 13:30:45.316481274 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 79999, last enqueued NCCL work: 80006, last completed NCCL work: 79998.
[rank2]:[E122 13:30:45.316502333 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E122 13:30:45.316524410 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E122 13:30:45.317771092 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=79999, OpType=ALLREDUCE, NumelIn=268291, NumelOut=268291, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538455419/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7154a2376f86 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x71544e9eea42 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71544e9f5483 in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x71544e9f786c in /home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7154a2cdbbf4 in /home/ippl-2080/anaconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7154bca94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7154bcb26850 in /lib/x86_64-linux-gnu/libc.so.6)

80it [00:09,  8.17it/s]W0122 13:30:45.695000 136990210230080 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1986990 closing signal SIGTERM
W0122 13:30:45.698000 136990210230080 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1986991 closing signal SIGTERM
W0122 13:30:45.699000 136990210230080 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1986993 closing signal SIGTERM
/home/ippl-2080/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 57 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/ippl-2080/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 57 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/ippl-2080/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 58 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0122 13:30:46.370000 136990210230080 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 2 (pid: 1986992) of binary: /home/ippl-2080/anaconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/distributed/launch.py", line 208, in <module>
    main()
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/typing_extensions.py", line 2636, in wrapper
    return arg(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/distributed/launch.py", line 204, in main
    launch(args)
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/distributed/launch.py", line 189, in launch
    run(args)
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ippl-2080/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-22_13:30:45
  host      : ippl-2080
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 1986992)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1986992
========================================================

The training setup is 4 NVIDIA 2080 GPUs with CUDA version 12.2. The task is deraining, using the RainTrainH dataset; the error above appears after 47 restored images have been generated successfully. Our preliminary judgment is that this is an inter-process communication problem. We also tried modifying the function that initializes distributed training, but the same error still occurs, and for now we have not been able to resolve it.
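
For reference: in the log, only ranks 1-3 hit the 600000 ms NCCL watchdog timeout (the 10-minute default) at an ALLREDUCE, which looks like the non-zero ranks are waiting on a collective while another rank is still busy generating the restored images. A minimal sketch of a possible workaround (not the repository's actual init code; the function name init_dist and the 60-minute value are just placeholders we picked) is to pass a longer timeout when creating the process group:

import datetime
import os

import torch
import torch.distributed as dist

def init_dist(backend: str = "nccl") -> None:
    # Sketch only, assuming torch.distributed.launch has already exported
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT / LOCAL_RANK, so the
    # default env:// initialization is used.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # The default NCCL collective timeout is 10 minutes (the 600000 ms seen
    # in the log above); a longer value keeps the waiting ranks alive while
    # one rank is still generating restored images.
    dist.init_process_group(
        backend=backend,
        timeout=datetime.timedelta(minutes=60),  # arbitrary longer value
    )

Setting NCCL_DEBUG=INFO before launching may also confirm which collective the waiting ranks are stuck in. A real fix would probably be to make every rank either run or skip the image-generation step together so no rank is left waiting at an all-reduce, but we have not verified that on this setup.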
