1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English; otherwise, the issue will be closed.
Describe the bug
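Unhandled CUDA error during broadcast. On rank 0, dist.broadcast (called from sglang's broadcast_pyobj) fails with ncclUnhandledCudaError: Cuda failure 'invalid argument'. NCCL debug output and traceback: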
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac0491a0 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 44 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac0491a0 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac0063e0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0e600000 size 2097152 ipcDesc 0x7f2bac04a970
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04a940 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 45 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04a940 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac006458
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0e800000 size 2097152 ipcDesc 0x7f2bac04c110
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04c0e0 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 46 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04c0e0 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac0064d0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0ea00000 size 2097152 ipcDesc 0x7f2bac04d8b0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04d880 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 47 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04d880 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac006548
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0ec00000 size 2097152 ipcDesc 0x7f2bac04f050
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04f020 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:330 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:460 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport.cc:165 -> 1
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ProxyCall UDS comm 0x55b2e3c71350 rank 1 tpRank 0(b7a7bd7508f9845d) reqSize 8 respSize 0 respFd 0x7f2be0fd7ca8 opId 0xa3cc92762de46676
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO init.cc:1263 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO init.cc:1548 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
6f747114300646eb-5af0524bda8e4aee:4167:4167 [0] NCCL INFO group.cc:418 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:4167 [0] NCCL INFO init.cc:1929 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO proxyUDSRecvReq::ncclProxyMsgGetFd rank 1 opId 0xa3cc92762de46676 handle=0x564e1ead6750
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO UDS proxyGetFd received handle 0x564e1ead6750 peer 1 opId a3cc92762de46676
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] proxy.cc:1341 NCCL WARN Cuda failure 1 'invalid argument'
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/amlfs-01/home/jingwang/LEARNS/learn_verl_0512/verl/tests/workers/rollout/test_sglang_async_spmd.py", line 115, in <module>
[rank0]: test_sglang_spmd()
[rank0]: File "/mnt/amlfs-01/home/jingwang/LEARNS/learn_verl_0512/verl/tests/workers/rollout/test_sglang_async_spmd.py", line 97, in test_sglang_spmd
[rank0]: [outputs] = broadcast_pyobj(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 893, in broadcast_pyobj
[rank0]: dist.broadcast(tensor_size, src=src, group=dist_group)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
[rank0]:[W513 05:03:06.244537893 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0513 05:03:07.672000 4162 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4168 closing signal SIGTERM
E0513 05:03:08.292000 4162 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 4167) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test_sglang_async_spmd.py FAILED
Reproduction
Use nvcr.io/nvidia/pytorch:24.08-py3 as the base image.
cd verl/tests/workers/rollout
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes=1 --nproc_per_node=2 test_sglang_async_spmd.py
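With NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL the output is dominated by INFO lines, and the two WARN lines that carry the actual failure are easy to miss. The snippet below is a rough sketch of a filter that pulls out only the NCCL WARN lines. It is not part of verl, sglang, or NCCL; the names WARN_RE, extract_nccl_warnings, and SAMPLE are made up for illustration, and the regular expression assumes the line layout seen in the log above (host:pid:tid [rank] file:line NCCL WARN message).

```python
import re

# Assumed layout, taken from the log in this issue:
#   <host>:<pid>:<tid> [<rank>] <file>:<line> NCCL WARN <message>
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<rank>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def extract_nccl_warnings(lines):
    """Return one dict per NCCL WARN line, in the order they appear."""
    warnings = []
    for line in lines:
        match = WARN_RE.match(line.strip())
        if match is None:
            continue  # INFO lines, traceback lines, and anything else
        fields = match.groupdict()
        warnings.append(
            {
                "pid": int(fields["pid"]),
                "rank": int(fields["rank"]),
                "source": fields["src"],
                "message": fields["msg"],
            }
        )
    return warnings


# Three lines copied from the log in this issue (one INFO, two WARN).
SAMPLE = [
    "6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0",
    "6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'",
    "6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] proxy.cc:1341 NCCL WARN Cuda failure 1 'invalid argument'",
]

if __name__ == "__main__":
    for warning in extract_nccl_warnings(SAMPLE):
        print(warning["rank"], warning["source"], warning["message"])
```

To run it on a full log, redirect the torchrun output to a file and pass the open file object to extract_nccl_warnings. On the log in this report, both warnings come from rank 0 (pid 4167): one at transport/p2p.cc:275 and one at proxy.cc:1341.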
Environment