{'loss': 0.6854, 'grad_norm': 1.2066658735275269, 'learning_rate': 7.270549063924843e-06, 'epoch': 0.35}
35%|█████████████████████████████████████████████████████████████████▍ | 821/2346 [98:44:29<189:15:35, 446.78s/it]
35%|█████████████████████████████████████████████████████████████████▌ | 822/2346 [98:51:09<183:14:26, 432.85s/it]
{'loss': 0.6851, 'grad_norm': 0.9848567247390747, 'learning_rate': 7.264581581159024e-06, 'epoch': 0.35}
EE9999: Inner Error!
EE9999: 2025-02-01-16:16:28.694.058 The error from device(chipId:0, dieId:0), serial number is 4, event wait timeout occurred during task execution, stream_id:22, sq_id:22, task_id:10603, event_id=3395, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
TraceBack (most recent call last):
Task execute failed, device_id=0, stream_id=22, task_id=10603, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
**EL0004: 2025-01-28-17:13:23.993.343 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]**
The error from device(chipId:1, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:9295, event_id=3015, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
Task execute failed, device_id=1, stream_id=6, task_id=9295, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Traceback (most recent call last):
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
launch()
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
run_exp()
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 92, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 66, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2478, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 3610, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/accelerator.py", line 2239, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
self.engine.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1317, in __reduce_and_partition_ipg_grads
self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in partition_grads
cuda_grad_buffer = grad_buffer.to(grad_partition.device, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
EE9999: Inner Error!
EE9999: 2025-02-01-16:16:28.755.133 The error from device(chipId:3, dieId:0), serial number is 3, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:5388, event_id=4013, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
TraceBack (most recent call last):
Task execute failed, device_id=3, stream_id=6, task_id=5388, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
**EL0004: 2025-01-29-06:20:28.300.866 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
The error from device(chipId:5, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:8355, event_id=2510, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
Task execute failed, device_id=5, stream_id=6, task_id=8355, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]**
Traceback (most recent call last):
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
launch()
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
run_exp()
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 92, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 66, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2478, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 3610, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/accelerator.py", line 2239, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
self.engine.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1317, in __reduce_and_partition_ipg_grads
self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in partition_grads
cuda_grad_buffer = grad_buffer.to(grad_partition.device, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
**EL0004: 2025-01-31-08:23:58.845.175 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]**
The error from device(chipId:7, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:9035, event_id=1522, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
Task execute failed, device_id=7, stream_id=6, task_id=9035, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
**EL0004: 2025-01-28-17:25:09.561.058 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]**
The error from device(chipId:2, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:5930, event_id=4980, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
Task execute failed, device_id=2, stream_id=6, task_id=5930, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Traceback (most recent call last):
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
launch()
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
run_exp()
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 92, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 66, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2478, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 3610, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/accelerator.py", line 2239, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
self.engine.backward(loss, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1317, in __reduce_and_partition_ipg_grads
self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in partition_grads
cuda_grad_buffer = grad_buffer.to(grad_partition.device, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
System Info
Hardware: AICC, 8 × Ascend 910B1, CANN 8.0.RC2
Model: qwen2.5-32B-instruct
Fine-tuning type: full
Platform: Linux-notebook4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64
Python==3.9.18
llamafactory==0.9.2.dev0
torch==2.1.0
torch-npu==2.1.0.post3
transformers==4.46.1
datasets==3.1.0
accelerate==1.0.1
peft==0.12.0
trl==0.9.6
deepspeed==0.15.4
In addition, following the guide at https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/performance_tuning_0027.html, the flash-attn2 operator was replaced with the Ascend equivalent (this required modifying the relevant low-level transformers packages as well as src/llamafactory/model/model_utils/attention.py).
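For context, that substitution generally means routing the attention forward pass to the CANN fused-attention operator instead of flash-attn2. The sketch below only illustrates the idea and is not the exact patch from the guide: the wrapper name `npu_flash_attention` is hypothetical, and the full signature of `torch_npu.npu_fusion_attention` should be checked against the installed torch-npu/CANN release.

```python
# Illustrative sketch only: the kind of substitution described above, where the
# flash-attn2 forward call is replaced by the NPU fused-attention operator.
import torch_npu  # Ascend PyTorch adapter; provides npu_fusion_attention


def npu_flash_attention(query, key, value, atten_mask=None, dropout_p=0.0):
    """Hypothetical wrapper; query/key/value in [batch, seq, heads, head_dim] ("BSND") layout."""
    head_num = query.shape[2]
    scale = 1.0 / (query.shape[-1] ** 0.5)
    # npu_fusion_attention returns a tuple; the attention output is the first element.
    out = torch_npu.npu_fusion_attention(
        query, key, value, head_num,
        input_layout="BSND",
        atten_mask=atten_mask,
        scale=scale,
        keep_prob=1.0 - dropout_p,
    )[0]
    return out
```

In LLaMA-Factory a wrapper of this kind would be wired in through src/llamafactory/model/model_utils/attention.py, as noted above.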
Reproduction
The training configuration file is as follows.
Start training.
DeepSpeed-related logs.
After training for a while, device memory is exhausted (OOM).
Previously, when training with batch_size = 128, the same error occurred at roughly 18% of training. Judging from the logs, the resources used by earlier training batches do not seem to be released, and the accumulation eventually leads to OOM.
It would be helpful to get some suggestions for resolving this, for example manually freeing memory at fixed intervals. Thanks.
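As a minimal sketch of the "manually free memory at fixed intervals" idea requested above (an assumed workaround to try, not a confirmed fix for whatever is actually holding the memory), a transformers `TrainerCallback` could empty the NPU cache and log usage every N optimizer steps:

```python
# Sketch of periodically releasing cached device memory during training.
# Assumes torch-npu's CUDA-style memory APIs; verify against the installed version.
import gc

import torch_npu  # provides torch_npu.npu.empty_cache() and memory statistics
from transformers import TrainerCallback


class EmptyNpuCacheCallback(TrainerCallback):
    """Run gc and empty the NPU caching allocator every `interval` optimizer steps."""

    def __init__(self, interval: int = 50):
        self.interval = interval

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step > 0 and state.global_step % self.interval == 0:
            gc.collect()
            torch_npu.npu.empty_cache()
            # Log current and peak device memory so growth over time is visible.
            allocated = torch_npu.npu.memory_allocated() / 1024**3
            peak = torch_npu.npu.max_memory_allocated() / 1024**3
            print(f"[step {state.global_step}] NPU memory: {allocated:.1f} GiB allocated, {peak:.1f} GiB peak")
        return control
```

Such a callback could be passed through the `callbacks` argument visible in the traceback (`run_sft(..., callbacks)`); if peak memory still grows steadily with it enabled, the accumulation is probably happening elsewhere, e.g. in the ZeRO-3 gradient-partitioning path shown in the stack trace.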