Reminder
System Info
(llamafactory_env) [localhost.com LLaMA-Factory]$ llamafactory-cli env
- `llamafactory` version: 0.9.2.dev0
- Platform: Linux-4.18.0-553.34.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.11
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.45.2
- Datasets version: 3.2.0
- Accelerate version: 1.2.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA L40S-48C
Reproduction
I am using this model: https://huggingface.co/casperhansen/llama-3.3-70b-instruct-awq
I have 2x L40S with 48 GB VRAM each; this should be enough for the fine-tuning, shouldn't it?
I am training with examples/train_qlora/llama3_lora_sft_awq.yaml.
It works fine for 200/588 training steps, but then a CUDA OOM appears.
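As a rough sanity check of whether two 48 GB cards should be enough, here is a back-of-the-envelope estimate of the static per-GPU footprint (DDP keeps a full model replica on each card). All dimensions below are assumed Llama 3.3 70B values (80 layers, hidden size 8192, KV dim 1024, intermediate size 28672, vocab 128256), with AWQ 4-bit at group size 128 and the LoRA/AdamW states upcast to float32 as the log reports; this is a sketch, not a measurement.

# Back-of-the-envelope static VRAM per GPU (all dimensions assumed, not measured)
GiB = 1024**3
layers, hidden, kv_dim, inter, vocab = 80, 8192, 1024, 28672, 128256
# quantized linear weights per layer: q/o (hidden x hidden), k/v (hidden x kv_dim),
# gate/up/down (hidden x inter)
linear_params = layers * (2 * hidden * hidden + 2 * hidden * kv_dim + 3 * hidden * inter)
awq_weights = linear_params * 0.5                  # 4-bit packed weights
awq_overhead = linear_params / 128 * 2.5           # fp16 scale + packed 4-bit zero per group of 128
fp16_params = 2 * vocab * hidden + (2 * layers + 1) * hidden   # embed_tokens, lm_head, RMSNorms
lora_bytes = 103_546_880 * 4 * 4                   # fp32 LoRA weights + grads + 2 AdamW states
total = awq_weights + awq_overhead + fp16_params * 2 + lora_bytes
print(f"{total / GiB:.1f} GiB")                    # ~38-39 GiB before any activations

So the model, adapters and optimizer states alone sit near 39 GiB of the 48 GiB per card, which is consistent with the "41.00 GiB allocated by PyTorch" in the traceback below once activations are added, and leaves very little headroom for the backward pass. The full log and traceback: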
[INFO|2025-02-04 23:39:20] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2025-02-04 23:39:20] llamafactory.model.model_utils.attention:157 >> Using torch SDPA for faster training and inference.
[INFO|2025-02-04 23:39:20] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2025-02-04 23:39:20] llamafactory.model.adapter:157 >> Fine-tuning method: LoRA
[INFO|2025-02-04 23:39:20] llamafactory.model.model_utils.misc:157 >> Found linear modules: o_proj,gate_proj,v_proj,k_proj,q_proj,down_proj,up_proj
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 9/9 [00:09<00:00, 1.10s/it]
[INFO|2025-02-04 23:39:21] llamafactory.model.loader:157 >> trainable params: 103,546,880 || all params: 2,206,212,096 || trainable%: 4.6934
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:667] 2025-02-04 23:39:21,477 >> Using auto half precision backend
[INFO|trainer.py:2243] 2025-02-04 23:39:24,619 >>***** Running training *****
[INFO|trainer.py:2244] 2025-02-04 23:39:24,620 >> Num examples = 3,141
[INFO|trainer.py:2245] 2025-02-04 23:39:24,620 >> Num Epochs = 3
[INFO|trainer.py:2246] 2025-02-04 23:39:24,620 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2249] 2025-02-04 23:39:24,620 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2250] 2025-02-04 23:39:24,620 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2251] 2025-02-04 23:39:24,620 >> Total optimization steps = 588
[INFO|trainer.py:2252] 2025-02-04 23:39:24,627 >> Number of trainable parameters = 103,546,880
{'loss': 1.4625, 'grad_norm': 1.5458040237426758, 'learning_rate': 1.694915254237288e-05, 'epoch': 0.05}
{'loss': 1.147, 'grad_norm': 1.6066330671310425, 'learning_rate': 3.389830508474576e-05, 'epoch': 0.1}
{'loss': 0.7523, 'grad_norm': 0.9638422131538391, 'learning_rate': 5.0847457627118643e-05, 'epoch': 0.15}
{'loss': 0.5633, 'grad_norm': 0.3667442500591278, 'learning_rate': 6.779661016949152e-05, 'epoch': 0.2}
{'loss': 0.5748, 'grad_norm': 0.553389847278595, 'learning_rate': 8.474576271186441e-05, 'epoch': 0.25}
{'loss': 0.5406, 'grad_norm': 0.5546717047691345, 'learning_rate': 9.999911828722436e-05, 'epoch': 0.31}
{'loss': 0.5377, 'grad_norm': 0.7597773671150208, 'learning_rate': 9.98933503759762e-05, 'epoch': 0.36}
{'loss': 0.5856, 'grad_norm': 3.359168529510498, 'learning_rate': 9.961166724127393e-05, 'epoch': 0.41}
{'loss': 0.5246, 'grad_norm': 1.0365256071090698, 'learning_rate': 9.915506204856368e-05, 'epoch': 0.46}
{'loss': 0.5063, 'grad_norm': 0.7878265976905823, 'learning_rate': 9.852514470786153e-05, 'epoch': 0.51}
{'loss': 0.4964, 'grad_norm': 0.5650718212127686, 'learning_rate': 9.77241361974925e-05, 'epoch': 0.56}
{'loss': 0.4832, 'grad_norm': 0.837906002998352, 'learning_rate': 9.675486073330953e-05, 'epoch': 0.61}
{'loss': 0.4892, 'grad_norm': 0.6627397537231445, 'learning_rate': 9.562073581100267e-05, 'epoch': 0.66}
{'loss': 0.497, 'grad_norm': 0.6025792360305786, 'learning_rate': 9.432576015660714e-05, 'epoch': 0.71}
{'loss': 0.4927, 'grad_norm': 0.6699998378753662, 'learning_rate': 9.287449962769499e-05, 'epoch': 0.76}
{'loss': 0.474, 'grad_norm': 0.6681831479072571, 'learning_rate': 9.12720711149603e-05, 'epoch': 0.81}
{'loss': 0.5353, 'grad_norm': 0.6149445176124573, 'learning_rate': 8.952412450095778e-05, 'epoch': 0.87}
{'loss': 0.4797, 'grad_norm': 0.8034713268280029, 'learning_rate': 8.76368227396056e-05, 'epoch': 0.92}
{'loss': 0.471, 'grad_norm': 0.6485044956207275, 'learning_rate': 8.561682012668805e-05, 'epoch': 0.97}
{'loss': 0.4451, 'grad_norm': 0.7033519744873047, 'learning_rate': 8.347123883797312e-05, 'epoch': 1.02}
34%|█████████████████████████▋ | 201/588 [1:23:10<2:39:23, 24.71s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/packages/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in<module>
[rank0]: launch()
[rank0]: File "/packages/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/packages/LLaMA-Factory/src/llamafactory/train/tuner.py", line 92, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/packages/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/packages/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/transformers/trainer.py", line 2052, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/transformers/trainer.py", line 3518, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/accelerate/accelerator.py", line 2248, in backward
[rank0]: loss.backward(**kwargs)
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank0]: return user_fn(self, *args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/utils/checkpoint.py", line 321, in backward
[rank0]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank0]: return user_fn(self, *args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/environments/llamafactory_env/lib64/python3.11/site-packages/awq/modules/linear/gemm.py", line 109, in backward
[rank0]: grad_input = grad_output.bmm(weights.transpose(0, 1).unsqueeze(0).repeat(batch_size, 1, 1))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 47.71 GiB of which 54.31 MiB is free. Including non-PyTorch memory, this process has 42.66 GiB memory in use. Of the allocated memory 41.00 GiB is allocated by PyTorch, and 1.01 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
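The failing allocation also lines up with the repeat() of a dequantized projection weight in the AWQ backward shown above: assuming Llama 3.3 70B shapes (my assumption, not read from the log), one gate/up/down projection weight is 8192 × 28672 values, which in fp16 is exactly the 448 MiB that could not be allocated.

# Size of weights.transpose(0, 1).unsqueeze(0).repeat(batch_size, 1, 1)
# for one fp16 gate/up/down projection weight (shapes assumed; per-device batch is 1)
batch_size, dim_a, dim_b = 1, 8192, 28672
tensor_bytes = batch_size * dim_a * dim_b * 2      # fp16 = 2 bytes per element
print(tensor_bytes / 2**20)                        # 448.0 MiB

So the backward through a quantized linear appears to materialize a full dequantized weight copy on top of the ~41 GiB already allocated, and with only ~54 MiB free that copy fails; the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the message only addresses the 1.01 GiB that is reserved but unallocated due to fragmentation.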
34%|█████████████████████████▋ | 201/588 [1:23:25<2:40:37, 24.90s/it]
[rank0]:[W205 01:02:50.586290846 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0205 01:02:51.399000 907678 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 907696 closing signal SIGTERM
E0205 01:02:51.613000 907678 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 907695) of binary: /environments/llamafactory_env/bin/python3.11
Traceback (most recent call last):
File "/environments/llamafactory_env/bin/torchrun", line 8, in<module>sys.exit(main())
^^^^^^
File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/packages/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time: 2025-02-05_01:02:51
host : localhost.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 907695)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Thank you for any help, and best regards!
P.S.: On a side note, I am also confused why it detects only 2.2B and not 70B parameters, but I am more worried about the CUDA OOM:
[INFO|2025-02-05 17:58:17] llamafactory.model.loader:157 >> trainable params: 103,546,880 || all params: 2,206,212,096 || trainable%: 4.6934
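For what it's worth, the 2.2B figure seems consistent with AutoAWQ storing the packed 4-bit weights as buffers rather than nn.Parameters, so the counter only sees the fp16 embeddings, the RMSNorm weights, and the LoRA adapters (this is my reading; the dimensions below are assumed Llama 3.3 70B values, not read from the checkpoint).

# Reproduce "all params: 2,206,212,096" assuming only non-quantized tensors
# are registered as nn.Parameters.
vocab, hidden, layers = 128256, 8192, 80     # assumed Llama 3.3 70B config
lora_params = 103_546_880                    # trainable params from the log
embeddings = 2 * vocab * hidden              # embed_tokens + lm_head (untied)
rmsnorms = (2 * layers + 1) * hidden         # 2 norms per layer + final norm
total = embeddings + rmsnorms + lora_params
print(f"{total:,}")                          # 2,206,212,096 -- matches the log

So the 70B of quantized weights are simply invisible to the parameter counter, which would explain the 2.2B.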
Others
No response