CUDA OOM in the middle of QLoRA with Llama 3.3 70B 4-bit AWQ #6827

Open
1 task done
paolovic opened this issue Feb 5, 2025 · 0 comments
Labels
bug (Something isn't working) · pending (This problem is yet to be addressed)

paolovic commented Feb 5, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

(llamafactory_env) [localhost.com LLaMA-Factory]$ llamafactory-cli env

- `llamafactory` version: 0.9.2.dev0
- Platform: Linux-4.18.0-553.34.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.11
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.45.2
- Datasets version: 3.2.0
- Accelerate version: 1.2.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA L40S-48C

Reproduction

I am using this model: https://huggingface.co/casperhansen/llama-3.3-70b-instruct-awq

I have 2x L40S with 48 GB VRAM each; this should be enough for the fine-tuning, shouldn't it?
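
For reference, here is my rough back-of-envelope estimate of the per-GPU memory budget (everything below is my own assumption or approximation, not a measured value):

```python
# Very rough per-GPU memory estimate (assumptions, not measurements).
# Without DeepSpeed, the two-GPU run is plain DDP, so each L40S holds a
# full replica of the quantized model.

GB = 1e9

base_params = 70e9                       # ~70B parameters in the base model
quantized_gb = base_params * 0.5 / GB    # 4-bit AWQ ~ 0.5 byte/weight -> ~35 GB
fp16_extras_gb = 2.2e9 * 2 / GB          # embeddings/lm_head/norms assumed to stay fp16,
                                         # using the ~2.2B figure from the log -> ~4.4 GB

lora_params = 103_546_880                # trainable params reported in the log
# LoRA weights upcast to fp32 (per the log) plus two AdamW states per param (assumed)
lora_and_optimizer_gb = lora_params * 4 * 3 / GB   # ~1.2 GB

print(f"weights ~ {quantized_gb + fp16_extras_gb:.0f} GB, "
      f"LoRA + optimizer ~ {lora_and_optimizer_gb:.1f} GB per GPU")
# Activations at cutoff_len=2048 (even with gradient checkpointing), the CUDA
# context and allocator fragmentation come on top, which leaves little headroom
# out of 48 GB -- roughly consistent with the ~42.7 GiB reported in the OOM below.
```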

This is my `examples/train_qlora/llama3_lora_sft_awq.yaml`:

### model
model_name_or_path: /models/llama-3.3-70b-instruct-awq
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: translate
template: llama3
cutoff_len: 2048
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3.3-70b/qlora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
#val_size: 0.1
#per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500

It works fine for 200/588 training steps, but then a CUDA OOM appears:

[INFO|2025-02-04 23:39:20] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2025-02-04 23:39:20] llamafactory.model.model_utils.attention:157 >> Using torch SDPA for faster training and inference.
[INFO|2025-02-04 23:39:20] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2025-02-04 23:39:20] llamafactory.model.adapter:157 >> Fine-tuning method: LoRA
[INFO|2025-02-04 23:39:20] llamafactory.model.model_utils.misc:157 >> Found linear modules: o_proj,gate_proj,v_proj,k_proj,q_proj,down_proj,up_proj
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 9/9 [00:09<00:00,  1.10s/it]
[INFO|2025-02-04 23:39:21] llamafactory.model.loader:157 >> trainable params: 103,546,880 || all params: 2,206,212,096 || trainable%: 4.6934
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:667] 2025-02-04 23:39:21,477 >> Using auto half precision backend
[INFO|trainer.py:2243] 2025-02-04 23:39:24,619 >> ***** Running training *****
[INFO|trainer.py:2244] 2025-02-04 23:39:24,620 >>   Num examples = 3,141
[INFO|trainer.py:2245] 2025-02-04 23:39:24,620 >>   Num Epochs = 3
[INFO|trainer.py:2246] 2025-02-04 23:39:24,620 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2249] 2025-02-04 23:39:24,620 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2250] 2025-02-04 23:39:24,620 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2251] 2025-02-04 23:39:24,620 >>   Total optimization steps = 588
[INFO|trainer.py:2252] 2025-02-04 23:39:24,627 >>   Number of trainable parameters = 103,546,880
{'loss': 1.4625, 'grad_norm': 1.5458040237426758, 'learning_rate': 1.694915254237288e-05, 'epoch': 0.05}
{'loss': 1.147, 'grad_norm': 1.6066330671310425, 'learning_rate': 3.389830508474576e-05, 'epoch': 0.1}
{'loss': 0.7523, 'grad_norm': 0.9638422131538391, 'learning_rate': 5.0847457627118643e-05, 'epoch': 0.15}
{'loss': 0.5633, 'grad_norm': 0.3667442500591278, 'learning_rate': 6.779661016949152e-05, 'epoch': 0.2}
{'loss': 0.5748, 'grad_norm': 0.553389847278595, 'learning_rate': 8.474576271186441e-05, 'epoch': 0.25}
{'loss': 0.5406, 'grad_norm': 0.5546717047691345, 'learning_rate': 9.999911828722436e-05, 'epoch': 0.31}
{'loss': 0.5377, 'grad_norm': 0.7597773671150208, 'learning_rate': 9.98933503759762e-05, 'epoch': 0.36}
{'loss': 0.5856, 'grad_norm': 3.359168529510498, 'learning_rate': 9.961166724127393e-05, 'epoch': 0.41}
{'loss': 0.5246, 'grad_norm': 1.0365256071090698, 'learning_rate': 9.915506204856368e-05, 'epoch': 0.46}
{'loss': 0.5063, 'grad_norm': 0.7878265976905823, 'learning_rate': 9.852514470786153e-05, 'epoch': 0.51}
{'loss': 0.4964, 'grad_norm': 0.5650718212127686, 'learning_rate': 9.77241361974925e-05, 'epoch': 0.56}
{'loss': 0.4832, 'grad_norm': 0.837906002998352, 'learning_rate': 9.675486073330953e-05, 'epoch': 0.61}
{'loss': 0.4892, 'grad_norm': 0.6627397537231445, 'learning_rate': 9.562073581100267e-05, 'epoch': 0.66}
{'loss': 0.497, 'grad_norm': 0.6025792360305786, 'learning_rate': 9.432576015660714e-05, 'epoch': 0.71}
{'loss': 0.4927, 'grad_norm': 0.6699998378753662, 'learning_rate': 9.287449962769499e-05, 'epoch': 0.76}
{'loss': 0.474, 'grad_norm': 0.6681831479072571, 'learning_rate': 9.12720711149603e-05, 'epoch': 0.81}
{'loss': 0.5353, 'grad_norm': 0.6149445176124573, 'learning_rate': 8.952412450095778e-05, 'epoch': 0.87}
{'loss': 0.4797, 'grad_norm': 0.8034713268280029, 'learning_rate': 8.76368227396056e-05, 'epoch': 0.92}
{'loss': 0.471, 'grad_norm': 0.6485044956207275, 'learning_rate': 8.561682012668805e-05, 'epoch': 0.97}
{'loss': 0.4451, 'grad_norm': 0.7033519744873047, 'learning_rate': 8.347123883797312e-05, 'epoch': 1.02}
 34%|█████████████████████████▋                                                 | 201/588 [1:23:10<2:39:23, 24.71s/it][rank0]: Traceback (most recent call last):
[rank0]:   File "/packages/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/packages/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/packages/LLaMA-Factory/src/llamafactory/train/tuner.py", line 92, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/packages/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/packages/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/transformers/trainer.py", line 2052, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/transformers/trainer.py", line 3518, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/accelerate/accelerator.py", line 2248, in backward
[rank0]:     loss.backward(**kwargs)
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/utils/checkpoint.py", line 321, in backward
[rank0]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/environments/llamafactory_env/lib64/python3.11/site-packages/awq/modules/linear/gemm.py", line 109, in backward
[rank0]:     grad_input = grad_output.bmm(weights.transpose(0, 1).unsqueeze(0).repeat(batch_size, 1, 1))
[rank0]:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 47.71 GiB of which 54.31 MiB is free. Including non-PyTorch memory, this process has 42.66 GiB memory in use. Of the allocated memory 41.00 GiB is allocated by PyTorch, and 1.01 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
 34%|█████████████████████████▋                                                 | 201/588 [1:23:25<2:40:37, 24.90s/it]
[rank0]:[W205 01:02:50.586290846 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0205 01:02:51.399000 907678 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 907696 closing signal SIGTERM
E0205 01:02:51.613000 907678 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 907695) of binary: /environments/llamafactory_env/bin/python3.11
Traceback (most recent call last):
  File "/environments/llamafactory_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/environments/llamafactory_env/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/packages/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-05_01:02:51
  host      : localhost.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 907695)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thank you in advance for any help, and best regards!

P.S.: On a side note, I am also confused why it reports only 2.2B total params instead of 70B, but I am more worried about the CUDA OOM:
[INFO|2025-02-05 17:58:17] llamafactory.model.loader:157 >> trainable params: 103,546,880 || all params: 2,206,212,096 || trainable%: 4.6934
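
My guess is that the 4-bit AWQ weights are stored as packed integer buffers rather than `nn.Parameter`s, so they simply do not show up in the parameter count, but I have not verified this. A minimal sketch of how one could check (the buffer names `qweight`/`qzeros`/`scales` are my assumption about how AutoAWQ stores the packed weights):

```python
# Sketch: count tensor elements registered as parameters vs. buffers.
# Assumption: the packed 4-bit AWQ weights live in buffers (qweight, qzeros,
# scales), which would explain why "all params" is only ~2.2B.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/models/llama-3.3-70b-instruct-awq",
    device_map="auto",   # inspection only
)

n_param_elems = sum(p.numel() for p in model.parameters())
n_buffer_elems = sum(b.numel() for b in model.buffers())
print(f"parameter elements: {n_param_elems:,}")
print(f"buffer elements:    {n_buffer_elems:,}")

# Peek at the tensors of one quantized projection to see where the packed
# weights actually live.
for name, tensor in model.named_buffers():
    if "layers.0.self_attn.q_proj" in name:
        print(name, tuple(tensor.shape), tensor.dtype)
```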

Others

No response

@paolovic paolovic added bug Something isn't working pending This problem is yet to be addressed labels Feb 5, 2025