Full-parameter fine-tuning of Qwen2.5 14B on 8x A800 80G hits OOM midway through training #2947

SJLMax · 2025-01-21T04:29:03Z

`nproc_per_node=8

TOKENIZERS_PARALLELISM=true SIZE_FACTOR=8 MAX_PIXELS=602112 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
python3 -m swift.cli.main sft \
--model ../qwen_model/Qwen2_5-14B-Instruct \
--train_type full \
--model_type qwen2_5 \
--dataset ./data/kw1_cot/train_new.jsonl \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--learning_rate 1e-5 \
--target_modules all-linear \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 5 \
--logging_steps 5 \
--max_length 8192 \
--output_dir ./data/kw1_cot/output/qwen2_5-14B-Instruct \
--system 'You are a helpful assistant.' \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--deepspeed zero3`

SJLMax · 2025-01-21T05:13:10Z

torch 1.13.1+cu116
ms-swift 3.0.2.post1
transformers 4.40.0

This is the error message. How can I fix it?

Jintao-Huang · 2025-02-09T15:22:49Z

You could try DDP combined with device_map: launch 2 processes across the 8 GPUs.
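For readers who want to try this, here is a minimal sketch of the suggestion applied to the command from the original post. It assumes that ms-swift places each process's model replica across several GPUs (device_map style) when `NPROC_PER_NODE` is smaller than the number of visible GPUs and `--deepspeed` is not passed; that behavior is not confirmed in this thread. The `DRY_RUN` switch is hypothetical and only added so the command can be inspected before launching. Vision-related environment variables from the original command are omitted.

```shell
#!/bin/sh
# Sketch only: 2 DDP processes share 8 GPUs, so each process would spread
# one model replica over 4 GPUs (assumed ms-swift behavior, see lead-in).
set -eu

nproc_per_node=2
gpus=0,1,2,3,4,5,6,7

# Same formula as the original post: 16 / nproc_per_node.
# With 2 processes this gives 8 accumulation steps instead of 2.
grad_accum=$((16 / nproc_per_node))

# Same arguments as the original command, minus "--deepspeed zero3".
set -- python3 -m swift.cli.main sft \
  --model ../qwen_model/Qwen2_5-14B-Instruct \
  --train_type full \
  --model_type qwen2_5 \
  --dataset ./data/kw1_cot/train_new.jsonl \
  --torch_dtype bfloat16 \
  --num_train_epochs 1 \
  --learning_rate 1e-5 \
  --gradient_accumulation_steps "$grad_accum" \
  --eval_steps 100 \
  --save_steps 100 \
  --save_total_limit 5 \
  --logging_steps 5 \
  --max_length 8192 \
  --output_dir ./data/kw1_cot/output/qwen2_5-14B-Instruct \
  --system 'You are a helpful assistant.' \
  --warmup_ratio 0.05 \
  --dataloader_num_workers 4

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Hypothetical dry-run mode: print the launch line instead of training.
  # Note: the printed line does not re-quote arguments that contain spaces.
  echo "CUDA_VISIBLE_DEVICES=$gpus NPROC_PER_NODE=$nproc_per_node $*"
else
  CUDA_VISIBLE_DEVICES=$gpus NPROC_PER_NODE=$nproc_per_node "$@"
fi
```

Run it with `DRY_RUN=0` to actually start training. Without ZeRO-3, optimizer states are no longer sharded across processes, so whether this fits in memory likely depends on sequence length and optimizer settings; lowering `--max_length` is another knob worth testing.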

Jintao-Huang closed this as completed Feb 9, 2025
