NPU: ds3_ofld training does not release memory and eventually OOMs #6816

Open · 1 task done
ultramangod opened this issue Feb 5, 2025 · 0 comments
Labels: bug (Something isn't working) · npu (This problem is related to NPU devices) · pending (This problem is yet to be addressed)

### Reminder

- [x] I have read the above rules and searched the existing issues.

### System Info

Hardware: AICC node with 8× Ascend 910B1 NPUs, CANN 8.0.RC2
Model being trained: Qwen2.5-32B-Instruct
Finetuning type: full

Platform: Linux-notebook4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64
Python==3.9.18
llamafactory==0.9.2.dev0
torch==2.1.0
torch-npu==2.1.0.post3
transformers==4.46.1
datasets==3.1.0
accelerate==1.0.1
peft==0.12.0
trl==0.9.6
deepspeed==0.15.4

In addition, following the guide at https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/performance_tuning_0027.html, the flash-attn2 operator was replaced with the NPU fused-attention operator (by modifying the relevant low-level transformers code as well as src/llamafactory/model/model_utils/attention.py); a sketch of what that replacement looks like is given below.
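
For reference, the replacement is essentially a redirection of the flash-attn2 call to the NPU fused attention operator. The sketch below is illustrative only: the wrapper name is mine, and the exact `torch_npu.npu_fusion_attention` argument list and layout flags are assumptions that should be checked against the torch_npu/CANN documentation for your version.

```python
# Illustrative sketch only (not the exact code from the Ascend guide): route a
# flash-attn2-style call to the NPU fused attention operator. The wrapper name is
# hypothetical, and the npu_fusion_attention arguments below are assumptions that
# must be verified against the torch_npu/CANN docs for your version.
import math

import torch_npu


def npu_flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, **kwargs):
    """Stand-in for flash_attn.flash_attn_func with (batch, seq, heads, head_dim) inputs."""
    head_num = q.shape[2]
    scale = softmax_scale if softmax_scale is not None else 1.0 / math.sqrt(q.shape[-1])
    # Causal/attention masks are omitted in this sketch.
    out = torch_npu.npu_fusion_attention(
        q, k, v, head_num,
        input_layout="BSND",     # (batch, seq, heads, head_dim)
        scale=scale,
        keep_prob=1.0 - dropout_p,
    )[0]                         # first element of the returned tuple is the attention output
    return out
```
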

### Reproduction

The deepspeed_z3_offload configuration is used to relieve device memory pressure; the host provides 1500 GB of RAM for offloading.
The DeepSpeed configuration file is as follows:
```json
{
  "train_batch_size": "64",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
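
For context, the static host-RAM footprint expected from ZeRO-3 CPU offload can be estimated with DeepSpeed's built-in helper. This is only a rough sanity check to compare against the "CPU Virtual Memory: used = ..." lines in the log below; it covers the parameter and optimizer partitions only (not activations, pinned buffers, or dataloader workers), and it loads the full model on the host once just to count parameters.

```python
# Rough sanity check of the static host-RAM needed by ZeRO-3 CPU offload for this
# model (parameter + optimizer partitions only). Note: this loads the full model on
# the host once, purely to count parameters.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained(
    "/home/ma-user/work/Models/Qwen2___5-32B-Instruct", torch_dtype="auto"
)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```
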
The training configuration file is as follows:

```yaml
### model
model_name_or_path: /home/ma-user/work/Models/Qwen2___5-32B-Instruct
trust_remote_code: true
#enable_liger_kernel: true
#use_unsloth_gc: true


### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_offload_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: mixv4
template: qwen
cutoff_len: 32768
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2___5_32B_3/full/sft
logging_steps: 1
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_only_model: true
save_total_limit: 30

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000
flash_attn: fa2
#resume_from_checkpoint: saves/qwen2___5_32B/full/sft/checkpoint-2

### eval
val_size: 0.0015
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

Launch training:

```shell
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  llamafactory-cli train examples/train_full/qwen_full_sft.yaml
```
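
To see how device memory evolves across steps while this runs, I use a small host-side helper (my own addition, not part of LLaMA-Factory) that polls `npu-smi info` once a minute; it assumes the standard CANN `npu-smi` tool is on PATH.

```python
# Host-side helper (my own addition, not part of LLaMA-Factory): poll `npu-smi info`
# once a minute while training runs, to see whether device memory grows step by step.
import datetime
import subprocess
import time

with open("npu_mem.log", "a") as log:
    while True:
        log.write(f"===== {datetime.datetime.now().isoformat()} =====\n")
        log.write(subprocess.run(["npu-smi", "info"], capture_output=True, text=True).stdout)
        log.flush()
        time.sleep(60)
```
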
Relevant DeepSpeed log:

```
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-01-28 12:49:05,766] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2025-01-28 12:49:05,766] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.6930582523345947 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.661719560623169 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.6968154907226562 seconds
[2025-01-28 12:49:05,810] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-01-28 12:49:05,814] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-01-28 12:49:05,814] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.7483572959899902 seconds
[2025-01-28 12:49:05,961] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-01-28 12:49:05,961] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-01-28 12:49:05,961] [INFO] [logging.py:128:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-01-28 12:49:05,961] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-01-28 12:49:06,439] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-01-28 12:49:06,440] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 4.35 GB         CA 0.0 GB         Max_CA 4 GB 
[2025-01-28 12:49:06,441] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 120.58 GB, percent = 8.0%
[2025-01-28 12:49:06,448] [INFO] [stage3.py:166:__init__] Reduce bucket size 26214400
[2025-01-28 12:49:06,448] [INFO] [stage3.py:167:__init__] Prefetch bucket size 23592960
[2025-01-28 12:49:06,862] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-01-28 12:49:06,864] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:49:06,864] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 120.58 GB, percent = 8.0%
Parameter Offload: Total persistent parameters: 1119232 in 321 params
[2025-01-28 12:49:07,736] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-01-28 12:49:07,738] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:49:07,739] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 120.57 GB, percent = 8.0%
[2025-01-28 12:49:08,613] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-01-28 12:49:08,614] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:49:08,615] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 120.56 GB, percent = 8.0%
[2025-01-28 12:49:23,857] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 5
[2025-01-28 12:49:23,859] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:49:23,859] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 204.98 GB, percent = 13.6%
[2025-01-28 12:49:24,246] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-01-28 12:49:24,247] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:49:24,248] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 207.65 GB, percent = 13.7%
[2025-01-28 12:50:00,191] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-01-28 12:50:00,192] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:50:00,192] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 304.03 GB, percent = 20.1%
[2025-01-28 12:50:00,576] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-01-28 12:50:00,578] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:50:00,578] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 310.11 GB, percent = 20.5%
[2025-01-28 12:50:19,156] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-01-28 12:50:19,157] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2025-01-28 12:50:19,157] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 425.79 GB, percent = 28.2%
[2025-01-28 12:50:19,159] [INFO] [stage3.py:521:_setup_for_real_optimizer] optimizer state initialized
len bsampler is : 150156
sampling 150156 sps in sum 150156
[the same two lines are printed by the other ranks as well; their output is interleaved]
[2025-01-28 12:50:34,952] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-01-28 12:50:34,954] [INFO] [utils.py:782:see_memory_usage] MA 0.05 GB         Max_MA 2.95 GB         CA 2.96 GB         Max_CA 3 GB 
[2025-01-28 12:50:34,954] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 490.06 GB, percent = 32.4%
[2025-01-28 12:50:34,954] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-01-28 12:50:34,954] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-01-28 12:50:34,955] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-01-28 12:50:34,955] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05, 1e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-01-28 12:50:34,960] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-01-28 12:50:34,960] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   amp_params ................... False
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2025-01-28 12:50:34,961] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0xfffed25e9310>
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   dump_state ................... False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2025-01-28 12:50:34,962] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   global_rank .................. 0
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 8
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-01-28 12:50:34,963] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   pld_params ................... False
[2025-01-28 12:50:34,964] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   steps_per_print .............. inf
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   train_batch_size ............. 64
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   world_size ................... 8
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2025-01-28 12:50:34,965] [INFO] [config.py:1003:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=26214400 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=23592960 param_persistence_threshold=51200 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-01-28 12:50:34,966] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2025-01-28 12:50:34,966] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. True
[2025-01-28 12:50:34,966] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 3
[2025-01-28 12:50:34,966] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 64, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 8, 
    "gradient_clipping": 1.0, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 2.621440e+07, 
        "stage3_prefetch_bucket_size": 2.359296e+07, 
        "stage3_param_persistence_threshold": 5.120000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "steps_per_print": inf
}
[INFO|trainer.py:2317] 2025-01-28 12:50:34,966 >> ***** Running training *****
```

After training for some time, device memory is exhausted:

```
{'loss': 0.6854, 'grad_norm': 1.2066658735275269, 'learning_rate': 7.270549063924843e-06, 'epoch': 0.35}

 35%|█████████████████████████████████████████████████████████████████▍                                                                                                                         | 821/2346 [98:44:29<189:15:35, 446.78s/it]
 35%|█████████████████████████████████████████████████████████████████▌                                                                                                                         | 822/2346 [98:51:09<183:14:26, 432.85s/it]
                                                                                                                                                                                                                                           
{'loss': 0.6851, 'grad_norm': 0.9848567247390747, 'learning_rate': 7.264581581159024e-06, 'epoch': 0.35}

EE9999: Inner Error!
EE9999: 2025-02-01-16:16:28.694.058  The error from device(chipId:0, dieId:0), serial number is 4, event wait timeout occurred during task execution, stream_id:22, sq_id:22, task_id:10603, event_id=3395, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
        TraceBack (most recent call last):
        Task execute failed, device_id=0, stream_id=22, task_id=10603, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EL0004: 2025-01-28-17:13:23.993.343 Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        The error from device(chipId:1, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:9295, event_id=3015, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
        Task execute failed, device_id=1, stream_id=6, task_id=9295, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

Traceback (most recent call last):
  File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 92, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 66, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/ma-user/work/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 2478, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/transformers/trainer.py", line 3610, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/accelerator.py", line 2239, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1173, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1528, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1265, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1317, in __reduce_and_partition_ipg_grads
    self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/lmfcty/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in partition_grads
    cuda_grad_buffer = grad_buffer.to(grad_partition.device, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
EE9999: Inner Error!
EE9999: 2025-02-01-16:16:28.755.133  The error from device(chipId:3, dieId:0), serial number is 3, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:5388, event_id=4013, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
        TraceBack (most recent call last):
        Task execute failed, device_id=3, stream_id=6, task_id=5388, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EL0004: 2025-01-29-06:20:28.300.866 Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        The error from device(chipId:5, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:8355, event_id=2510, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
        Task execute failed, device_id=5, stream_id=6, task_id=8355, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EL0004: 2025-01-31-08:23:58.845.175 Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        The error from device(chipId:7, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:9035, event_id=1522, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
        Task execute failed, device_id=7, stream_id=6, task_id=9035, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EL0004: 2025-01-28-17:25:09.561.058 Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        The error from device(chipId:2, dieId:0), serial number is 2, event wait timeout occurred during task execution, stream_id:6, sq_id:6, task_id:5930, event_id=4980, timeout=1868.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1389]
        Task execute failed, device_id=2, stream_id=6, task_id=5930, flip_num=720, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1082]
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
RuntimeError: ACL stream synchronize failed.
```

Previously, when training with batch_size 128, the same error occurred at roughly 18% progress. Judging from the logs, the resources used by earlier batches are apparently not released, and the accumulation eventually leads to OOM.

Any suggestions on how to approach this would be appreciated, for example releasing memory manually at fixed intervals. Thanks.
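
One direction along these lines (a workaround sketch, not a confirmed fix) would be a Trainer callback that periodically forces garbage collection and drops the NPU caching-allocator blocks; it assumes torch_npu mirrors the CUDA memory API under `torch.npu` (`empty_cache` / `memory_allocated` / `memory_reserved`).

```python
# Workaround sketch only (not a confirmed fix): periodically run GC and drop cached
# allocator blocks from a transformers TrainerCallback. Assumes torch_npu mirrors the
# CUDA memory API under torch.npu (empty_cache / memory_allocated / memory_reserved).
import gc

import torch
import torch_npu  # noqa: F401  (registers the torch.npu backend)
from transformers import TrainerCallback


class NpuEmptyCacheCallback(TrainerCallback):
    """Hypothetical callback: release NPU cache every `every_n_steps` optimizer steps."""

    def __init__(self, every_n_steps: int = 50):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step > 0 and state.global_step % self.every_n_steps == 0:
            gc.collect()
            torch.npu.empty_cache()
            print(
                f"step {state.global_step}: "
                f"NPU allocated {torch.npu.memory_allocated() / 2**30:.2f} GiB, "
                f"reserved {torch.npu.memory_reserved() / 2**30:.2f} GiB"
            )
        return control
```

Since llamafactory-cli has no flag to inject such a callback, it would have to be wired into run_sft manually; logging allocated versus reserved memory would at least show whether the growth sits in cached blocks or in live allocations.
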



### Others

_No response_
@ultramangod ultramangod added bug Something isn't working pending This problem is yet to be addressed labels Feb 5, 2025
@github-actions github-actions bot added the npu This problem is related to NPU devices label Feb 5, 2025