FAQ
NotImplementedError
NotImplementedError: Using RTX 3090 or 4000 series doesn’t support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
To fix this, set both variables in the shell before starting training:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
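If exporting the variables in the shell is inconvenient (for example, when training is started from a notebook or a wrapper script), the same two settings can be applied from Python. The following is a minimal sketch, not part of the repository: the helper name `disable_nccl_p2p_ib` is hypothetical, and it assumes the call runs before any distributed setup, since the variables are only read when the backend initializes.

```python
import os


def disable_nccl_p2p_ib(env=os.environ):
    """Set the two NCCL switches named in the error message.

    `env` defaults to the current process environment; a plain dict can be
    passed instead, for example to build the environment of a subprocess.
    """
    env["NCCL_P2P_DISABLE"] = "1"
    env["NCCL_IB_DISABLE"] = "1"
    return env


# Assumption: this runs at the very top of the training script, before
# torch / transformers set up distributed communication.
disable_nccl_p2p_ib()
```

Child processes started afterwards inherit the modified environment, so the settings also reach workers spawned by a launcher invoked from the same process.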
exits with return code = -9
https://github.com/deepseek-ai/DeepSeek-Coder/issues/55
https://github.com/deepseek-ai/DeepSeek-Coder/issues/54
[2023-11-28 10:00:34,590] [ERROR] [launch.py:321:sigkill_handler] ['/home/.python_libs/conda_env/deepseek/bin/python', '-u', 'finetune_deepseekcoder.py', '--local_rank=0', '--model_name_or_path', '/home/project/deepseek/DeepSeek-Coder-main/models/deepseek-coder-6.7b-instruct', '--data_path', '/home/project/deepseek/DeepSeek-Coder-main/data/test.json', '--output_dir', '/home/project/deepseek/DeepSeek-Coder-main/deepseek_finetune', '--num_train_epochs', '1', '--model_max_length', '1024', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '100', '--learning_rate', '2e-5', '--warmup_steps', '10', '--logging_steps', '1', '--lr_scheduler_type', 'cosine', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', 'True'] exits with return code = -9
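A return code of -9 means the training process was killed with SIGKILL, which on Linux typically points to the host running out of memory. Modify the DeepSpeed config configs/ds_config_zero3.json and set pin_memory to false in both offload sections:
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": false
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": false
    },
    ...
}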
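The same change can be scripted instead of edited by hand. The sketch below is illustrative only: the helper names `disable_pin_memory` and `patch_config_file` are hypothetical, and it assumes the config file is plain JSON with the `zero_optimization` layout shown above.

```python
import json
from pathlib import Path


def disable_pin_memory(config: dict) -> dict:
    """Set pin_memory to False in both ZeRO offload sections, if present."""
    zero = config.get("zero_optimization", {})
    for section in ("offload_optimizer", "offload_param"):
        if isinstance(zero.get(section), dict):
            zero[section]["pin_memory"] = False
    return config


def patch_config_file(path: str) -> None:
    """Rewrite a DeepSpeed JSON config in place with pin_memory disabled."""
    config_path = Path(path)
    config = json.loads(config_path.read_text())
    config_path.write_text(json.dumps(disable_pin_memory(config), indent=2) + "\n")


# In-memory demonstration using the offload sections from this FAQ entry.
example = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}
print(json.dumps(disable_pin_memory(example)["zero_optimization"], indent=2))
```

To apply it to the file itself, call `patch_config_file("configs/ds_config_zero3.json")` from the repository root. Keys outside the two offload sections are left unchanged.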