MindSpore / mindformers
[MindFormers][mcore] Poor pretraining performance for Qwen3-8B on a single node with 8 NPUs
TODO · Question · #ID54UM
YinanF created on 2025-11-05 14:33
Problem description:
Pretraining Qwen3-8B on a single node with 8 × 910B NPUs performs poorly, and I would like to locate the performance bottleneck. The training log shows that `train_throughput_per_npu` is below 20 TFLOPS, which converts to an MFU of less than 10%.

[training log screenshot]

Environment:
- Atlas 910B3
- Ubuntu 22.04.5 LTS
- Python 3.11.4
- Driver 24.1.rc2.1
- CANN 8.2.RC1
- MindSpore 2.7.0_20250919142025
- MindFormers (master: a3ac66, mcore-based)

Steps to reproduce:
The YAML for the pretraining task is as follows:

```yaml
seed: 42
output_dir: './output'
load_checkpoint: ''
load_ckpt_format: 'safetensors'
src_strategy_path_or_dir: ''
auto_trans_ckpt: False  # If true, automatically transforms the loaded checkpoint for distributed model compatibility
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'train'
use_legacy: False

# Trainer configuration
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'Qwen3'

# Runner configuration
runner_config:
  epochs: 1
  batch_size: 1
  gradient_accumulation_steps: 1

# Optimizer configuration
optimizer:
  type: AdamW
  betas: [0.9, 0.95]
  eps: 1.e-8
  weight_decay: 0.0

# Learning rate scheduler configuration
lr_schedule:
  type: ConstantWarmUpLR
  learning_rate: 1.e-6
  warmup_ratio: 0
  total_steps: -1  # -1 indicates using the total steps from the dataset

# Dataset configuration
train_dataset: &train_dataset
  data_loader:
    type: BlendedMegatronDatasetDataLoader
    datasets_type: "GPTDataset"
    sizes:
      - 8000  # Number of samples in the training set
      - 0     # Number of samples in the test set (currently unsupported)
      - 0     # Number of samples in the evaluation set (currently unsupported)
    config:
      seed: 1234                         # Random seed for data sampling
      split: "1, 0, 0"                   # Proportions for training, test, and evaluation sets (test/eval currently unsupported)
      seq_length: 4096                   # Sequence length of the dataset
      eod_mask_loss: False               # Whether to calculate loss at the end-of-document (EOD)
      reset_position_ids: False          # Whether to reset position_ids at EOD
      create_attention_mask: True        # Whether to include attention_mask in the dataset
      reset_attention_mask: False        # Whether to reset attention_mask at EOD, creating a stepped attention_mask
      create_compressed_eod_mask: False  # Whether to include a compressed attention_mask
      eod_pad_length: 128                # Length of the compressed attention_mask
      eod: 1                             # Token ID for EOD in the dataset
      pad: -1                            # Token ID for padding in the dataset
      data_path:                         # Sampling proportion and path for the Megatron dataset
        - '1'
        - "./datasets/wiki103-megatron_text_document"
  input_columns: ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  numa_enable: False
  prefetch_size: 1
  seed: 1234

train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset

# MindSpore context initialization configuration, reference:
# https://www.mindspore.cn/docs/en/r2.6.0/api_python/mindspore/mindspore.set_context.html
context:
  mode: 0                      # 0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"      # Target device to run (only supports "Ascend")
  max_device_memory: "59GB"    # Maximum memory available for the device
  memory_optimize_level: "O0"  # Memory optimization level
  jit_config:                  # Global JIT configuration for compilation
    jit_level: "O1"            # Compilation optimization level
  ascend_config:               # Parameters specific to the Ascend hardware platform
    # precision_mode: "must_keep_origin_dtype"  # Mixed precision mode setting
    parallel_speed_up_json_path: "./configs/qwen3/parallel_speed_up.json"  # Path to the parallel speedup JSON file

# Parallel configuration
parallel_config:
  data_parallel: &dp 4           # Number of data parallel
  model_parallel: 2              # Number of model parallel
  pipeline_stage: 1              # Number of pipeline parallel
  micro_batch_num: 1             # Pipeline parallel microbatch size
  use_seq_parallel: False        # Whether to enable sequence parallelism
  gradient_aggregation_group: 1  # Size of the gradient communication operator fusion group
# When model_parallel > 1, setting micro_batch_interleave_num to 2 may accelerate the training process.
micro_batch_interleave_num: 1

# Parallel context configuration
parallel:
  parallel_mode: 1       # 0--data parallel; 1--semi-auto parallel; 2--auto parallel; 3--hybrid parallel
  enable_alltoall: True  # Enables AllToAll communication operator during parallel communication
  full_batch: False      # Whether to load the full batch of data in parallel mode
  dataset_strategy: [ [*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1, 1, 1] ]  # Must match the length of train_dataset.input_columns
  search_mode: "sharding_propagation"  # Fully-automatic parallel strategy search mode
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"  # Path for saving the parallel slicing strategy file
    only_trainable_params: False       # Whether to save/load slicing strategy for trainable parameters only
  enable_parallel_optimizer: True      # Whether to enable optimizer parallelism
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64
    optimizer_weight_shard_size: 4

# Recomputation configuration
recompute_config:
  recompute: False
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: False

# Model configuration
model:
  model_config:
    # Configurations from Hugging Face
    vocab_size: 151936
    hidden_size: 4096
    intermediate_size: 12288
    num_hidden_layers: 36
    num_attention_heads: 32
    num_key_value_heads: 8
    head_dim: 128
    hidden_act: 'swiglu'
    max_position_embeddings: 4096
    seq_length: 4096
    initializer_range: 0.02
    rms_norm_eps: 1.e-6
    use_cache: True
    tie_word_embeddings: False
    rope_theta: 1000000.
    attention_bias: False
    use_flash_attention: True
    add_bias_linear: False
    eos_token_id: 151645
    pad_token_id: 151643
    bos_token_id: 151643
    attention_dropout: 0.0
    # Configurations from MindFormers
    hidden_dropout: 0.0
    input_sliced_sig: True
    untie_embeddings_and_output_weights: True
    position_embedding_type: "rope"
    qk_layernorm: True
    use_contiguous_weight_layout_attention: False
    qkv_concat: True
    # offset: [-1, -1, 1, 1]
    params_dtype: "float32"
    compute_dtype: "bfloat16"
    layernorm_compute_dtype: "float32"
    softmax_compute_dtype: "float32"
    rotary_dtype: "float32"
    residual_dtype: "float32"
    model_type: "qwen3"
    architectures: ["Qwen3ForCausalLM"]

# Callbacks configuration, reference:
# https://www.mindspore.cn/mindformers/docs/en/r1.5.0/appendix/conf_files.html?highlight=enable_alltoall#callbacks-configuration
callbacks:
  - type: MFLossMonitor  # Prints training progress information
  # - type: CheckpointMonitor           # Saves model weights during training
  #   prefix: "qwen3"                   # Prefix for saved file names
  #   save_checkpoint_steps: 5000       # Interval steps for saving model weights
  #   keep_checkpoint_max: 1            # Maximum number of saved model weight files
  #   integrated_save: False            # Whether to aggregate weights for saving
  #   async_save: False                 # Whether to save model weights asynchronously
  #   checkpoint_format: "safetensors"  # Format for saving checkpoints

# Wrapper cell configuration
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense: 1.0
  use_clip_grad: True

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
layer_scale: False
layer_decay: 0.65
lr_scale_factor: 256
```

Expected result:
Although this training configuration has not been performance-tuned, the throughput is far below expectations. After tuning, an MFU of 45%+ is expected.
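For reference, the arithmetic behind the MFU figure above can be sketched as follows. This is only a back-of-envelope check, not MindFormers' own metric code; the ~313 TFLOPS BF16 peak assumed for one Atlas 910B3 NPU, the ~8.2B parameter count, and the 6·N FLOPs-per-token rule of thumb are assumptions made here for illustration.

```python
# Rough MFU arithmetic for the numbers quoted in this issue.
# ASSUMPTIONS (not from the issue): one Atlas 910B3 NPU peaks at ~313 TFLOPS
# dense BF16, Qwen3-8B has ~8.2B parameters, and model FLOPs are estimated
# with the common ~6 * N FLOPs-per-token approximation.

PEAK_TFLOPS_PER_NPU = 313.0       # assumed BF16 peak of one 910B3 NPU
reported_tflops_per_npu = 20.0    # train_throughput_per_npu from the log
model_params = 8.2e9              # approximate Qwen3-8B parameter count

mfu = reported_tflops_per_npu / PEAK_TFLOPS_PER_NPU
print(f"observed MFU      ~ {mfu:.1%}")          # ~6.4%, i.e. below 10%

tokens_per_sec_per_npu = reported_tflops_per_npu * 1e12 / (6 * model_params)
print(f"implied throughput ~ {tokens_per_sec_per_npu:.0f} tokens/s per NPU")

target_mfu = 0.45                 # tuning goal stated in the issue
print(f"45% MFU would need ~ {target_mfu * PEAK_TFLOPS_PER_NPU:.0f} TFLOPS per NPU")
```

Under these assumptions the reported 20 TFLOPS per NPU indeed lands well under 10% MFU, and the 45% target corresponds to roughly 140 TFLOPS per NPU.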
Comments (1)