Journal of Jilin University (Engineering and Technology Edition) ›› 2024, Vol. 54 ›› Issue (12): 3601-3613. doi: 10.13229/j.cnki.jdxbgxb.20230164

• Computer Science and Technology •

Adaptive scheduling of computing tasks for deep neural network model parallelism

Tao JU, Shuai LIU, Jiu-yuan HUO, Xue-jun ZHANG

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Received: 2023-02-23  Online: 2024-12-01  Published: 2025-01-24
  • About the first author: JU Tao (1980-), male, associate professor, Ph.D. Research interests: parallel computing, parallel optimization of deep learning, and edge computing. E-mail: jutao@mail.lzjtu.edu.cn
  • Supported by: National Natural Science Foundation of China (61862037); Lanzhou Talent Innovation and Entrepreneurship Project (2021-RC-40); Tianyou Innovation Team Project of Lanzhou Jiaotong University (TY202002)

Abstract:

Training large-scale deep neural network (DNN) models with model parallelism suffers from high memory consumption, low device utilization, long training time, and difficult convergence. To address these problems, an adaptive computing-task scheduling method for large-scale DNN model parallelism was proposed. First, a multi-iteration asynchronous parallel management mechanism for model parallelism was established to control the scheduling of micro-batch units, enabling reasonable model partitioning and allocation of computing resources while resolving the delayed gradient updates that arise in asynchronous iterations. Second, a topology-aware computing-resource allocation mechanism was designed to best match model training tasks with computing resources. Finally, a runtime scheduling strategy for computing resources and model tasks was designed to maximize the overlap of computation and communication during fine-grained model training and to improve the utilization of computing resources. Experimental results show that, compared with existing model-parallel methods, the proposed method makes full use of the computing resources of each GPU and increases the training speed of large-scale DNN models by 2.8 times on average while maintaining training accuracy.

Key words: parallel computing, deep neural network model parallelism, pipeline parallelism, asynchronous parallelism, task scheduling, computation-communication overlap
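
To make the pipeline-model-parallel setting described above concrete, the following minimal PyTorch sketch splits a toy model into two stages on separate devices and drives micro-batches through them in a pipelined forward pass. The stage boundaries, tensor sizes, and device choices are illustrative assumptions, not the partitioning produced by the proposed scheduling method.

```python
# A minimal, illustrative sketch of pipeline model parallelism with micro-batches
# (a GPipe-style forward pass), not the scheduling method proposed in the paper.
# Stage split, sizes, and device choice below are assumptions for illustration.
import torch
import torch.nn as nn

if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:                                    # fall back to CPU so the sketch still runs
    dev0 = dev1 = torch.device("cpu")

# Partition one model into two pipeline stages, one per device.
stage0 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to(dev0)
stage1 = nn.Sequential(nn.Linear(1024, 10)).to(dev1)

batch = torch.randn(256, 512)
micro_batches = batch.chunk(8)           # 8 micro-batches of 32 samples each

outputs, in_flight = [], None
for mb in list(micro_batches) + [None]:  # one extra step to drain the pipeline
    if in_flight is not None:
        outputs.append(stage1(in_flight.to(dev1)))   # stage 1 works on micro-batch i
    if mb is not None:
        in_flight = stage0(mb.to(dev0))              # stage 0 works on micro-batch i+1
logits = torch.cat(outputs)              # (256, 10); micro-batch order preserved
print(logits.shape)
```

With two GPUs, the stage-0 and stage-1 calls issued in the same loop iteration target different devices and can run concurrently because kernel launches are asynchronous; the method described above additionally decides how such micro-batch units are partitioned, placed, and interleaved across iterations.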

CLC number: TP311

Fig. 1  Improved pipeline parallelism mechanism

Fig. 2  Example of micro-batch unit scheduling

Fig. 3  Overall execution framework of computing task scheduling for DNN model parallelism

Fig. 4  Example of the improved pipeline parallelism (numbers denote micro-batch IDs)
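
Fig. 4 reads as a timetable of micro-batch IDs across pipeline stages. For comparison, the short sketch below prints the plain fill-and-drain (GPipe-style) forward schedule; it is only an illustrative baseline and does not reproduce the improved, multi-iteration asynchronous schedule of the figure.

```python
# Print a generic fill-and-drain (GPipe-style) forward schedule: entry [s][t]
# is the micro-batch ID that stage s processes at time step t ("-" = idle).
def pipeline_schedule(num_stages: int, num_micro_batches: int):
    steps = num_stages + num_micro_batches - 1
    table = [["-"] * steps for _ in range(num_stages)]
    for mb in range(num_micro_batches):
        for stage in range(num_stages):
            table[stage][mb + stage] = str(mb)   # micro-batch mb reaches stage s at step mb+s
    return table

for stage, row in enumerate(pipeline_schedule(num_stages=4, num_micro_batches=6)):
    print(f"stage {stage}: " + " ".join(f"{c:>2}" for c in row))
```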

Fig. 5  Model task allocation and placement

Fig. 6  Improved device stream processing
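
Fig. 6 concerns scheduling work onto separate device streams. As a point of reference, the sketch below shows a generic PyTorch prefetching pattern that overlaps host-to-device copies with computation by using a dedicated CUDA stream; it assumes a CUDA-capable GPU and is not the paper's runtime scheduler.

```python
# Overlap host-to-device transfers with computation by issuing copies on a
# dedicated CUDA stream (a generic prefetching pattern, shown for illustration).
import torch

device = torch.device("cuda:0")
copy_stream = torch.cuda.Stream(device=device)
model = torch.nn.Linear(1024, 1024).to(device)
batches = [torch.randn(64, 1024, pin_memory=True) for _ in range(8)]

with torch.cuda.stream(copy_stream):                  # prefetch the first batch
    next_batch = batches[0].to(device, non_blocking=True)

for i in range(len(batches)):
    torch.cuda.current_stream(device).wait_stream(copy_stream)   # copy i has finished
    current = next_batch
    current.record_stream(torch.cuda.current_stream(device))     # allocator safety
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):          # copy i+1 overlaps compute on i
            next_batch = batches[i + 1].to(device, non_blocking=True)
    out = model(current)                              # compute on the default stream
torch.cuda.synchronize(device)
```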

Table 1  Hardware and software configuration

Item                   Specification
CPU                    Intel Xeon Gold 6354 @ 3.00 GHz, 18 cores per socket, 39 MB cache, 36 cores over two sockets
Memory                 256 GB
GPU                    NVIDIA HGX A100, FP64 9.7 TFLOPS, FP32 19.5 TFLOPS
GPU memory             80 GB
GPU memory bandwidth   1935 GB/s
GPU interface          PCIe 4.0
PyTorch                1.10.0
Python                 3.7
CUDA                   11.1
cuDNN                  8.0.5
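
For reproducibility, a small check like the one below (assuming the PyTorch stack of Table 1 is installed) confirms that the local software versions and GPU match the reported configuration.

```python
# Report the installed versions and GPU, for comparison against Table 1.
import torch

print("PyTorch:", torch.__version__)                  # expected 1.10.0
print("CUDA:", torch.version.cuda)                    # expected 11.1
print("cuDNN:", torch.backends.cudnn.version())       # expected 8005, i.e. 8.0.5
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))      # expected an NVIDIA A100
    print("GPU memory (GB):",
          torch.cuda.get_device_properties(0).total_memory / 1024**3)
```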

Fig. 7  ResNet-101 pipeline-parallel training time

Fig. 8  ResNet-101 pipeline-parallel speedup

Fig. 9  AmoebaNet-36 pipeline-parallel training time

Fig. 10  AmoebaNet-36 pipeline-parallel speedup

Table 2  Average training throughput (samples/s)

Benchmark model   Batch size    GPipe      PipeDream    PP
ResNet-101        256           181.766    313.642      352.868
                  512           201.789    522.467      586.262
                  1 024         294.739    587.518      616.674
                  2 048         378.746    631.273      751.640
AmoebaNet-36      256           165.487    384.456      343.528
                  512           225.133    463.107      414.082
                  768           352.731    582.505      697.623
                  1 024         406.290    760.703      882.807
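
For easier reading of Table 2, the following snippet computes, for each configuration, the throughput of PP relative to GPipe and PipeDream; the values are copied verbatim from the table.

```python
# Per-configuration throughput ratios derived from Table 2 (samples/s).
table = {
    ("ResNet-101", 256):    (181.766, 313.642, 352.868),
    ("ResNet-101", 512):    (201.789, 522.467, 586.262),
    ("ResNet-101", 1024):   (294.739, 587.518, 616.674),
    ("ResNet-101", 2048):   (378.746, 631.273, 751.640),
    ("AmoebaNet-36", 256):  (165.487, 384.456, 343.528),
    ("AmoebaNet-36", 512):  (225.133, 463.107, 414.082),
    ("AmoebaNet-36", 768):  (352.731, 582.505, 697.623),
    ("AmoebaNet-36", 1024): (406.290, 760.703, 882.807),
}
for (model, batch), (gpipe, pipedream, pp) in table.items():
    print(f"{model:>13} @ {batch:>4}: PP/GPipe = {pp / gpipe:.2f}, "
          f"PP/PipeDream = {pp / pipedream:.2f}")
```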

Fig. 11  ResNet-101 pipeline-parallel training accuracy

Fig. 12  ResNet-101 pipeline-parallel training loss

Fig. 13  AmoebaNet-36 pipeline-parallel training accuracy

Fig. 14  AmoebaNet-36 pipeline-parallel training loss
