Journal of Jilin University (Engineering and Technology Edition) ›› 2024, Vol. 54 ›› Issue (12): 3601-3613. doi: 10.13229/j.cnki.jdxbgxb.20230164

• Computer Science and Technology •

Adaptive scheduling of computing tasks for deep neural network model parallelism

Tao JU, Shuai LIU, Jiu-yuan HUO, Xue-jun ZHANG

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Received: 2023-02-23  Online: 2024-12-01  Published: 2025-01-24
  • About the first author: JU Tao (1980-), male, associate professor, Ph.D. Research interests: parallel computing, parallel optimization of deep learning, and edge computing. E-mail: jutao@mail.lzjtu.edu.cn
  • Supported by: National Natural Science Foundation of China (61862037); Lanzhou Talent Innovation and Entrepreneurship Project (2021-RC-40); Tianyou Innovation Team Project of Lanzhou Jiaotong University (TY202002)

Abstract:

Training large-scale deep neural network (DNN) models with model parallelism suffers from high memory consumption, low device utilization, long training time, and difficult convergence. To address these problems, an adaptive computing-task scheduling method for large-scale DNN model parallelism was proposed. First, a multi-iteration asynchronous parallel management mechanism for model parallelism was established to control the scheduling of micro-batch units, enabling reasonable model partitioning and allocation of computing resources while resolving the delayed gradient updates that arise in asynchronous iterations. Second, a topology-aware computing-resource allocation mechanism was designed to best match model training tasks with computing resources. Finally, a runtime scheduling strategy for computing resources and model tasks was designed to maximize the overlap of computation and communication during fine-grained model training and to improve the utilization of computing resources. Experimental results show that, compared with existing model-parallel methods, the proposed method makes full use of the computing resources of each GPU and increases the training speed of large-scale DNN models by 2.8 times on average while maintaining training accuracy.

Key words: parallel computing, deep neural network model parallelism, pipeline parallelism, asynchronous parallelism, task scheduling, computation-communication overlap
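
To make the pipeline-model-parallel setting described above concrete, the following minimal PyTorch sketch splits a toy model into two stages on separate devices and drives micro-batches through them in a pipelined forward pass. The stage boundaries, tensor sizes, and device choices are illustrative assumptions, not the partitioning produced by the proposed scheduling method.

```python
# A minimal, illustrative sketch of pipeline model parallelism with micro-batches
# (a GPipe-style forward pass), not the scheduling method proposed in the paper.
# Stage split, sizes, and device choice below are assumptions for illustration.
import torch
import torch.nn as nn

if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:                                    # fall back to CPU so the sketch still runs
    dev0 = dev1 = torch.device("cpu")

# Partition one model into two pipeline stages, one per device.
stage0 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to(dev0)
stage1 = nn.Sequential(nn.Linear(1024, 10)).to(dev1)

batch = torch.randn(256, 512)
micro_batches = batch.chunk(8)           # 8 micro-batches of 32 samples each

outputs, in_flight = [], None
for mb in list(micro_batches) + [None]:  # one extra step to drain the pipeline
    if in_flight is not None:
        outputs.append(stage1(in_flight.to(dev1)))   # stage 1 works on micro-batch i
    if mb is not None:
        in_flight = stage0(mb.to(dev0))              # stage 0 works on micro-batch i+1
logits = torch.cat(outputs)              # (256, 10); micro-batch order preserved
print(logits.shape)
```

With two GPUs, the stage-0 and stage-1 calls issued in the same loop iteration target different devices and can run concurrently because kernel launches are asynchronous; the method described above additionally decides how such micro-batch units are partitioned, placed, and interleaved across iterations.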

CLC number: TP311

Fig. 1  Improved pipeline parallelism mechanism

Fig. 2  Example of micro-batch unit scheduling

Fig. 3  Overall execution framework of computing task scheduling for DNN model parallelism

Fig. 4  Example of the improved pipeline parallelism (numbers denote micro-batch IDs)
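
Fig. 4 reads as a timetable of micro-batch IDs across pipeline stages. For comparison, the short sketch below prints the plain fill-and-drain (GPipe-style) forward schedule; it is only an illustrative baseline and does not reproduce the improved, multi-iteration asynchronous schedule of the figure.

```python
# Print a generic fill-and-drain (GPipe-style) forward schedule: entry [s][t]
# is the micro-batch ID that stage s processes at time step t ("-" = idle).
def pipeline_schedule(num_stages: int, num_micro_batches: int):
    steps = num_stages + num_micro_batches - 1
    table = [["-"] * steps for _ in range(num_stages)]
    for mb in range(num_micro_batches):
        for stage in range(num_stages):
            table[stage][mb + stage] = str(mb)   # micro-batch mb reaches stage s at step mb+s
    return table

for stage, row in enumerate(pipeline_schedule(num_stages=4, num_micro_batches=6)):
    print(f"stage {stage}: " + " ".join(f"{c:>2}" for c in row))
```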

Fig. 5  Model task allocation and placement

Fig. 6  Improved device stream processing
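
Fig. 6 concerns scheduling work onto separate device streams. As a point of reference, the sketch below shows a generic PyTorch prefetching pattern that overlaps host-to-device copies with computation by using a dedicated CUDA stream; it assumes a CUDA-capable GPU and is not the paper's runtime scheduler.

```python
# Overlap host-to-device transfers with computation by issuing copies on a
# dedicated CUDA stream (a generic prefetching pattern, shown for illustration).
import torch

device = torch.device("cuda:0")
copy_stream = torch.cuda.Stream(device=device)
model = torch.nn.Linear(1024, 1024).to(device)
batches = [torch.randn(64, 1024, pin_memory=True) for _ in range(8)]

with torch.cuda.stream(copy_stream):                  # prefetch the first batch
    next_batch = batches[0].to(device, non_blocking=True)

for i in range(len(batches)):
    torch.cuda.current_stream(device).wait_stream(copy_stream)   # copy i has finished
    current = next_batch
    current.record_stream(torch.cuda.current_stream(device))     # allocator safety
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):          # copy i+1 overlaps compute on i
            next_batch = batches[i + 1].to(device, non_blocking=True)
    out = model(current)                              # compute on the default stream
torch.cuda.synchronize(device)
```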

Table 1  Hardware and software configuration

Item                   Specification
CPU                    Intel Xeon Gold 6354 @ 3.00 GHz, 18 cores per socket, 39 MB cache, 36 cores over two sockets
Memory                 256 GB
GPU                    NVIDIA HGX A100, FP64 9.7 TFLOPS, FP32 19.5 TFLOPS
GPU memory             80 GB
GPU memory bandwidth   1935 GB/s
GPU interface          PCIe 4.0
PyTorch                1.10.0
Python                 3.7
CUDA                   11.1
cuDNN                  8.0.5
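
For reproducibility, a small check like the one below (assuming the PyTorch stack of Table 1 is installed) confirms that the local software versions and GPU match the reported configuration.

```python
# Report the installed versions and GPU, for comparison against Table 1.
import torch

print("PyTorch:", torch.__version__)                  # expected 1.10.0
print("CUDA:", torch.version.cuda)                    # expected 11.1
print("cuDNN:", torch.backends.cudnn.version())       # expected 8005, i.e. 8.0.5
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))      # expected an NVIDIA A100
    print("GPU memory (GB):",
          torch.cuda.get_device_properties(0).total_memory / 1024**3)
```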

Fig. 7  ResNet-101 pipeline-parallel training time

Fig. 8  ResNet-101 pipeline-parallel speedup

Fig. 9  AmoebaNet-36 pipeline-parallel training time

Fig. 10  AmoebaNet-36 pipeline-parallel speedup

Table 2  Average training throughput (samples/s)

Benchmark model   Batch size    GPipe      PipeDream    PP
ResNet-101        256           181.766    313.642      352.868
                  512           201.789    522.467      586.262
                  1 024         294.739    587.518      616.674
                  2 048         378.746    631.273      751.640
AmoebaNet-36      256           165.487    384.456      343.528
                  512           225.133    463.107      414.082
                  768           352.731    582.505      697.623
                  1 024         406.290    760.703      882.807
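
For easier reading of Table 2, the following snippet computes, for each configuration, the throughput of PP relative to GPipe and PipeDream; the values are copied verbatim from the table.

```python
# Per-configuration throughput ratios derived from Table 2 (samples/s).
table = {
    ("ResNet-101", 256):    (181.766, 313.642, 352.868),
    ("ResNet-101", 512):    (201.789, 522.467, 586.262),
    ("ResNet-101", 1024):   (294.739, 587.518, 616.674),
    ("ResNet-101", 2048):   (378.746, 631.273, 751.640),
    ("AmoebaNet-36", 256):  (165.487, 384.456, 343.528),
    ("AmoebaNet-36", 512):  (225.133, 463.107, 414.082),
    ("AmoebaNet-36", 768):  (352.731, 582.505, 697.623),
    ("AmoebaNet-36", 1024): (406.290, 760.703, 882.807),
}
for (model, batch), (gpipe, pipedream, pp) in table.items():
    print(f"{model:>13} @ {batch:>4}: PP/GPipe = {pp / gpipe:.2f}, "
          f"PP/PipeDream = {pp / pipedream:.2f}")
```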

Fig. 11  ResNet-101 pipeline-parallel training accuracy

Fig. 12  ResNet-101 pipeline-parallel training loss

Fig. 13  AmoebaNet-36 pipeline-parallel training accuracy

Fig. 14  AmoebaNet-36 pipeline-parallel training loss
