Journal of Jilin University (Engineering and Technology Edition) ›› 2024, Vol. 54 ›› Issue (12): 3601-3613. DOI: 10.13229/j.cnki.jdxbgxb.20230164


Adaptive scheduling of computing tasks for deep neural network model parallelism

Tao JU, Shuai LIU, Jiu-yuan HUO, Xue-jun ZHANG

School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Received: 2023-02-23   Online: 2024-12-01   Published: 2025-01-24

Abstract:

To address the high memory consumption, low device utilization, long training time, and difficult convergence encountered when training large-scale deep neural network (DNN) models, an adaptive parallel task scheduling method for large-scale DNN models was proposed. First, a multi-iteration asynchronous parallel management mechanism for model parallelism was established; by controlling the scheduling of micro-batch units, it enabled reasonable model partitioning and allocation of computing resources and resolved the delayed gradient update problem that arises in asynchronous iterations. Second, a topology-aware computing resource allocation mechanism was designed to achieve the best match between model training tasks and computing resources. Finally, a runtime scheduling strategy for computing resources and model tasks was designed to maximize the overlap between computation and communication during fine-grained deep learning model training and to improve the utilization of computing resources. Experimental results show that, compared with existing model-parallel methods, the proposed scheduling strategy makes full use of the computing resources of each GPU and improves the training speed of large-scale DNN models by 2.8 times on average while preserving model training accuracy.
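The micro-batch pipeline described above can be illustrated, in simplified form, with a per-stage schedule generator. The sketch below assumes a generic 1F1B-style ordering (warm-up forwards, alternating forward/backward, then drain) as used in PipeDream-class pipelines; it is not the scheduler proposed in the paper, and all names are illustrative.

```python
# Minimal illustrative sketch (not the paper's scheduler): per-stage
# 1F1B-style ordering of forward (F) and backward (B) passes over
# micro-batch units, which keeps every pipeline stage busy while
# bounding the number of in-flight activations.

def one_f_one_b(num_stages, num_microbatches):
    """Return, for each stage, the ordered list of ("F"/"B", micro-batch id) ops."""
    schedule = {}
    for stage in range(num_stages):
        warmup = min(num_stages - stage - 1, num_microbatches)
        ops, fwd, bwd = [], 0, 0
        for _ in range(warmup):            # fill phase: forwards only
            ops.append(("F", fwd))
            fwd += 1
        while fwd < num_microbatches:      # steady state: one forward, one backward
            ops.append(("F", fwd))
            fwd += 1
            ops.append(("B", bwd))
            bwd += 1
        while bwd < num_microbatches:      # drain phase: remaining backwards
            ops.append(("B", bwd))
            bwd += 1
        schedule[stage] = ops
    return schedule

if __name__ == "__main__":
    for stage, ops in one_f_one_b(num_stages=4, num_microbatches=8).items():
        print(f"stage {stage}: " + " ".join(f"{p}{m}" for p, m in ops))
```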

Key words: parallel computing, deep neural network model parallelism, pipeline parallelism, asynchronous parallelism, task scheduling, computation-communication overlap

CLC Number: TP311

Fig.1

Improved pipeline parallelism

Fig.2

An example of micro-batch unit scheduling

Fig.3

DNN model parallel computing task scheduling framework

Fig.4

Example of improved pipeline parallelism (the numbers indicate the micro-batch ID)

Fig.5

Model task assignment and placement
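Figure 5 concerns assigning model tasks to devices. As a rough illustration of topology-aware placement (not the paper's allocation mechanism), the sketch below brute-forces, over a hypothetical 4-GPU node, the stage-to-GPU placement that maximises link bandwidth between adjacent pipeline stages; the bandwidth numbers are made up for the example.

```python
# Illustrative sketch only (not the paper's allocation algorithm): choose a
# stage-to-GPU placement so that adjacent pipeline stages sit on device pairs
# with the highest interconnect bandwidth, keeping activation/gradient
# transfers between stages cheap. Brute force is fine for a handful of GPUs.

import itertools

def place_stages(num_stages, bandwidth):
    """bandwidth maps each unordered GPU pair (i, j), i < j, to GB/s."""
    gpus = sorted({g for pair in bandwidth for g in pair})
    assert len(gpus) >= num_stages

    def chain_bw(order):
        return sum(bandwidth[tuple(sorted(pair))] for pair in zip(order, order[1:]))

    return max(itertools.permutations(gpus, num_stages), key=chain_bw)

if __name__ == "__main__":
    # Hypothetical node: two fast NVLink-style pairs, slower PCIe elsewhere.
    bw = {(0, 1): 600, (2, 3): 600, (0, 2): 64, (0, 3): 64, (1, 2): 64, (1, 3): 64}
    placement = place_stages(num_stages=4, bandwidth=bw)
    print("stage -> GPU:", dict(enumerate(placement)))
```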

Fig.6

Device stream processing improvements
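Figure 6 concerns device stream handling. A common way to realise the computation-communication overlap mentioned in the abstract is to issue data transfers on a separate CUDA stream; the PyTorch sketch below is an assumption for illustration, not the paper's runtime, and simply prefetches the next micro-batch while the current one is being computed.

```python
# Minimal sketch (not the paper's runtime): overlap the host-to-device copy of
# the next micro-batch with computation on the current one by issuing the copy
# on a side CUDA stream. Requires a CUDA-capable GPU and pinned host memory.

import torch

def run_overlapped(model, microbatches, device="cuda"):
    copy_stream = torch.cuda.Stream(device)
    outputs = []
    current = microbatches[0].to(device, non_blocking=True)
    for i, _ in enumerate(microbatches):
        nxt = None
        if i + 1 < len(microbatches):
            # Prefetch the next micro-batch on the side stream while the
            # default stream runs the forward pass on the current one.
            with torch.cuda.stream(copy_stream):
                nxt = microbatches[i + 1].to(device, non_blocking=True)
        outputs.append(model(current))
        # Ensure the prefetched copy has finished before it is consumed.
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        current = nxt
    return outputs

if __name__ == "__main__":
    if torch.cuda.is_available():
        net = torch.nn.Linear(1024, 1024).cuda()
        data = [torch.randn(64, 1024).pin_memory() for _ in range(8)]
        print(len(run_overlapped(net, data)))
```

In production code one would additionally call `Tensor.record_stream()` on the prefetched tensors so the caching allocator does not reuse their memory before the side-stream copy is consumed; the sketch omits this for brevity.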

Table 1

Software and hardware configuration

Item                    Specification
CPU                     Intel Xeon Gold 6354 @ 3.00 GHz, 18 cores per socket, 39 MB cache, dual socket (36 cores total)
Memory                  256 GB
GPU                     NVIDIA HGX A100, FP64 9.7 TFLOPS, FP32 19.5 TFLOPS
GPU memory              80 GB
GPU memory bandwidth    1935 GB/s
GPU interface           PCIe 4.0
PyTorch                 1.10.0
Python                  3.7
CUDA                    11.1
cuDNN                   8.0.5

Fig.7

Time consumption of ResNet-101 pipeline parallel training

Fig.8

ResNet-101 pipeline parallel speedup ratio

Fig.9

Time consumption of AmoebaNet-36 pipeline parallel training

Fig.10

AmoebaNet-36 pipeline parallel speedup ratio

Table 2

Average model training throughput (samples/s)

Benchmark model   Batch size   GPipe     PipeDream   PP
ResNet-101        256          181.766   313.642     352.868
ResNet-101        512          201.789   522.467     586.262
ResNet-101        1024         294.739   587.518     616.674
ResNet-101        2048         378.746   631.273     751.640
AmoebaNet-36      256          165.487   384.456     343.528
AmoebaNet-36      512          225.133   463.107     414.082
AmoebaNet-36      768          352.731   582.505     697.623
AmoebaNet-36      1024         406.290   760.703     882.807
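The per-configuration throughput ratios implied by Table 2 can be recomputed directly from the tabulated values; the snippet below is only a sanity check on those numbers and does not attempt to reproduce the 2.8x average training-speed figure quoted in the abstract.

```python
# Sanity check on Table 2 (average training throughput, samples/s): recompute
# the throughput ratio of PP relative to GPipe and PipeDream per configuration.

table2 = {
    ("ResNet-101",   256):  (181.766, 313.642, 352.868),
    ("ResNet-101",   512):  (201.789, 522.467, 586.262),
    ("ResNet-101",   1024): (294.739, 587.518, 616.674),
    ("ResNet-101",   2048): (378.746, 631.273, 751.640),
    ("AmoebaNet-36", 256):  (165.487, 384.456, 343.528),
    ("AmoebaNet-36", 512):  (225.133, 463.107, 414.082),
    ("AmoebaNet-36", 768):  (352.731, 582.505, 697.623),
    ("AmoebaNet-36", 1024): (406.290, 760.703, 882.807),
}

for (model, batch), (gpipe, pipedream, pp) in table2.items():
    print(f"{model:13s} batch={batch:4d}  "
          f"PP/GPipe={pp / gpipe:.2f}  PP/PipeDream={pp / pipedream:.2f}")
```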

Fig.11

ResNet-101 pipeline parallel training accuracy

Fig.12

ResNet-101 pipeline parallel training loss

Fig.13

AmoebaNet-36 pipeline parallel training accuracy

Fig.14

AmoebaNet-36 pipeline parallel training loss
