Journal of Jilin University (Science Edition) ›› 2025, Vol. 63 ›› Issue (5): 1356-1365.

Accuracy-Aware Sparse Gradient Fusion Algorithm for Data-Parallel Deep Learning

LI Hongliang, ZHANG Meng, WANG Zichen, LI Xiang

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2024-07-02 Online: 2025-09-26 Published: 2025-09-26
  • Corresponding author: LI Xiang E-mail: lxiang@jlu.edu.cn

Abstract: Aiming at the performance bottleneck caused by gradient synchronization in data-parallel deep learning jobs, we propose a dynamic sparse gradient fusion algorithm. The algorithm jointly models gradient compression, pipelining, and tensor fusion, and establishes a theoretical model of the impact of sparse gradient fusion behavior on accuracy. Based on this model, it searches for a gradient fusion scheme that accelerates gradient synchronization while improving validation accuracy, thereby addressing the unstable validation accuracy caused by sparse gradient fusion. Experimental results show that the proposed sparse gradient fusion algorithm reduces communication time by a factor of 1.63 compared with the layer-wise sparsification method, and reduces convergence time by a factor of 2.68 compared with existing sparse gradient fusion algorithms.
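As a rough illustration of the gradient compression and tensor fusion steps the abstract refers to, the Python (PyTorch) sketch below applies per-layer top-k sparsification with error-feedback residuals and then fuses all layers' sparse entries into a single flat message, so one collective operation can replace per-layer synchronization. The function names, the density parameter, and the error-feedback detail are illustrative assumptions, not the authors' implementation, and the accuracy-aware selection of the fusion scheme is not modeled here.

import torch

def sparsify_and_fuse(grads, residuals, density=0.01):
    """Illustrative sketch (not the paper's implementation): compress each layer's
    gradient to its top-k entries with error feedback, then fuse all layers'
    (index, value) pairs into one message for a single collective call."""
    indices, values, meta = [], [], []
    offset = 0
    for name, g in grads.items():
        # Error feedback: add the residual left over from the previous step.
        acc = g + residuals.get(name, torch.zeros_like(g))
        flat = acc.flatten()
        k = max(1, int(density * flat.numel()))
        _, topk_idx = torch.topk(flat.abs(), k)
        vals = flat[topk_idx]
        # Entries that were not sent become the new residual.
        mask = torch.zeros_like(flat, dtype=torch.bool)
        mask[topk_idx] = True
        residuals[name] = torch.where(mask, torch.zeros_like(flat), flat).view_as(g)
        # Tensor fusion: shift each layer's indices into one shared index space.
        indices.append(topk_idx + offset)
        values.append(vals)
        meta.append((name, g.shape, offset, flat.numel()))
        offset += flat.numel()
    fused_idx = torch.cat(indices)
    fused_val = torch.cat(values)
    return fused_idx, fused_val, meta

def unfuse(fused_idx, fused_val, meta):
    """Scatter the fused sparse message back into dense per-layer gradients."""
    total = meta[-1][2] + meta[-1][3]
    dense = torch.zeros(total)
    dense[fused_idx] = fused_val
    return {name: dense[off:off + n].view(shape)
            for name, shape, off, n in meta}

# Minimal usage example with synthetic gradients (hypothetical layer names).
grads = {"layer1.weight": torch.randn(256, 128), "layer2.weight": torch.randn(10, 256)}
residuals = {}
idx, val, meta = sparsify_and_fuse(grads, residuals, density=0.01)
restored = unfuse(idx, val, meta)  # what a peer would reconstruct after the collective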

Key words: parallel deep learning, gradient sparsification, tensor fusion, communication pipeline technology

CLC number:

  • TP391