Trinity: Neural Network Adaptive Distributed Parallel Training Method Based on Reinforcement Learning

Yan Zeng

Jiyang Wu

Jilin Zhang

Yongjian Ren and Yunquan Zhang

Abstract

Deep learning, with increasingly large datasets and complex neural networks, is widely used in computer vision and natural language processing. A resulting trend is to split large-scale neural network models across multiple devices and train them in parallel, known as parallel model training. Existing parallel methods are mainly based on expert design, which is inefficient and requires specialized knowledge. Although automated parallel methods have been proposed to solve these problems, they optimize only a single objective, run time. In this paper, we present Trinity, an adaptive distributed parallel training method based on reinforcement learning, to automate the search and tuning of parallel strategies. We build a multidimensional performance evaluation model and use proximal policy optimization to co-optimize multiple objectives. Our experiments use the CIFAR10 and PTB datasets with the InceptionV3, NMT, NASNet, and PNASNet models. Compared with Google's Hierarchical method, Trinity achieves up to 5% reductions in runtime, communication, and memory overhead, and up to a 40% increase in parallel strategy search speed.
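The abstract does not give the form of the multidimensional evaluation model, so the following is only a minimal sketch of how run time, communication, and memory costs might be folded into one scalar reward for a policy-gradient search. The class `StrategyCost`, the function names, the equal weights, and the linear weighted-sum form are all assumptions for illustration, not the authors' formulation. The second function is the standard clipped surrogate term from proximal policy optimization, included to show where such a reward-derived advantage would be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyCost:
    """Measured costs of one candidate parallel strategy (hypothetical fields)."""
    runtime_s: float       # per-step run time, seconds
    comm_bytes: float      # bytes exchanged between devices per step
    peak_mem_bytes: float  # peak memory on the most heavily loaded device


def multi_objective_reward(cost, baseline, weights=(1.0, 1.0, 1.0)):
    """Negative weighted mean of costs, each normalized by a baseline strategy.

    Assumption: a linear scalarization with user-chosen weights. A strategy
    identical to the baseline scores -1.0; cheaper strategies score higher.
    """
    w_time, w_comm, w_mem = weights
    total = (
        w_time * cost.runtime_s / baseline.runtime_s
        + w_comm * cost.comm_bytes / baseline.comm_bytes
        + w_mem * cost.peak_mem_bytes / baseline.peak_mem_bytes
    )
    return -total / (w_time + w_comm + w_mem)


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); the term is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    base = StrategyCost(runtime_s=2.0, comm_bytes=4e9, peak_mem_bytes=8e9)
    cand = StrategyCost(runtime_s=1.0, comm_bytes=2e9, peak_mem_bytes=4e9)
    r = multi_objective_reward(cand, base)
    advantage = r - multi_objective_reward(base, base)
    print(r, ppo_clipped_term(1.5, advantage))
```

Normalizing each cost by a baseline keeps seconds and bytes on a comparable scale before weighting; other scalarizations (for example, penalty terms for exceeding device memory) would fit the same interface.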

Keywords

distributed machine learning - deep learning - reinforcement learning

ISSUE

Volume: 15, Issue: 4 (2022)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/a15040108
