Issue #1 – Scaling Neural MT
18 Jul 2018
Author: Dr. Rohit Gupta, Sr. Machine Translation Scientist @ Iconic
Training a neural machine translation engine is a time-consuming task. It typically takes several days, or even weeks, even when running on powerful GPUs. Reducing this time is a priority for any neural MT developer. In this post we explore recent work (Ott et al., 2018) in which, without compromising translation quality, training is sped up by a factor of 4.9 on a single machine, and by a factor of 38.6 using sixteen such machines.
The key points of this training procedure are: half-precision floating point (FP16), Volta GPUs, and distributed synchronous SGD (stochastic gradient descent).
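To make the distributed synchronous SGD component concrete, here is a minimal sketch using PyTorch's DistributedDataParallel. This is an illustrative stand-in rather than the fairseq setup used in the paper; the model, data, and hyper-parameters are placeholders. Each worker computes gradients on its own mini-batch, and the gradients are averaged across all workers before every parameter update, so all model replicas stay in sync.

```python
# Minimal sketch of synchronous data-parallel SGD with PyTorch's
# DistributedDataParallel (illustrative only; not the paper's fairseq code).
# Launch with e.g.: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 512).cuda(local_rank)     # placeholder for a full NMT model
    ddp_model = DDP(model, device_ids=[local_rank])  # wraps the model for gradient all-reduce
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    criterion = nn.MSELoss()

    for step in range(100):                          # placeholder training loop with random data
        src = torch.randn(32, 512, device=local_rank)
        tgt = torch.randn(32, 512, device=local_rank)
        optimizer.zero_grad()
        loss = criterion(ddp_model(src), tgt)
        loss.backward()                              # backward triggers a synchronous all-reduce of gradients
        optimizer.step()                             # every worker applies the same averaged update

if __name__ == "__main__":
    main()
```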
Half-precision floating point (FP16)
Floating-point formats determine how computers store numbers in memory. FP16 requires half the storage space and half the memory bandwidth of single-precision floating point (FP32), so FP16 computation can be faster on some machines. However, FP16 has lower precision and a smaller range than FP32. In general, FP32 has just enough capacity to carry out the computations required in neural networks, which is why it is the format primarily used. FP16, on the other hand, in a vanilla setting has the disadvantage that gradients can underflow or overflow due to its limited numeric range.
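A common remedy for this underflow is loss scaling: the loss is multiplied by a scale factor before the backward pass so that small gradient values stay representable in FP16, and the scaling is undone before the weights are updated. The sketch below shows the idea using PyTorch's torch.cuda.amp utilities; this is a modern convenience API, not the fairseq implementation from Ott et al. (2018), and the model, data, and hyper-parameters are placeholders.

```python
# Minimal sketch of mixed-precision training with dynamic loss scaling,
# using torch.cuda.amp (illustrative only; not the paper's fairseq code).
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(512, 512).cuda()           # placeholder for a full NMT model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
scaler = GradScaler()                        # dynamic loss scaling guards against FP16 underflow

for step in range(100):                      # placeholder training loop with random data
    src = torch.randn(32, 512, device="cuda")
    tgt = torch.randn(32, 512, device="cuda")

    optimizer.zero_grad()
    with autocast():                         # run the forward pass in FP16 where it is safe
        loss = criterion(model(src), tgt)

    scaler.scale(loss).backward()            # scale the loss so small gradients don't vanish in FP16
    scaler.step(optimizer)                   # unscale the gradients, then apply the update
    scaler.update()                          # lower the scale automatically if an overflow was detected
```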