Two of the most popular machine learning models used today are BERT, for natural language processing (NLP), and Mask R-CNN, for image recognition. Over the past several months, AWS has significantly improved the underlying infrastructure, network, machine learning (ML) framework, and model code to achieve the best training time for these two popular state-of-the-art models. Today, we are excited to share the world’s fastest model training times to date on the cloud on TensorFlow, MXNet, and PyTorch. You can now use these hardware and software optimizations to train your TensorFlow, MXNet, and PyTorch models with the same speed and efficiency.
Model training time directly impacts your ability to iterate and improve on the accuracy of your models quickly. The primary way to reduce training time is by distributing the training job across a large cluster of GPU instances, but this is hard to do efficiently. If you distribute a training job across a large number of workers, you often have rapidly diminishing returns because the overhead in communication between instances begins to cancel out the additional GPU computing power.
BERT, or Bidirectional Encoder Representations from Transformers, is a popular NLP model, which at the time it was published was state-of-the-art on several common NLP tasks.
On a single Amazon EC2 P3dn.24xlarge instance, which has 8 NVIDIA V100 GPUs, it takes approximately three days to train BERT from scratch with TensorFlow and PyTorch. We reduced training time from three days to slightly over 60 minutes by efficiently scaling out to more P3dn.24xlarge instances, using network improvements with Elastic Fabric Adapter (EFA), and optimizing how this complex model converges on larger clusters. As of this writing, this is the fastest time-to-train for BERT on the cloud while achieving state-of-the-art target accuracy (F1 score of 90.4 or higher on Squad v2 tasks after training on BooksCorpus and English Wikipedia).
With TensorFlow, we achieved unprecedented scale with 2,048 GPUs on 256 P3dn.24xlarge instances to train BERT in 62 minutes. With PyTorch, we reduced training time to 69 minutes by scaling out to 1,536 GPUs on 192 P3dn.24xlarge instances. With all our optimizations to the entire hardware and software stack for training BERT, we achieved an 85% scaling efficiency, which makes sure the frameworks can use most of the additional computation power from GPUs when scaling to more P3dn.24xlarge nodes. The following table summarizes these improvements.
|P3DN.24xlarge Nodes||NVIDIA GPUs||Time to train (PyTorch)||Time to train (TensorFlow)|
|1||8||3 days||3 days|
Mask R-CNN is a widely used instance segmentation model that is used for autonomous driving, motion capture, and other uses that require sophisticated object detection and segmentation capabilities.
It takes approximately 80 hours to train Mask R-CNN on a single P3dn.24xlarge instance (8 NVIDIA V100 GPUs) with MXNet, PyTorch, and TensorFlow. We reduced training time from 80 hours to approximately 25 minutes on MXNet, PyTorch, and TensorFlow. We scaled Mask R-CNN training on all three ML frameworks to 24 P3dn.24xlarge instances, which gave us 192 GPUs. You can now rapidly iterate and run several experiments daily instead of waiting several days for results. As of this writing, this is the fastest time-to-train for Mask R-CNN on the cloud, while achieving state-of-the-art target accuracy (0.377 Box min AP, 0.339 Mask min AP on COCO2017 dataset). The following table summarizes these improvements.
|# of Nodes||# of GPUs||Time to train (MXNet)||Time to train (PyTorch)||Time to train (TensorFlow)|
|1||8||~80 hrs||~80 hrs||~80 hrs|
|24||192||25 min||26 min||27 min|
Achieving these results required optimizations to the underlying hardware, networking, and software stack . When training large models such as BERT, communication among the many GPUs in use becomes a bottleneck.
In distributed computing (large-scale training being one instance of it), AllReduce is an operation that reduces arrays (parameters of a neural network in this case) from different workers (GPUs) and returns the resultant array to all workers (GPUs). GPUs collectively perform an AllReduce operation after every iteration. Each iteration consists of one forward and backward pass through the network.
The most common approach to perform AllReduce on GPUs is to use NVIDIA Collective Communications Library (NCCL) or MPI libraries such as OpenMPI or Intel MPI Library. These libraries are designed for homogeneous clusters. AllReduce happens on the same instances that train the network. The AllReduce algorithm on homogeneous clusters involves each worker sending and receiving data approximately twice the size of the model for each AllReduce operation. For example, the AllReduce operation for BERT (which has 340 million parameters) involves sending approximately 650 MB of half-precision data twice and receiving the same amount of data twice. This communication needs to happen after every iteration and quickly becomes a bottleneck when training most models.
The choice of AllReduce algorithm usually depends on the network architecture. For example, Ring-AllReduce is a good choice for a network in which each node is connected to two neighbors, which forms a ring. Torus AllReduce algorithm is a good choice for a network in which each node is connected to four neighbors, which forms a 2D rectangular lattice. AWS uses a much more flexible interconnect, in which any node can communicate with any other node at full bandwidth. For example, in a cluster with 128 P3dn instances, any instance can communicate with any other instance at 100 Gbps.
Also, the 100 Gbps interconnect is not limited to P3dn instances. You can add CPU-optimized C5n instances to the cluster and still retain the 100 Gbps interconnect between any pair of nodes.
This high flexibility of the AWS interconnect begs for an AllReduce algorithm that makes full use of the unique capabilities of the AWS interconnect. We therefore developed a custom AllReduce algorithm optimized for the AWS network. The custom AllReduce algorithm exploits the 100 Gbps interconnect between any pair of nodes in a heterogeneous cluster and reduces the amount of data sent and received by each worker by half. The compute phase of the AllReduce algorithm is offloaded onto compute-optimized C5 instances, freeing up the GPUs to compute gradients faster. Because GPU instances don’t perform the reduction operation, sending gradients and receiving AllReduced gradients can happen in parallel. The number of hops required to AllReduce gradients is reduced to just two compared to homogeneous AllReduce algorithms, in which the number of network hops increases with the number of nodes. The total cost is also reduced because training completes much faster compared to training with only P3dn nodes.
When tested with BERT and Mask R-CNN, the results yielded significant improvements to single-node executions. Throughput scaled almost linearly as the number of P3dn nodes scaled from 1 to 16, 32, 64, 128, 192, and 256 instances, which ultimately helped to reduce model training times by scaling to additional P3dn.24xlarge instances without increasing cost. With these optimizations, AWS can now offer you the fastest model training times on the cloud for state-of-the-art computer vision and NLP models.
Get started with TensorFlow, MXNet, and PyTorch today on Amazon SageMaker.
About the authors
Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to use deep learning engines. In his spare time, he enjoys playing tennis, reading historical fiction, and traveling.
Kevin Haas is the engineering leader for the AWS Deep Learning team, focusing on providing performance and usability improvements for AWS machine learning customers. Kevin is very passionate about lowering the friction for customers to adopt machine learning in their software applications. Outside of work, he can be found dabbling with open source software and volunteering for the Boy Scouts.
Indu Thangakrishnan is a Software Development Engineer at AWS. He works on training deep neural networks faster. In his spare time, he enjoys binge-listening Audible and playing table tennis.