Optimizing DLRM by using PyTorch with oneCCL Backend


The modern deep learning models are growing at an exponential rate, and those latest models could grow their parameters from million to billions. To train those modern models within hours, distributed training is a better option for those big models.

Intel® oneAPI Collective Communications Library

The Intel® oneAPI Collective Communications Library (oneCCL) enables developers and researchers to more quickly train newer and deeper models. This is done by using optimized communication patterns to distribute model training across multiple nodes.

  • Built on top of lower-level communication middleware. MPI and libfabrics transparently support many interconnects, such as Intel® Omni-Path Architecture, InfiniBand*, and Ethernet.
  • Optimized for high performance on Intel® CPUs and GPUs.
  • Allows the tradeoff of compute for communication performance to drive scalability of communication patterns.
  • Enables efficient implementations of collectives that are heavily used for neural network training, including all-gather, all-reduce, and reduce-scatter.
Fig. 1 Software stacks for PyTorch DistibutedDataParallel. CCL is one of communication backend options.

DLRM : a new era of deep learning workloads from Facebook

Over the last two years, a lot of research has been published that addresses the fusion of artificial intelligence (AI) and high-performance computing (HPC). While the presented research is focusing on the extreme-scale and HPC aspects, it is often limited to training convolutional neural nets (CNN).

Fig.2 Schematic of the DLRM topology.

Multi-Socket and Multi-Nodes DLRM

The original DLRM code from Facebook has support for device-based hybrid parallelization strategies. It follows data parallelism for MLP and model-parallelism for embeddings. This means that MLP layers are replicated on each agent and the input to the first MLP layer is distributed over minibatch dimension whereas embedding tables are distributed across available agents and each table produces output worth full minibatch.

Multi-Socket / Multi-node DLRM results and related performance benefit from oneCCL

For multi sockets performance numbers, we measured the performance with a 64 sockets cluster in 32 dual-socket nodes with Intel Xeon Platinum 8280 processor. Each socket has 28 CPU cores.

  • The Small variant is identical to the model problem used in DLRM’s release paper [2].
  • Large variant is the small problem scaled in every aspect for scale-out runs and is the best representative for production workloads in terms of actual compute and memory capacity requirements.
  • The MLPerf configuration is recently proposed as a benchmark config for performance evaluation of a recommendation system training [3].
Fig. 4 DLRM strong scaling performance comparison.
Fig. 5 DLRM weak scaling performance comparison.
Fig. 6 : Compute-Communication time break up for Large Config
Fig. 7 : Compute-Communication time break up for MLPerf Config

BFLOAT16 training supported by oneCCL backend on Intel Xeon scalable processors

As a continuation to our CPU optimizations, we explored low precision DLRM training using BFLOAT16 data type that is supported on 3rd generation Intel Scaleable Xeon processors code-named Cooper Lake (CPX).

Fig. 8 Split-SGD BF16 performance


To train moden models within hours, distributed training is a better option for those big models.


[1] Dhiraj Kalamkar , Jianping Chen Evangelos Georganas Sudarshan Srinivasan Mikhail Shiryaev Alexander Heinecke “Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store


PyTorch is an open source machine learning platform that provides a seamless path from research prototyping to production deployment.