- “Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Dao et al 2022
- “Pathways: Asynchronous Distributed Dataflow for ML”, Barham et al 2022
- “Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads”, Shukla et al 2022
- “Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam”, Lu et al 2022
- “Introducing the AI Research SuperCluster—Meta’s Cutting-edge AI Supercomputer for AI Research”, Lee & Sengupta 2022
- “Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, Bailey 2022
- “SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Ryabinin et al 2021
- “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Lin et al 2021
- “China Has Already Reached Exascale—On Two Separate Systems”, Hemsoth 2021
- “PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management”, Fang et al 2021
- “Chimera: Efficiently Training Large-Scale Neural Networks With Bidirectional Pipelines”, Li & Hoefler 2021
- “First-Generation Inference Accelerator Deployment at Facebook”, Anderson et al 2021
- “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Jouppi et al 2021
- “ChinAI #141: The PanGu Origin Story: Notes from an Informative Zhihu Thread on PanGu”, Ding 2021
- “GSPMD: General and Scalable Parallelization for ML Computation Graphs”, Xu et al 2021
- “PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models With Auto-parallel Computation”, Zeng et al 2021
- “Podracer Architectures for Scalable Reinforcement Learning”, Hessel et al 2021
- “High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)”, Mudigere et al 2021
- “Efficient Large-Scale Language Model Training on GPU Clusters”, Narayanan et al 2021
- “Large Batch Simulation for Deep Reinforcement Learning”, Shacklett et al 2021
- “TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Li et al 2021
- “PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, He et al 2021
- “Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, Launay et al 2020
- “Exploring the Limits of Concurrency in ML Training on Google TPUs”, Kumar et al 2020
- “Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Wongpanich et al 2020
- “Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, Domke et al 2020
- “Interlocking Backpropagation: Improving Depthwise Model-parallelism”, Gomez et al 2020
- “Are We in an AI Overhang?”, Jones 2020
- “Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS With Asynchronous Reinforcement Learning”, Petrenko et al 2020
- “There’s Plenty of Room at the Top: What Will Drive Computer Performance After Moore’s Law?”, Leiserson et al 2020
- “A Domain-specific Supercomputer for Training Deep Neural Networks”, Jouppi et al 2020
- “Microsoft Announces New Supercomputer, Lays out Vision for Future AI Work”, Langston 2020
- “Computation in the Human Cerebral Cortex Uses Less Than 0.2 Watts yet This Great Expense Is Optimal When considering Communication Costs”, Levy & Calvert 2020
- “Pipelined Backpropagation at Scale: Training Large Models without Batches”, Kosson et al 2020
- “2019 Recent Trends in GPU Price per FLOPS”, Bergal 2020
- “Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos”, Lin et al 2019
- “AI and Compute”, Amodei et al 2018
- “When Will Computer Hardware Match the Human Brain?”, Moravec 1998
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones (eg. sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency–quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms). Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution. These properties of Monarch matrices unlock new ways to train and fine-tune sparse and dense models. We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and Wikitext-103 language modeling by 2× with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called “reverse sparsification”, Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2× without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7× with comparable accuracy.
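The two-factor structure is concrete enough to sketch. Below is a minimal NumPy sketch under our own simplifying assumptions (√n blocks of size √n, with the permutation taken to be the reshape-transpose “perfect shuffle”; the paper’s parameterization is more general):

```python
import numpy as np

def monarch_matvec(x, B1, B2):
    """Multiply x by a Monarch-structured matrix M = P.BD(B2).P.BD(B1),
    where BD(.) is block-diagonal and P is the (m, m) reshape-transpose
    permutation. B1, B2 have shape (m, m, m): m dense blocks of size m,
    so for n = m*m the matvec costs O(n^1.5) work instead of O(n^2)."""
    m = B1.shape[0]
    z = x.reshape(m, m)                 # split input into m blocks of size m
    z = np.einsum('kij,kj->ki', B1, z)  # block-diagonal multiply by B1
    z = z.T                             # permutation P (perfect shuffle)
    z = np.einsum('kij,kj->ki', B2, z)  # block-diagonal multiply by B2
    z = z.T                             # permutation P again
    return z.reshape(-1)
```

The batched small dense matmuls map well onto tensor cores (the “hardware-efficient” half of the claim), while a dense n×n matvec needs n² multiply-adds versus 2·n^1.5 here.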
“Pathways: Asynchronous Distributed Dataflow for ML”, Barham et al 2022 (2022-03-23):
We present the design of a new large scale orchestration layer for accelerators.
Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state-of-the-art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns.
We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD [single program multiple data] computations over 2×1024 = 2048 TPUv3s [97% utilization training a 128b T5 model], while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
Lowering costs by driving high utilization across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft’s globally distributed scheduling service for highly-efficient and reliable execution of deep learning training and inference workloads. At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators (eg. GPUs, FPGAs).
All jobs in Singularity are preemptible, migratable, and dynamically resizable (elastic) by default: a live job can be dynamically and transparently (a) preempted and migrated to a different set of nodes, cluster, data center or a region and resumed exactly from the point where the execution was preempted, and (b) resized (ie. elastically scaled-up/down) on a varying set of accelerators of a given type. Our mechanisms are transparent in that they do not require the user to make any changes to their code or require using any custom libraries that may limit flexibility. Additionally, our approach significantly improves the reliability of deep learning workloads. We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on the steady-state performance. Finally, our design approach is agnostic of DNN architectures and handles a variety of parallelism strategies (eg. data/pipeline/model parallelism).
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD. Its benefits, however, remain an open question on Adam-based model training (eg. BERT and GPT).
In this paper, we propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel designs: (1) adaptive variance state freezing, which eliminates the requirement of running expensive full-precision communication at early stage of training; (2) 1-bit sync, which allows skipping communication rounds with bit-free synchronization over Adam’s optimizer states, momentum and variance.
In theory, we provide convergence analysis for 0/1 Adam on smooth non-convex objectives, and show the complexity bound is better than original Adam under certain conditions.
On various benchmarks such as BERT-Base/Large pretraining and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam reduces data volume by up to 87% and communication rounds by 54%, and achieves up to 2× higher throughput compared to the state-of-the-art 1-bit Adam, while enjoying the same statistical convergence speed and end-to-end model accuracy on the GLUE dataset and the ImageNet validation set.
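For intuition, the core 1-bit primitive that 1-bit Adam established, and that 0/1 Adam extends, is sign compression with error feedback; a minimal sketch (our simplification: one scale per tensor, and neither the variance-freezing schedule nor the round-skipping policy is shown):

```python
import numpy as np

def onebit_compress(update, error_buf):
    """Compress an optimizer update to 1 bit/element plus one float scale,
    carrying the quantization error into the next round (error feedback).
    error_buf is updated in place."""
    corrected = update + error_buf           # re-inject last round's residual
    scale = np.abs(corrected).mean()         # single per-tensor magnitude
    compressed = scale * np.sign(corrected)  # what actually goes on the wire
    error_buf[:] = corrected - compressed    # residual for the next round
    return compressed
```

Per the abstract, 0/1 Adam’s “1-bit sync” goes further by skipping whole communication rounds over the optimizer states once their variance has been frozen.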
“Introducing the AI Research SuperCluster—Meta’s Cutting-edge AI Supercomputer for AI Research”, Lee & Sengupta 2022
Developing the next generation of advanced AI will require powerful new computers capable of quintillions of operations per second. Today, Meta is announcing that we’ve designed and built the AI Research SuperCluster (RSC)—which we believe is among the fastest AI supercomputers running today and will be the fastest AI supercomputer in the world when it’s fully built out in mid-2022. Our researchers have already started using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of one day training [dense?] models with trillions of parameters. RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more. Our researchers will be able to train the largest models needed to develop advanced AI for computer vision, NLP, speech recognition, and more. We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people.
…The first generation of this infrastructure, designed in 2017, has 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day.
…RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs—with each A100 GPU being more powerful than the V100 used in our previous system. The GPUs communicate via an NVIDIA Quantum 200 Gb/s InfiniBand 2-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray [NVM], 46 petabytes of cache storage in Penguin Computing Altus [AMD Epyc] systems, and 10 petabytes of Pure Storage FlashBlade.
…Once we complete phase 2 of building out RSC, we believe it will be the fastest AI supercomputer in the world [past NVIDIA’s Selene used for Megatron-Turing NLG 530B, currently on par with Perlmutter], performing at nearly 5 exaflops of mixed precision compute. Through 2022, we’ll work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5×. [bringing it past Summit] The InfiniBand fabric will expand to support 16,000 ports in a 2-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB⁄s and exabyte-scale capacity to meet increased demand.
“Is Programmable Overhead Worth The Cost? How Much Do We Pay for a System to Be Programmable? It Depends upon Who You Ask”, Bailey 2022
In his 2021-12-07 DAC keynote “GPUs, Machine Learning, and EDA”, Bill Dally, chief scientist and senior VP of research at Nvidia, compared some of the processors his company has developed with custom accelerators for AI. “The overhead of fetching and decoding, all the overhead of programming, of having a programmable engine, is on the order of 10% to 20%—small enough that there’s really no gain to a specialized accelerator. You get at best 20% more performance and lose all the advantages and flexibility that you get by having a programmable engine”, he said.
Later in his talk he broke this down into a little more detail:
“If you are doing a single half-precision floating-point multiply/add (HFMA), which is where we started with Volta, your energy per operation is about 1.5 picojoules, and your overhead is 30 picojoules [see Figure 2]. You’ve got a 20× overhead. You’re spending 20× as much energy on the general administration than you are in the engineering department. But if you start amortizing (using more complex instructions), you get to only 5× with the dot product instruction, 20% with the half-precision matrix multiply accumulate (HMMA), and 16% for the integer multiply accumulate (IMMA). At that point, the advantages of programmability are so large, there’s no point making a dedicated accelerator. You’re much better off building a general-purpose programmable engine, like a GPU, and having some instructions you accelerate.”
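Dally’s ratios are simple arithmetic over a fixed ~30 pJ fetch/decode cost; a quick check (the math-energy values other than HFMA’s quoted 1.5 pJ are back-derived here from the quoted overhead ratios, not published measurements, and “dot-product” is our label for the instruction he leaves unnamed):

```python
# Fixed instruction overhead amortized over increasingly complex math ops.
OVERHEAD_PJ = 30.0
math_energy_pj = {
    "HFMA": 1.5,          # quoted directly: 20x overhead
    "dot-product": 6.0,   # back-derived from the quoted ~5x
    "HMMA": 150.0,        # back-derived from the quoted ~20%
    "IMMA": 187.5,        # back-derived from the quoted ~16%
}
for name, pj in math_energy_pj.items():
    print(f"{name}: overhead = {OVERHEAD_PJ / pj:.2f}x the math energy")
```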
That does not sit well with many people, and it certainly is not reflected by the billions of dollars of venture capital flowing into AI accelerators.
[Dally’s keynote abstract: “GPU-accelerated computing and machine learning (ML) have revolutionized computer graphics, computer vision, speech recognition, and natural language processing. We expect ML and GPU-accelerated computing will also transform EDA software and as a result, chip design workflows. Recent research shows that orders of magnitude of speedups are possible with accelerated computing platforms and that the combination of GPUs and ML can enable automation on tasks previously seen as intractable or too difficult to automate. This talk will cover near-term applications of GPUs and ML to EDA tools and chip design as well as a long-term vision of what is possible. The talk will also cover advances in GPUs and ML-hardware that are enabling this revolution.”]
“Startup Tenstorrent shows AI is changing computing and vice versa: Tenstorrent is one of the rush of AI chip makers founded in 2016 and finally showing product. The new wave of chips represent a substantial departure from how traditional computer chips work, but also point to ways that neural network design may change in the years to come”
“SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient”, Ryabinin et al 2021
Many deep learning applications benefit from using large models with billions of parameters. These models can only be trained with specialized distributed training algorithms that require low-latency and high-bandwidth interconnect. As a result, large models are typically trained in dedicated GPU clusters that can be extremely costly to deploy and operate. In contrast, there are more affordable distributed training setups, such as using cheap “preemptible” instances or pooling together existing resources from multiple regions. However, both these setups come with unique challenges that make it impractical to train large models using conventional model parallelism. In this work, we carefully analyze these challenges and find configurations where training larger models becomes less communication-intensive. Based on these observations, we propose SWARM Parallelism (Stochastically Wired Adaptively Rebalanced Model Parallelism)—a model-parallel training algorithm designed for swarms of poorly connected, heterogeneous unreliable devices. SWARM creates temporary randomized pipelines between available nodes that are rebalanced in case of failure. To further reduce the network usage of our approach, we develop several compression-aware architecture modifications and evaluate their tradeoffs. Finally, we combine our insights to train a large Transformer language model with 1.1B shared parameters (~13B before sharing) on a swarm of preemptible T4 GPUs with less than 400Mb/s network throughput.
[Keywords: distributed training, model-parallel, model parallelism, pipeline, fault tolerance, communication efficiency, volunteer computing]
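The “stochastically wired” pipelines can be sketched as a per-microbatch routing step (illustrative only: the worker fields and the load-based weighting are our assumptions, not the authors’ implementation):

```python
import random

def route_microbatch(stage_pools, rng):
    """Wire one temporary pipeline through a swarm of unreliable workers.
    Each worker is a dict: {"name": str, "alive": bool, "load": float}.
    Dead workers are skipped and less-loaded workers are preferred,
    giving fault tolerance and adaptive rebalancing in one step."""
    path = []
    for stage, pool in enumerate(stage_pools):
        alive = [w for w in pool if w["alive"]]
        if not alive:
            raise RuntimeError(f"no live workers for stage {stage}")
        weights = [1.0 / (1.0 + w["load"]) for w in alive]
        path.append(rng.choices(alive, weights=weights, k=1)[0]["name"])
    return path
```

Because the path is re-drawn constantly, a preempted instance simply stops being selected and a slow node is chosen less often, which is the rebalancing.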
“M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Lin et al 2021
Recent expeditious developments in deep learning algorithms, distributed training, and even hardware design for large models have enabled training extreme-scale models, such as GPT-3 and Switch Transformer, possessing hundreds of billions or even trillions of parameters. However, under limited resources, extreme-scale model training that requires enormous amounts of compute and memory footprint suffers from frustratingly low efficiency in model convergence. In this paper, we propose a simple training strategy called “Pseudo-to-Real” for large models with high memory-footprint requirements. Pseudo-to-Real is compatible with large models whose architecture is a sequence of layers. We demonstrate a practice of pretraining an unprecedented 10-trillion-parameter model, an order of magnitude larger than the state of the art, on only 512 GPUs within 10 days. Besides demonstrating the application of Pseudo-to-Real, we also provide a technique, Granular CPU offloading, to manage CPU memory for training large models and maintain high GPU utilization. Fast training of extreme-scale models on a modest amount of resources can bring a much smaller carbon footprint and contribute to greener AI.
[Keywords: Extreme-Scale Pretraining, Language Modeling, Natural Language Processing]
“China Has Already Reached Exascale—On Two Separate Systems”, Hemsoth 2021 (2021-10-26):
[see also “Why Did China Keep Its Exascale Supercomputers Quiet?”] …The supercomputing community has long been used to public results on the Top 500 list of the world’s most powerful systems, with countries actively vying for supremacy. However, with tensions at a peak and the entity list haunting the spirit of international competition, we can expect China to remain mum about some dramatic system leaps, including the fact that the country had already broken the (true/LINPACK) exascale barrier in 2021, on more than one machine.
We have it on outstanding authority (under condition of anonymity) that LINPACK was run in March 2021 on the Sunway “Oceanlite” system, the follow-on to the #4-ranked Sunway TaihuLight machine. The results yielded 1.3 exaflops peak performance, with 1.05 exaflops sustained, in the ideal 35-megawatt power sweet spot.
…The same authority confirmed that a second exascale run in China, this time on the Tianhe-3 system, which we previewed back in May 2019, reached almost identical performance with 1.3 exaflops peak and enough sustained to be functional exascale. We do not have a power figure for this but we were able to confirm this machine is based on the FeiTeng line of processors from Phytium, which is Arm-based with a matrix accelerator. (For clarity, FeiTeng is kind of like “Xeon”, it’s a brand of CPUs from Phytium).
…From what we can tell, these 2 exascale systems show modest changes to architectures: a doubling of chip elements and sockets. That is not to minimize the effort, but we do not suspect new architectures behind another coming bit of news, a so-called Futures program that aims to deliver a 20-exaflop supercomputer by 2025, according to our same source, who is based in the United States but in the know about happenings in China.
But here’s something to keep in mind as we go forward in this frigid international climate: perhaps we can no longer expect to have a clear, Top 500 supercomputer-list view into national competitiveness in quite the same way. If China, always a contender with the United States, is running LINPACK but not making the results public, what happens to the validity and international importance of that list, which has been a symbol of HPC progress for decades? What does China have to lose? Would it not be in the national interest to show off not one but 2 validated exascale systems, with both peak and sustained results?
Here is something subtle to consider: the forthcoming “Frontier” supercomputer at Oak Ridge National Lab in the U.S. is expected to debut with 1.5 peak exaflops and an expected sustained figure around 1.3 exaflops. Perhaps China has decided to quietly leak that they are first to true exascale without having to publish benchmark results that might show a slightly better performance figure for a US-based machine. Just something to think about.
And here’s another subtle detail. Our source confirms these LINPACK results for both of China’s exascale systems—the first in the world—were achieved in March 2021. When did the entity list appear citing Phytium and Sunway and the centers that host their showboat systems? In April 2021.
“PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management”, Fang et al 2021
The pre-trained model (PTM) is revolutionizing artificial intelligence (AI) technology. It can learn general language features on massive data and then be fine-tuned on task-specific data. Unfortunately, the computing hardware required for PTM training is prohibitively expensive, which makes it a game for only a small proportion of the AI community.
Therefore, we propose a system called PatrickStar to lower the hardware requirements of PTMs and make them accessible to everyone. PatrickStar uses the CPU-GPU heterogeneous memory space to store the model data. Unlike existing works, we manage the model data in a fine-grained manner, organizing it into memory chunks that are dynamically distributed across the heterogeneous memory space. Guided by runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory, lowering the CPU-GPU data-transmission volume. In symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs using data parallelism, with lower communication-bandwidth requirements and more efficient bandwidth utilization. The system can train bigger models with larger batch sizes than existing works can handle.
Experimental results show that PatrickStar trains a 12-billion-parameter GPT model, 1.5× the model-scale limit of the SOTA works, on a node with 8×V100 GPUs and 240GB of CPU memory, and also achieves significantly higher computing efficiency than SOTA. Even on a $700 personal computer, it can train a 0.7-billion-parameter GPT model.
Our code is publicly available on GitHub.
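The chunk orchestration described above can be caricatured as a small cache of parameter chunks over GPU memory (a toy sketch with hypothetical names; the real system sizes and places chunks using warm-up statistics rather than plain LRU eviction):

```python
class ChunkManager:
    """Keep at most `gpu_capacity_chunks` parameter chunks resident on GPU,
    offloading the coldest chunk to CPU memory when space runs out."""
    def __init__(self, gpu_capacity_chunks):
        self.capacity = gpu_capacity_chunks
        self.on_gpu = []  # LRU order: coldest chunk first

    def access(self, chunk_id):
        """Called just before a layer needs its chunk; afterwards the
        chunk is guaranteed to be GPU-resident."""
        if chunk_id in self.on_gpu:
            self.on_gpu.remove(chunk_id)   # refresh its LRU position
        elif len(self.on_gpu) >= self.capacity:
            self.on_gpu.pop(0)             # evict coldest chunk to CPU
        self.on_gpu.append(chunk_id)
        return chunk_id
```

Fine-grained chunks (rather than whole tensors) are what let the system keep the GPU nearly full without thrashing the PCIe link.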
“Chimera: Efficiently Training Large-Scale Neural Networks With Bidirectional Pipelines”, Li & Hoefler 2021
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline-parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, making it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation-memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16×–2.34× over the state-of-the-art synchronous and asynchronous pipeline approaches.
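For intuition about where the bubbles come from: the idle fraction of a standard synchronous pipeline has a simple closed form, sketched below (the (p−1)/(m+p−1) accounting is the usual GPipe-style estimate, not Chimera’s exact schedule):

```python
def bubble_fraction(p, m):
    """Idle fraction of a p-stage synchronous pipeline fed m micro-batches:
    ramp-up and drain waste (p-1) slots out of (m + p - 1) total."""
    return (p - 1) / (m + p - 1)

# Deeper pipelines and fewer micro-batches mean more idle time; Chimera's
# two interleaved bidirectional pipelines fill up to half of these slots.
for p, m in [(4, 8), (8, 8), (16, 8)]:
    print(f"p={p}, m={m}: {bubble_fraction(p, m):.0%} idle")
```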
“First-Generation Inference Accelerator Deployment at Facebook”, Anderson et al 2021 (2021-07-08):
In this paper, we provide a deep dive into the deployment of inference accelerators at Facebook. Many of our ML workloads have unique characteristics, such as sparse memory accesses, large model sizes, as well as high compute, memory and network bandwidth requirements. We co-designed a high-performance, energy-efficient inference accelerator platform based on these requirements.
We describe the inference accelerator platform ecosystem we developed and deployed at Facebook: both hardware, through Open Compute Platform (OCP), and software framework and tooling, through Pytorch/Caffe2/Glow. A characteristic of this ecosystem from the start is its openness to enable a variety of AI accelerators from different vendors. This platform, with six low-power accelerator cards alongside a single-socket host CPU, allows us to serve models of high complexity that cannot be easily or efficiently run on CPUs.
We describe various performance optimizations, at both platform and accelerator level, which enables this platform to serve production traffic at Facebook. We also share deployment challenges, lessons learned during performance optimization, as well as provide guidance for future inference hardware co-design.
2021-jouppi.pdf: “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Jouppi et al 2021 (2021-06-14):
Google deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSAs); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNN) grow 1.5× annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size; and backwards ML compatibility helps deploy DNNs quickly. These lessons molded TPUv4i, an inference DSA deployed since 2020.
Table 1: Key characteristics of DSAs. The underlines show changes over the prior TPU generation, from left to right. System TDP includes power for the DSA memory system plus its share of the server host power, eg. add host TDP⁄8 for 8 DSAs per host.
- Document the unequal improvement in logic, wires, SRAM, and DRAM from 45 nm to 7 nm—including an update of Horowitz’s operation energy table from 45 nm to 7 nm—and show how these changes led to 4 systolic floating point matrix units for TPUv4i in 2020 versus one systolic integer matrix unit for TPUv1 in 2015.
- Explain the difference between designing for performance per TCO vs per CapEx, leading to HBM and a low TDP for TPUv4i, and show how TPUv1’s headroom led to application scaleup after the 2017 paper.
- Explain backwards ML compatibility, including why inference can need floating point and how it spurred the TPUv4i and TPUv4 designs (§3). Backwards ML compatible training also tailors DNNs to TPUv4i (§2).
- Measure production inference applications to show that DSAs normally run multiple DNNs concurrently, requiring Google inference DSAs to support multi-tenancy.
- Discuss how DNN advances change the production inference workload. The 2020 workload keeps MLP and CNN from 2017 but adds BERT, and RNN succeeds LSTM.
- Document the growth of production DNNs in memory size and computation by ~1.5× annually since 2016, which encourages designing DSAs with headroom.
- Show that Google’s TCO and TDP for DNN DSAs are strongly correlated (r = 0.99), likely due to the end of Dennard scaling. TDP offers a good proxy for DSA TCO.
- Document that the SLO limit is P99 time for inference applications, list typical batch sizes, and show how large on-chip SRAM helps P99 performance.
- Explain why TPUv4i architects chose compiler compatibility over binary compatibility for its VLIW ISA.
- Describe Google’s latest inference accelerator in production since March 2020 and evaluate its performance/TDP vs. TPUv3 and NVIDIA’s T4 inference GPU using production apps and MLPerf Inference benchmarks 0.5–0.7.
…TPUv1 required quantization—since it supported only integer arithmetic—which proved a problem for some datacenter applications. Early in TPUv1 development, application developers said a 1% drop in quality was acceptable, but they changed their minds by the time the hardware arrived, perhaps because DNN overall quality improved so that 1% added to a 40% error was relatively small but 1% added to a 12% error was relatively large.
…Alas, DNN DSA designers often ignore multi-tenancy. Indeed, multi-tenancy is not mentioned in the TPUv1 paper. (It was lucky that the smallest available DDR3 DRAM held 8GB, allowing TPUv1 software to add multi-tenancy.)
…BERT appeared in 2018, yet it’s already 28% of the workload.
…Crucially, PanGu was a joint effort by researchers from both Huawei and Recurrent AI (循环智能), a provider of AI enterprise services. I was curious about PanGu. A simple search led me to a Zhihu thread titled: “What do you think of the PanGu model released by Huawei on April 25?” Zhihu, known as China’s Quora, is the country’s largest Q&A forum. The initial post linked to an article by Recurrent AI on PanGu. Plus, there were 40 responses to the thread, many of which were very insightful.
Key Takeaways from article linked in the initial Zhihu post: In the article, Recurrent AI claims that PanGu improves on GPT-3 in 3 aspects. The key word here is “claims” as I wasn’t able to trace many of these points to the results reported in the PanGu article itself:
- First, it supposedly “surpasses GPT-3 in few-shot learning tasks, addressing issues the latter faces in dealing with complex commercial scenarios with few (training data) samples. For example, in scenarios involving customer voice analysis and analysis of employees’ ability to carry out tasks, when the PanGu NLP large model is used to produce semantic analysis, the sample size required to obtain the target result is only one-tenth of the GPT-3 model. That is, AI’s production efficiency can be increased 10×.”
- Second, the PanGu team added prompt-based tasks in the pre-training phase, which greatly reduced the difficulty of fine-tuning. There have been difficulties with fine-tuning previous large models for different industry scenarios. One example from the article: “In a scenario about finding more target customers to increase the conversion rate, in which companies use communication content to determine customer purchase intentions, we found that the PanGu model can increase the order conversion rate by 27% compared to GPT-3.”
- I’m not completely sure what Recurrent AI is arguing on the third innovation that PanGu makes on top of GPT-3. They write, “PanGu can recognize intent (of customers?) through few-shot learning, and transform them into queries of knowledge bases and databases, which addresses the issue that large models are difficult to integrate with industry knowledge and data in the past.” My best guess is that they are arguing PanGu can adapt better to industry-specific vocabularies and communications.
- …Right at the beginning of his post, Jin clarifies that Huawei actually released 2 large NLP models at the HDC conference (both named after PanGu). The other one was an encoder-decoder Transformer. Here’s the key point: the training of both 100-billion-parameter-scale models was a collaboration between various Huawei divisions and Peng Cheng Lab (PCL), which provided computing-power support…CloudBrain 1 is a large-scale cluster system with 100 petaflops of computing power, including NVIDIA GPUs, Huawei GPUs, and Cambricon AI chips. A machine of 1,000 petaflops will probably be built next year, which can be used by universities, research institutes, and SMEs for training models. The goal of 1,000 petaflops (an exaflop) is generally considered a big milestone for compute over the next few years.
- He concludes with his expectation of future trends: “In order to gain more knowledge from pre-training, models such as GPT-3 and PanGu will become larger and larger. After all, we have not seen the limit of the pre-training benefits for large models. At that time, this type of model will have greater infrastructure requirements, and data parallelism and optimization strategies will be more complex . . . in the future, we need more researchers to devote themselves to the research of general intelligence and large-scale distributed computing.”
“GSPMD: General and Scalable Parallelization for ML Computation Graphs”, (2021-05-10; ; similar):
[blog] We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computation graphs. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models.
GSPMD infers the partitioning for every operator in the graph based on limited user annotations, making it convenient to scale up existing single-device programs. It solves several technical challenges for production usage, such as static shape constraints, uneven partitioning, exchange of halo data, and nested operator partitioning.
These techniques allow GSPMD to achieve 50% to 62% compute utilization on 128 to 2048 Cloud TPUv3 cores for models with up to one trillion parameters.
GSPMD produces a single program for all devices, which adjusts its behavior based on a run-time partition ID, and uses collective operators for cross-device communication. This property allows the system itself to be scalable: the compilation time stays constant with increasing number of devices.
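The single-program, partition-ID-parameterized style that the abstract describes can be caricatured in a few lines of plain Python. This is a toy stand-in, not GSPMD’s real API (which lives inside the XLA compiler): one program is written once, every simulated device runs it with a different run-time partition ID, and cross-device exchange happens through a collective.

```python
# Toy illustration (not GSPMD's real interface): a *single* program is written
# once and parameterized by a run-time partition ID; cross-device communication
# is a collective (here, a simulated all-gather). Because one program serves
# all devices, compile time is independent of device count.

NUM_PARTITIONS = 4

def shard(tensor, partition_id):
    """Row-partition a 'tensor' (list of rows) for one device."""
    rows_per_part = len(tensor) // NUM_PARTITIONS
    start = partition_id * rows_per_part
    return tensor[start:start + rows_per_part]

def all_gather(shards):
    """Collective op: every device ends up with the concatenated result."""
    return [row for s in shards for row in s]

def device_program(x, w, partition_id):
    """The same program runs on every device; behavior differs only via ID."""
    local_x = shard(x, partition_id)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in local_x]

x = [[1.0] * 4 for _ in range(8)]   # 8x4 input, row-sharded 4 ways
w = [[2.0] * 4 for _ in range(4)]   # 4x4 weight, replicated everywhere
partials = [device_program(x, w, pid) for pid in range(NUM_PARTITIONS)]
result = all_gather(partials)
print(len(result), len(result[0]))  # 8 4
```

In the real system the user only annotates a few tensors (here, the row-sharding of `x`); the partitioner propagates shardings to every other operator in the graph.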
“PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models With Auto-parallel Computation”, Zeng et al 2021
Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters such as GPT-3 have demonstrated strong performances on natural language understanding and generation with few-shot in-context learning.
In this work, we present our practice on training large-scale autoregressive language models named PanGu-α, with up to 200 billion parameters. PanGu-α is developed under the MindSpore platform and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-α, we collect 1.1TB high-quality Chinese data from a wide range of domains to pretrain the model.
We empirically test the generation ability of PanGu-α in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings.
“Podracer Architectures for Scalable Reinforcement Learning”, (2021-04-13; ; ; similar):
Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive parts of training and inference in modern deep learning systems. Popular training pipelines that use these frameworks for deep learning typically focus on (un-)supervised learning. How to best train reinforcement learning (RL) agents at scale is still an active research area.
In this report we argue that TPUs are particularly well suited for training RL agents in a scalable, efficient and reproducible way. Specifically we describe two architectures designed to make the best use of the resources available on a TPU Pod (a special configuration in a Google data center that features multiple TPU devices connected to each other by extremely low latency communication channels).
Anakin: When using small neural networks and grid-world environments, an Anakin architecture can easily perform 5 million steps per second, even on the 8-core TPU accessible for free through Google Colab.
This can be very useful to experiment and debug research ideas in the friendly Colab environment. In Figure 4a we show how, thanks to the efficient network connecting different TPU cores in a Pod, performance scales almost linearly with the number of cores; the collective operations used to average gradients across replicas appear to cause only minimal overhead.
In a recent paper by Oh et al 2021, Anakin was used at a much larger scale to discover a general reinforcement learning update from experience of interacting with a rich set of environments implemented in JAX. In this paper, Anakin was used to learn a single shared update rule from 60K JAX environments and 1K policies running and training in parallel.
Despite the complex nature of the system, based on the use of neural networks to meta-learn not just a policy but the entire RL update, Anakin delivered over 3 million steps per second on a 16-core TPU. Training the update rule to a good level of performance required running Anakin for ~24 hours; this would cost ~$100 on GCP’s preemptible instances.
…Sebulba: Our second podracer architecture has also been extensively used for exploring a variety of RL ideas at scale, on environments that cannot be compiled to run on TPU (eg. Atari, DMLab and MuJoCo). As both IMPALA and Sebulba are based on a decomposition between actors and learners, agents designed for the IMPALA architecture can be easily mapped onto Sebulba; for instance a Podracer version of the V-trace agent easily reproduced the results from Espeholt et al 2018. However, we found that training an agent for 200 million frames of an Atari game could be done in just ~1 hour, by running Sebulba on an 8-core TPU. This comes at a cost of ~$2.88, on GCP’s pre-emptible instances. This is similar in cost to training with the more complex SEED RL framework, and much cheaper than training an agent for 200 million Atari frames using either IMPALA or a single-stream GPU-based system such as that traditionally used by DQN.
…In addition to the trajectory length the effective batch size used to compute each update also depends on how many times we replicate the basic 8-TPU setup. Sebulba also scales effectively along this dimension: using 2048 TPU cores (an entire Pod) we were able to further scale all the way to 43 million frames per second, solving the classic Atari videogame Pong in less than 1 minute…Sebulba has also been used to train search-based agents inspired by MuZero (Schrittwieser et al 2020). The workloads associated with these agents are very different from those of model-free agents like IMPALA. The key difference is in the cost of action selection. This increases because MuZero’s policy combines search with deep neural networks (used to guide and/or truncate the search). Typically, search-based agents like MuZero required custom C++ implementations of the search to deliver good performance. We could reproduce results from MuZero (no Reanalyse) on multiple RL environments, using Sebulba and a pure JAX implementation of MCTS. Training a MuZero agent with Sebulba for 200M Atari frames takes 9 hours on a 16-core TPU (at a cost of ~$40 on GCP’s preemptible instances).
We found that scalability, via replication, was particularly useful in this context. Figure 4c reports the number of frames per second processed by Sebulba when running MuZero on Atari, as a function of the number of TPU cores. The throughput increased linearly with the number of cores.
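The Anakin pattern above can be sketched in a few lines. This is a hypothetical pure-Python stand-in for the authors’ JAX-on-TPU code: the environment *and* the policy are both side-effect-free batched step functions, so one fused loop advances thousands of environment copies in lockstep with no host↔device round-trips per step — which is why throughput scales with batch size and core count.

```python
import random

# Toy sketch of the Anakin pattern (illustrative, not the authors' code): a
# batched grid-world and a batched policy fused into one loop. On a TPU the
# per-lane functions would be vmap-ed and the loop would live on-device.

BATCH = 1024  # one "environment" per lane

def env_step(states, actions):
    """Batched grid-world: move right (+1) or left (-1) on a line of length 10."""
    return [max(0, min(9, s + (1 if a else -1))) for s, a in zip(states, actions)]

def policy(states):
    """Trivial batched policy: move right when left of center."""
    return [s < 5 for s in states]

random.seed(0)
states = [random.randrange(10) for _ in range(BATCH)]
for _ in range(20):                 # the whole loop would be compiled on-device
    states = env_step(states, policy(states))

# Every lane converges to oscillating around the policy's target region.
print(sorted(set(states)))  # [4, 5]
```

The point of the exercise: because nothing in the loop touches the host, the only scaling limit is the collective used to average gradients across replicas, which the report finds adds minimal overhead.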
“High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (DLRMs)”, Mudigere et al 2021
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain 40× speedup in terms of time to solution over previous systems. We achieve this by
- designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport;
- implementing an optimized PyTorch-based training stack supporting both model and data parallelism;
- developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load-balancing them across multiple workers;
- adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates;
- leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining.
Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.
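Row-wise embedding-table sharding, one of the partitioning schemes the abstract mentions, can be illustrated with a toy lookup router. This is a hypothetical sketch (the names and sizes are mine, not the paper’s): each worker owns a contiguous block of rows, lookups are routed to the owning worker, and the results are sum-pooled as DLRMs do.

```python
# Hypothetical sketch of row-wise embedding sharding: each worker owns a
# contiguous block of rows; a lookup routes each id to its owner. (Column-wise
# sharding would instead split every row's vector across workers.)

NUM_WORKERS = 4
NUM_ROWS, DIM = 16, 3

# Worker w owns rows [w*4, w*4+4); values here are just row-index markers.
shards = {w: {r: [float(r)] * DIM for r in range(w * 4, w * 4 + 4)}
          for w in range(NUM_WORKERS)}

def owner(row_id):
    return row_id // (NUM_ROWS // NUM_WORKERS)

def lookup(row_ids):
    """Route each id to its owning worker, then sum-pool (as DLRMs do)."""
    vectors = [shards[owner(r)][r] for r in row_ids]
    return [sum(col) for col in zip(*vectors)]

pooled = lookup([1, 6, 13])         # touches workers 0, 1 and 3
print(pooled)  # [20.0, 20.0, 20.0]
```

Load balancing is then the problem the paper’s sharding algorithms solve: real tables have wildly skewed sizes and access frequencies, so the row/column blocks must be assigned to workers so that no worker becomes the straggler.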
“Efficient Large-Scale Language Model Training on GPU Clusters”, (2021-04-09; ; similar):
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: (1) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and (2) the number of compute operations required to train these models can result in unrealistically long training times.
New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges. Unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, eg. expensive cross-node communication or idle periods waiting on other devices.
In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model.
Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code is open source.
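The arithmetic behind composing the three parallelism dimensions is simple enough to check by hand: the tensor-parallel degree t, pipeline-parallel degree p, and data-parallel degree d must satisfy t·p·d = number of GPUs, and the abstract’s throughput figures imply a per-GPU peak. A back-of-the-envelope sketch (the candidate degrees below are my assumptions, not the paper’s exact search space):

```python
# Composing tensor (t), pipeline (p), and data (d) parallelism requires
# t * p * d = #GPUs; achieved throughput is a fraction of aggregate peak.

gpus = 3072
valid_configs = [(t, p, gpus // (t * p))
                 for t in (1, 2, 4, 8)              # tensor-parallel in a node
                 for p in (1, 2, 4, 8, 16, 32, 64)  # pipeline stages across nodes
                 if gpus % (t * p) == 0]
assert all(t * p * d == gpus for t, p, d in valid_configs)

# From the abstract: 502 PFLOP/s achieved at 52% of peak on 3072 GPUs.
achieved_pflops = 502
peak_per_gpu = achieved_pflops / gpus / 0.52        # implied per-GPU peak
print(f"{peak_per_gpu * 1e3:.0f} TFLOPS peak per GPU")  # prints "314 TFLOPS peak per GPU"
```

The implied ~314 TFLOPS per GPU is consistent with the A100’s ~312 TFLOPS BF16 tensor-core peak, which is a useful sanity check on the abstract’s numbers.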
“Large Batch Simulation for Deep Reinforcement Learning”, (2021-03-12; ; ; similar):
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single 8-GPU machine.
The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of “batch simulation”: accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput.
To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches.
By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over 3 days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.
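The economics of batch simulation reduce to amortizing a fixed per-step cost over many simultaneous requests. A toy cost model (the numbers are illustrative, not the paper’s measurements) makes the claim concrete:

```python
# Toy cost model of batch simulation: fixed per-step costs (scene-asset
# access, synchronization, kernel launch) are paid once per *batch* rather
# than once per request, so throughput grows with batch size.

LOAD_COST, RENDER_COST = 100, 1   # arbitrary units of work per step

def serial_cost(n_requests):
    return n_requests * (LOAD_COST + RENDER_COST)

def batched_cost(n_requests):
    return LOAD_COST + n_requests * RENDER_COST   # fixed cost amortized

n = 1024
speedup = serial_cost(n) / batched_cost(n)
print(round(speedup, 1))  # 92.0
```

The larger the fixed cost relative to the marginal per-request cost, the closer batching gets to the two-orders-of-magnitude gains reported above.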
“TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models”, Li et al 2021
Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to their autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0× for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at
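Why the slice boundaries matter, and why they must be optimized rather than chosen uniformly, can be shown with a miniature version of the problem. This sketch uses an assumed cost model, not TeraPipe’s exact one: with causal attention, token i costs roughly (i+1) units of work, so equal-length slices are unbalanced; pipeline time for slices across P stages is modeled as the sum of slice costs plus (P−1) times the largest slice cost, and a brute-force search over split points stands in for the paper’s dynamic program (which makes the search tractable at real sequence lengths).

```python
from itertools import combinations

# Toy token-level slicing problem (assumed cost model, not TeraPipe's):
# token i of a causal Transformer costs ~(i+1) work, so uniform slices leave
# the last pipeline stage as a straggler. We model pipeline time as
#   sum(cost(s_j)) + (P - 1) * max(cost(s_j))
# and search all split points for the minimum.

SEQ, STAGES, K = 12, 4, 4          # 12 tokens, 4 pipeline stages, 4 slices

def cost(start, end):              # attention cost of tokens [start, end)
    return sum(i + 1 for i in range(start, end))

def pipeline_time(bounds):
    costs = [cost(a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    return sum(costs) + (STAGES - 1) * max(costs)

best = min(((0,) + cut + (SEQ,) for cut in combinations(range(1, SEQ), K - 1)),
           key=pipeline_time)
uniform = (0, 3, 6, 9, 12)
print(pipeline_time(best), pipeline_time(uniform))  # 147 177
```

Even at 12 tokens, balancing slice *cost* rather than slice *length* cuts the modeled pipeline time by ~17%; at GPT-3 scale the gap is what the dynamic program exploits.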
“PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers”, He et al 2021
The size of Transformer models is growing at an unprecedented pace. It has only taken less than one year to reach trillion-level parameters after the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated and elastic pipelining and data parallelism for efficient distributed training of Transformer models. PipeTransformer automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training, and instead allocates resources for training of the remaining active layers. More specifically, PipeTransformer dynamically excludes converged layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets. Our results show that PipeTransformer attains a 2.4× speedup compared to the state-of-the-art baseline. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. We also develop open-sourced flexible APIs for PipeTransformer, which offer a clean separation among the freeze algorithm, model definitions, and training accelerations, hence allowing it to be applied to other algorithms that require similar freezing strategies.
“Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment”, Launay et al 2020
The scaling hypothesis motivates the expansion of models past trillions of parameters as a path towards better performance. Recent important developments, such as GPT-3, have been driven by this conjecture. However, as models scale up, training them efficiently with backpropagation becomes difficult. Because model, pipeline, and data parallelism distribute parameters and gradients over compute nodes, communication is challenging to orchestrate: this is a bottleneck to further scaling. In this work, we argue that alternative training methods can mitigate these issues, and can inform the design of extreme-scale training hardware. Indeed, using a synaptically asymmetric method with a parallelizable backward pass, such as Direct Feedback Alignment, communication needs are drastically reduced. We present a photonic accelerator for Direct Feedback Alignment, able to compute random projections with trillions of parameters. We demonstrate our system on benchmark tasks, using both fully-connected and graph convolutional networks. Our hardware is the first architecture-agnostic photonic co-processor for training neural networks. This is a substantial step towards building scalable hardware, able to go beyond backpropagation, and opening new avenues for deep learning.
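The structural difference from backpropagation can be shown in a minimal numeric sketch (toy sizes and learning rate are my assumptions): instead of propagating the error backwards through the transpose of the forward weights, Direct Feedback Alignment projects the *output* error to each hidden layer through a fixed random matrix B. The per-layer projections are independent of one another, which is what makes the backward pass parallelizable, and in the paper’s hardware the random projection is computed optically.

```python
import math, random

# Minimal Direct Feedback Alignment sketch (toy sizes, illustrative only):
# hidden-layer error = B^T e, a *fixed random* projection of the output
# error e, rather than W2^T e as in backpropagation.

random.seed(1)
N_IN, N_HID, N_OUT, LR = 4, 8, 2, 0.05
rand = lambda r, c: [[random.gauss(0, 0.5) for _ in range(c)] for _ in range(r)]
W1, W2 = rand(N_IN, N_HID), rand(N_HID, N_OUT)
B = rand(N_OUT, N_HID)               # fixed random feedback, never trained

def matvec(M, v):                    # computes v @ M
    return [sum(vi * M[i][j] for i, vi in enumerate(v)) for j in range(len(M[0]))]

def step(x, target):
    h = [math.tanh(a) for a in matvec(W1, x)]        # forward pass
    y = matvec(W2, h)
    e = [yi - ti for yi, ti in zip(y, target)]       # output error
    # DFA: hidden error is a random projection of e, not a backprop signal
    dh = [d * (1 - hi * hi) for d, hi in zip(matvec(B, e), h)]
    for i in range(N_HID):                           # local weight updates
        for j in range(N_OUT):
            W2[i][j] -= LR * h[i] * e[j]
    for i in range(N_IN):
        for j in range(N_HID):
            W1[i][j] -= LR * x[i] * dh[j]
    return sum(ei * ei for ei in e)

x, target = [1.0, -1.0, 0.5, 0.0], [0.5, -0.5]
losses = [step(x, target) for _ in range(50)]
print("initial vs final loss:", round(losses[0], 4), round(losses[-1], 4))
```

Note what is *not* needed here: the W1 update never reads W2, so with one fixed random matrix per layer, all layers can compute their updates concurrently — the communication pattern the paper exploits.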
Recent results in language understanding using neural networks have required training hardware of unprecedented scale, with thousands of chips cooperating on a single training run. This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism to overcome scaling limitations from the fixed batch size in data parallelism, communication/collective optimizations, distributed evaluation of training metrics, and host input processing scaling optimizations. These techniques are demonstrated in both the TensorFlow and JAX programming frameworks. We also present performance results from the recent Google submission to the MLPerf-v0.7 benchmark contest, achieving record training times from 16 to 28 seconds in 4 MLPerf models on the Google TPU-v3 Multipod machine.
“Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour”, Wongpanich et al 2020
EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks. Currently, EfficientNets can take on the order of days to train; for example, training an EfficientNet-B0 model takes 23 hours on a Cloud TPU v2–8 node. In this paper, we explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by speedups that can be achieved when training at such scales. We discuss optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules as well as utilizing distributed evaluation and batch normalization techniques. Additionally, we present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset in order to analyze the behavior of EfficientNets at scale. With our optimizations, we are able to train EfficientNet on ImageNet to an accuracy of 83% in 1 hour and 4 minutes.
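The large-batch recipe the abstract alludes to typically combines a scaled learning rate with a warmup ramp. A hedged sketch (the base constants here are illustrative conventions, not the paper’s values): the peak rate scales linearly with batch size relative to a reference batch, and ramps up from near zero over the first few epochs to keep early training stable at batch 65536.

```python
# Sketch of the standard large-batch learning-rate recipe (illustrative
# constants): linear scaling with batch size plus linear warmup.

BASE_LR, BASE_BATCH = 0.1, 256

def scaled_lr(batch_size, epoch, warmup_epochs=5):
    peak = BASE_LR * batch_size / BASE_BATCH      # linear scaling rule
    if epoch < warmup_epochs:                     # linear warmup ramp
        return peak * (epoch + 1) / warmup_epochs
    return peak

print(round(scaled_lr(65536, 0), 4))   # 5.12  (first warmup epoch)
print(round(scaled_lr(65536, 10), 4))  # 25.6  (post-warmup peak)
```

At batch sizes this large, plain SGD with linear scaling usually breaks down, which is why the paper discusses “selecting large batch optimizers” (layer-wise adaptive methods in the LARS family) alongside the schedule.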
“Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?”, Domke et al 2020
Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too.
Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.
The number of parameters in state of the art neural networks has drastically increased in recent years. This surge of interest in large scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism suffers from poor resource utilisation, which leads to wasted resources.
In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation, while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image classification ResNets and Transformer language models, finding that our strategy consistently out-performs local learning in terms of task performance, and out-performs global learning in training efficiency.
I am worried we’re in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100× larger projects at Google, Facebook and the like, with timelines measured in months.
…GPT-3 has been estimated to cost $5m in compute to train, and—looking at the author list and OpenAI’s overall size—maybe another $10m in labour.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100× is entirely plausible right now. All that’s necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability. A concrete example is Waymo, which is raising $2bn investment rounds—and that’s for a technology with a much longer road to market…The current hardware floor is nearer to the RTX 2080 TI’s $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10× speedup to our model.
…I think the key question is if by 1000×, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is—and I think it’s plausible that it will be—then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP. On paper that leaves room for another 1000× scale-up as it reaches up to $1tn, though current market mechanisms aren’t really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That’s from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.
“Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS With Asynchronous Reinforcement Learning”, Petrenko et al 2020
Increasing the scale of reinforcement learning experiments has allowed researchers to achieve unprecedented results in both training sophisticated agents for video games, and in sim-to-real transfer for robotics. Typically such experiments rely on large distributed systems and require expensive hardware setups, limiting wider access to this exciting area of research. In this work we aim to solve this problem by optimizing the efficiency and resource utilization of reinforcement learning algorithms instead of relying on distributed computation. We present the “Sample Factory”, a high-throughput training system optimized for a single-machine setting. Our architecture combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques, allowing us to achieve throughput higher than 10^5 environment frames/second on non-trivial control problems in 3D without sacrificing sample efficiency. We extend Sample Factory to support self-play and population-based training and apply these techniques to train highly capable agents for a multiplayer first-person shooter game. The source code is available at
“There’s Plenty of Room at the Top: What Will Drive Computer Performance After Moore’s Law?”, Leiserson et al 2020
2020-leiserson.pdf: “There’s plenty of room at the Top: What will drive computer performance after Moore’s law?”, (2020-06-05; ; ; similar):
From bottom to top: The doubling of the number of transistors on a chip every 2 years, a seemingly inevitable trend that has been called Moore’s law, has contributed immensely to improvements in computer performance. However, silicon-based transistors cannot get much smaller than they are today, and other approaches should be explored to keep performance growing. Leiserson et al review recent examples and argue that the most promising place to look is at the top of the computing stack, where improvements in software, algorithms, and hardware architecture can bring the much-needed boost…The miniaturization of semiconductor transistors has driven the growth in computer performance for more than 50 years. As miniaturization approaches its limits, bringing an end to Moore’s law, performance gains will need to come from software, algorithms, and hardware. We refer to these technologies as the “Top” of the computing stack to distinguish them from the traditional technologies at the “Bottom”: semiconductor physics and silicon-fabrication technology. In the post-Moore era, the Top will provide substantial performance gains, but these gains will be opportunistic, uneven, and sporadic, and they will suffer from the law of diminishing returns. Big system components offer a promising context for tackling the challenges of working at the Top.
…Unfortunately, semiconductor miniaturization is running out of steam as a viable way to grow computer performance—there isn’t much more room at the “Bottom.” If growth in computing power stalls, practically all industries will face challenges to their productivity. Nevertheless, opportunities for growth in computing performance will still be available, especially at the “Top” of the computing-technology stack: software, algorithms, and hardware architecture.
Advances: Software can be made more efficient by performance engineering: restructuring software to make it run faster. Performance engineering can remove inefficiencies in programs, known as software bloat, arising from traditional software-development strategies that aim to minimize an application’s development time rather than the time it takes to run. Performance engineering can also tailor software to the hardware on which it runs, for example, to take advantage of parallel processors and vector units.
Algorithms offer more-efficient ways to solve problems. Indeed, since the late 1970s, the time to solve the maximum-flow problem improved nearly as much from algorithmic advances as from hardware speedups. But progress on a given algorithmic problem occurs unevenly and sporadically and must ultimately face diminishing returns. As such, we see the biggest benefits coming from algorithms for new problem domains (eg. machine learning) and from developing new theoretical machine models that better reflect emerging hardware.
Hardware architectures can be streamlined—for instance, through processor simplification, where a complex processing core is replaced with a simpler core that requires fewer transistors. The freed-up transistor budget can then be redeployed in other ways—for example, by increasing the number of processor cores running in parallel, which can lead to large efficiency gains for problems that can exploit parallelism. Another form of streamlining is domain specialization, where hardware is customized for a particular application domain. This type of specialization jettisons processor functionality that is not needed for the domain. It can also allow more customization to the specific characteristics of the domain, for instance, by decreasing floating-point precision for machine-learning applications.
In the post-Moore era, performance improvements from software, algorithms, and hardware architecture will increasingly require concurrent changes across other levels of the stack. These changes will be easier to implement, from engineering-management and economic points of view, if they occur within big system components: reusable software with typically more than a million lines of code or hardware of comparable complexity. When a single organization or company controls a big component, modularity can be more easily re-engineered to obtain performance gains. Moreover, costs and benefits can be pooled so that important but costly changes in one part of the big component can be justified by benefits elsewhere in the same component.
OUTLOOK: As miniaturization wanes, the silicon-fabrication improvements at the Bottom will no longer provide the predictable, broad-based gains in computer performance that society has enjoyed for more than 50 years. Software performance engineering, development of algorithms, and hardware streamlining at the Top can continue to make computer applications faster in the post-Moore era. Unlike the historical gains at the Bottom, however, gains at the Top will be opportunistic, uneven, and sporadic. Moreover, they will be subject to diminishing returns as specific computations become better explored.
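The kind of “Top”-level gain the review describes can be made concrete with a small example of my own (not one from the paper): the same task — summing every length-k window of a sequence — done with a naive O(n·k) loop versus an algorithmically engineered O(n) sliding window. The results are identical; only the amount of work changes.

```python
import random

# Performance engineering at the "Top": same answer, asymptotically less work.

def window_sums_naive(xs, k):
    # O(n*k): recompute each window from scratch
    return [sum(xs[i:i + k]) for i in range(len(xs) - k + 1)]

def window_sums_engineered(xs, k):
    # O(n): compute the first window once, then slide in O(1) per step
    out = [sum(xs[:k])]
    for i in range(k, len(xs)):
        out.append(out[-1] + xs[i] - xs[i - k])
    return out

random.seed(0)
xs = [random.randrange(100) for _ in range(1000)]
assert window_sums_naive(xs, 8) == window_sums_engineered(xs, 8)
print("both agree; the naive version does ~8x the additions")
```

As the review warns, such gains are opportunistic and face diminishing returns: once a computation is reduced to its asymptotically efficient form, further speedups must come from hardware streamlining or domain specialization rather than the algorithm.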
2020-jouppi.pdf#google: “A domain-specific supercomputer for training deep neural networks”, (2020-06-01; ):
Google’s TPU supercomputers train deep neural networks 50× faster than general-purpose supercomputers running a high-performance computing benchmark.
Microsoft has built one of the top five publicly disclosed supercomputers in the world, making new infrastructure available in Azure to train extremely large artificial intelligence models, the company is announcing at its Build developers conference.
Built in collaboration with and exclusively for OpenAI, the supercomputer hosted in Azure was designed specifically to train that company’s AI models. It represents a key milestone in a partnership announced last year to jointly create new supercomputing technologies in Azure.
…It’s also a first step toward making the next generation of very large AI models and the infrastructure needed to train them available as a platform for other organizations and developers to build upon. “The exciting thing about these models is the breadth of things they’re going to enable”, said Microsoft Chief Technical Officer Kevin Scott, who said the potential benefits extend far beyond narrow advances in one type of AI model. “This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now”, he said…Microsoft is also exploring large-scale AI models that can learn in a generalized way across text, images and video. That could help with automatic captioning of images for accessibility in Office, for instance, or improve the ways people search Bing by understanding what’s inside images and videos.
…The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. Compared with other machines listed on the TOP500 supercomputers in the world, it ranks in the top five, Microsoft says. Hosted in Azure, the supercomputer also benefits from all the capabilities of a robust modern cloud infrastructure, including rapid deployment, sustainable datacenters and access to Azure services.
“Computation in the Human Cerebral Cortex Uses Less Than 0.2 Watts yet This Great Expense Is Optimal When Considering Communication Costs”, Levy & Calvert 2020
[Later: Levy & Calvert 2021] Darwinian evolution tends to produce energy-efficient outcomes. On the other hand, energy limits computation, be it neural and probabilistic or digital and logical.
After establishing an energy-efficient viewpoint, we define computation and construct an energy-constrained, computational function that can be optimized.
This function implies a specific distinction between ATP-consuming processes, especially computation per se vs action potentials and other costs of communication. As a result, the partitioning of ATP-consumption here differs from earlier work. A bits/J optimization of computation requires an energy audit of the human brain. Instead of using the oft-quoted 20 watts of glucose available to the brain, the partitioning and audit reveals that cortical computation consumes 0.2 watts of ATP while long-distance communication costs are over 20× greater. The bits/joule computational optimization implies a transient information rate of more than 7 bits/sec/neuron.
Significance Statement: Engineers hold up the human brain as a low-energy form of computation. However, from the simplest physical viewpoint, a neuron’s computation cost is remarkably larger than the best possible bits/joule—off by a factor of 10⁸.
Here we explicate, in the context of energy consumption, a definition of neural computation that is optimal given explicit constraints. The plausibility of this definition as Nature’s perspective is supported by an energy-audit of the human brain.
The audit itself requires certain novel perspectives and calculations revealing that communication costs are 20× computational costs.
New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of alternative training algorithms. In this work we evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training algorithm that has significant hardware advantages. We introduce two methods, Spike Compensation and Linear Weight Prediction, that effectively mitigate the downsides caused by the asynchronicity of Pipelined Backpropagation and outperform existing techniques in our setting. We show that appropriate normalization and small batch sizes can also aid training. With our methods, fine-grained Pipelined Backpropagation using a batch size of one can match the accuracy of SGD for multiple networks trained on CIFAR-10 and ImageNet. Simple scaling rules allow the use of existing hyperparameters for traditional training without additional tuning.
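The core problem the paper’s mitigations address is weight staleness: in an asynchronous pipeline, a stage applies a gradient several optimizer steps after the forward pass that produced it. A minimal sketch of the general weight-prediction idea (illustrating the concept, not necessarily the paper’s exact formulation of Linear Weight Prediction) is to extrapolate the weights forward along the momentum direction by the known pipeline delay:

```python
# Sketch of linear weight prediction for asynchronous pipeline training.
# Assumption: plain SGD with momentum; `delay` is the number of optimizer
# steps between a stage's forward pass and the application of its gradient.

def predict_weights(w, velocity, lr, delay):
    """Extrapolate weights `delay` steps ahead along the momentum direction."""
    # Each future SGD-with-momentum step moves w by roughly -lr * velocity
    # (ignoring gradient contributions not yet computed), so predict linearly:
    return [wi - lr * delay * vi for wi, vi in zip(w, velocity)]

# Toy usage: one weight vector on a pipeline stage that is 3 steps stale.
w = [1.0, -0.5]
velocity = [0.2, 0.1]   # momentum buffer (running average of gradients)
w_pred = predict_weights(w, velocity, lr=0.1, delay=3)
```

Running the forward pass on the predicted weights rather than the stale ones reduces the mismatch between the weights used for activation computation and the weights the gradient is eventually applied to.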
Years for GPU price per FLOPS to fall by an order of magnitude:

| | Release Prices | 95th-Percentile Active Prices | 95th-Percentile Active Prices (pre-crypto price rise) |
|---|---|---|---|
| Single-precision: date range | 2007-11–2010-01 | 2011-03–2020-01 | 2011-03–2016-12 |
| $ / single-precision FLOPS | 12.5 | 17 | 16 |
| Half-precision: date range | 9/2014–1/2020 | 1/2015–1/2020 | 1/2015–12/2016 |
| $ / half-precision FLOPS | 8 | 10 | 8 |
| $ / half-precision FMA FLOPS | 4 | 4.5 | — |
Release price data seems to generally support the trends we found in active prices, with the notable exception of trends in GPU price / single-precision FLOPS, which cannot be explained solely by the different start dates. We think the best estimate of the overall trend for prices at which people recently bought GPUs is the 95th-percentile active price data from 2011–2020, since release price data does not account for existing GPUs becoming cheaper over time. The pre-crypto trends are similar to the overall trends, suggesting that the trends we are seeing are not anomalous due to cryptocurrency.
Given that, we guess that GPU prices as a whole have fallen at rates that would yield an order of magnitude over roughly:
- 17 years for single-precision FLOPS
- 10 years for half-precision FLOPS
- 5 years for half-precision fused multiply-add FLOPS
Half-precision FLOPS seem to have become cheaper substantially faster than single-precision in recent years. This may be a “catching up” effect as more of the space on GPUs was allocated to half-precision computing, rather than reflecting more fundamental technological progress.
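The “years per order of magnitude” figures translate directly into implied annual price-decline rates; a quick arithmetic check of the estimates above:

```python
# Convert "years per 10x price decline" into an implied annual decline rate.
def annual_decline(years_per_10x):
    factor = 10 ** (1 / years_per_10x)   # yearly price-performance factor
    return 1 - 1 / factor                # fractional price drop per year

# 17 / 10 / 5 years per order of magnitude, from the estimates above:
rates = {y: round(annual_decline(y), 3) for y in (17, 10, 5)}
# single-precision ~13%/yr, half-precision ~21%/yr, half-precision FMA ~37%/yr
```

So even the slowest trend (single-precision) implies GPUs delivering the same FLOPS for roughly 13% less money each year.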
[Keywords: AI Timelines, Featured Articles, Hardware and AI Timelines, Hardware progress]
Deep video recognition is more computationally expensive than image recognition, especially on large-scale datasets like Kinetics. Therefore, training scalability is essential to handle a large amount of videos.
In this paper, we study the factors that impact the training scalability of video networks. We recognize three bottlenecks, including data loading (data movement from disk to GPU), communication (data movement over networking), and computation FLOPs. We propose three design guidelines to improve the scalability: (1) fewer FLOPs and hardware-friendly operator to increase the computation efficiency; (2) fewer input frames to reduce the data movement and increase the data loading efficiency; (3) smaller model size to reduce the networking traffic and increase the networking efficiency.
With these guidelines, we designed a new operator, Temporal Shift Module (TSM), that is efficient and scalable for distributed training.
TSM model can achieve 1.8× higher throughput compared to previous I3D models. We scale up the training of the TSM model to 1,536 GPUs, with a mini-batch of 12,288 video clips/98,304 images, without losing the accuracy. With such hardware-aware model design, we are able to scale up the training on Summit supercomputer and reduce the training time on Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6× and 2.9× faster than previous 3D video models with higher accuracy.
The code and more details can be found here: http://tsm-hanlab.mit.edu
[Further reading: “Parameter Counts In Machine Learning” (2021-06-19), Akronomicon leaderboard.] We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000× (a 2-year doubling period would yield only a 7× increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.
Three factors drive the advance of AI: algorithmic innovation, data (which can be either supervised data or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, providing an opportunity to measure one input to AI progress. Of course, the use of massive compute sometimes just exposes the shortcomings of our current algorithms. But at least within many current domains, more compute seems to lead predictably to better performance, and is often complementary to algorithmic advances…The trend represents an increase by roughly a factor of 10 each year. It’s been partly driven by custom hardware that allows more operations to be performed per second for a given price (GPUs and TPUs), but it’s been primarily propelled by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.
Eras: Looking at the graph we can roughly see 4 distinct eras:
- Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
- 2012–2014: Infrastructure to train on many GPUs was uncommon, so most results used 1–8 GPUs rated at 1–2 TFLOPS for a total of 0.001–0.1 pfs-days.
- 2014–2016: Large-scale results used 10–100 GPUs rated at 5–10 TFLOPS, resulting in 0.1–10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
- 2016–2017: Approaches that allow greater algorithmic parallelism such as huge batch sizes, architecture search, and expert iteration, along with specialized hardware such as TPUs and faster interconnects, have greatly increased these limits, at least for some applications.
AlphaGoZero/AlphaZero is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible, and may already be happening in a production context.
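The pfs-day unit used in the era figures above (one petaflop/s sustained for a day, ≈8.64 × 10¹⁹ operations) makes them easy to sanity-check; a hypothetical 2012–2014-era run at full utilization:

```python
# One petaflop/s-day = 1e15 flop/s x 86,400 s ~= 8.64e19 operations.
PFS_DAY = 1e15 * 86_400

def pfs_days(n_gpus, tflops_per_gpu, days, utilization=1.0):
    """Total training compute in petaflop/s-days."""
    flops = n_gpus * tflops_per_gpu * 1e12 * utilization * days * 86_400
    return flops / PFS_DAY

# A hypothetical era-typical run: 8 GPUs at 2 TFLOPS for one day lands
# inside the 0.001-0.1 pfs-day range quoted for 2012-2014.
compute = pfs_days(n_gpus=8, tflops_per_gpu=2, days=1)   # ~0.016 pfs-days
```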
…Addendum: Compute used in older headline results (2019-11-07)
We’ve updated our analysis with data that span 1959 to 2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute-usage: (a) a first era, from 1959 to 2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train models over the last half-century.
Starting from the perceptron in 1959, we see a ~2-year doubling time for the compute used in these historical results—with a 3.4-month doubling time starting in ~2012. It’s difficult to draw a strong conclusion from this data alone, but we believe that this trend is probably due to a combination of the limits on the amount of compute that was possible to use for those results and the willingness to spend on scaling up experiments. [For one vivid account of the history of computing in AI in this period, see the “False Start” section in Hans Moravec’s 1998 article.]
“When Will Computer Hardware Match the Human Brain?”, Moravec 1998:
This paper describes how the performance of AI machines tends to improve at the same pace that AI researchers get access to faster hardware. The processing power and memory capacity necessary to match general intellectual performance of the human brain are estimated. Based on extrapolation of past trends and on examination of technologies under development, it is predicted that the required hardware will be available in cheap machines in the 2020s…At the present rate, computers suitable for human-like robots will appear in the 2020s. Can the pace be sustained for another three decades?
…By 1990, entire careers had passed in the frozen winter of 1-MIPS computers, mainly from necessity, but partly from habit and a lingering opinion that the early machines really should have been powerful enough. In 1990, 1 MIPS cost $1,000 (≈$2,467 in current dollars) in a low-end personal computer. There was no need to go any lower. Finally spring thaw has come. Since 1990, the power available to individual AI and robotics programs has doubled yearly, to 30 MIPS by 1994 and 500 MIPS by 1998. Seeds long ago alleged barren are suddenly sprouting. Machines read text, recognize speech, even translate languages. Robots drive cross-country, crawl across Mars, and trundle down office corridors. In 1996 a theorem-proving program called EQP running five weeks on a 50 MIPS computer at Argonne National Laboratory found a proof of a boolean algebra conjecture by Herbert Robbins that had eluded mathematicians for sixty years. And it is still only spring. Wait until summer.
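Moravec’s extrapolation is straightforward to reproduce. Taking his estimate of roughly 100 million MIPS for human-level performance (the paper’s figure, not a settled number) and the 500 MIPS available in 1998, the crossover year depends on which doubling time is assumed to persist:

```python
import math

# Moravec's human-brain estimate: ~100 million MIPS (the paper's figure).
BRAIN_MIPS = 100e6

# Doublings needed starting from 1998's 500 MIPS:
doublings = math.log2(BRAIN_MIPS / 500)   # ~17.6 doublings

# Crossover year under two doubling-time assumptions:
fast = 1998 + doublings * 1.0   # the 1990s' yearly doubling -> mid-2010s
slow = 1998 + doublings * 1.5   # a longer-run ~18-month doubling -> mid-2020s
```

The ~18-month assumption lands squarely in the 2020s, matching the paper’s prediction; the 1990s pace, if sustained, would have arrived a decade earlier.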
…The mental steps underlying good human chess playing and theorem proving are complex and hidden, putting a mechanical interpretation out of reach. Those who can follow the play naturally describe it instead in mentalistic language, using terms like strategy, understanding and creativity. When a machine manages to be simultaneously meaningful and surprising in the same rich way, it too compels a mentalistic interpretation. Of course, somewhere behind the scenes, there are programmers who, in principle, have a mechanical interpretation. But even for them, that interpretation loses its grip as the working program fills its memory with details too voluminous for them to grasp.
As the rising flood reaches more populated heights, machines will begin to do well in areas a greater number can appreciate. The visceral sense of a thinking presence in machinery will become increasingly widespread. When the highest peaks are covered, there will be machines that can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident.
[cf. Sejnowski 1997]