2021-ranganathan.pdf#google: “Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild”, (2021-02-27):
Video sharing (e.g. YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and—with the slowing of Moore’s law—specialized hardware accelerators to deliver more computing at higher efficiencies.
This paper describes the design and deployment, at scale, of a new accelerator targeted at warehouse-scale video transcoding. We present our hardware design including a new accelerator building block—the video coding unit (VCU)—and discuss key design trade-offs for balanced systems at data center scale and co-designing accelerators with large-scale distributed software systems. We evaluate these accelerators “in the wild” serving live data center jobs, demonstrating 20–33× improved efficiency over our prior well-tuned non-accelerated baseline. Our design also enables effective adaptation to changing bottlenecks, improved failure management, and new workload capabilities not otherwise possible with prior systems.
To the best of our knowledge, this is the first work to discuss video acceleration at scale in large warehouse-scale environments.
[Keywords: video transcoding, warehouse-scale computing, domain-specific accelerators, hardware-software co-design]
The VCU package is a full-length PCI-E card and looks a lot like a graphics card. A board has 2 Argos ASIC chips buried under a gigantic, passively cooled aluminum heat sink. There’s even what looks like an 8-pin power connector on the end, because PCI-E slot power alone just isn’t enough.
Google provided a lovely chip diagram that lists 10 “encoder cores” on each chip, with Google’s white paper adding that “all other elements are off-the-shelf IP blocks.” Google says that “each encoder core can encode 2160p in realtime, up to 60 FPS (frames per second) using 3 reference frames.”
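Taking the quoted figures at face value, the aggregate throughput of one 2-chip card is easy to tally (assuming “2160p” means standard 3840×2160 UHD frames; this arithmetic is illustrative, not from the paper):

```python
# Back-of-envelope throughput for one VCU card, using only the figures
# quoted above: 2 ASICs per card, 10 encoder cores per ASIC, each core
# encoding 2160p in realtime at up to 60 FPS.

CHIPS_PER_CARD = 2
CORES_PER_CHIP = 10
FPS_PER_CORE = 60
PIXELS_2160P = 3840 * 2160  # one UHD frame (assumed resolution)

cores = CHIPS_PER_CARD * CORES_PER_CHIP          # 20 encoder cores per card
frames_per_sec = cores * FPS_PER_CORE            # 1,200 UHD frames/s per card
pixels_per_sec = frames_per_sec * PIXELS_2160P   # ~10 billion pixels/s

print(cores, frames_per_sec, pixels_per_sec)
```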
The cards are specifically designed to slot into Google’s warehouse-scale computing system. Each compute cluster in YouTube’s system will house a section of dedicated “VCU machines” loaded with the new cards, saving Google from having to crack open every server and load it with a new card. Google says the cards resemble GPUs because they are what fit in its existing accelerator trays. CNET reports that “thousands of the chips are running in Google data centers right now”, and thanks to the cards, individual video workloads like 4K video “can be available to watch in hours instead of the days it previously took.”
Factoring in the research and development on the chips, Google says this VCU plan will save the company a ton of money, as shown in the benchmark below comparing the TCO (total cost of ownership) of the setup against running its algorithm on Intel Skylake chips and Nvidia T4 Tensor core GPUs.
“Security, Moore’s law, and the anomaly of cheap complexity”, (2018-05-29):
CyCon Tallinn 2019, Keynote: Security, Moore’s law, and the anomaly of cheap complexity
I was invited to keynote CyCon, and my talk was scheduled right before Bruce Schneier’s. I tried hard to make a talk that is accessible to people with a non-technical, non-engineering background, which nonetheless summarized the important things I had learnt about security. The core points are:
- CPUs are much more complex than they were 20 years ago; the feeling of being overwhelmed by complexity is not an illusion.
- We are sprinkling chips into objects like we are putting salt on food.
- We do this because complexity is cheaper than simplicity. We often use a cheap but complex computer to simulate a much simpler device for cost and convenience.
- The inherent complexity/power of the underlying computer has a tendency to break to the surface as soon as something goes wrong.
- Discrete Dynamical Systems and computers share many properties, and tiny changes have a tendency to cause large changes quickly.
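The dynamical-systems point can be illustrated in a few lines (my example, not from the talk): the logistic map is a textbook discrete dynamical system in which a perturbation of one part in a billion grows until the two trajectories bear no resemblance to each other.

```python
# Two runs of the logistic map from starting points that differ by 1e-9.
# In the chaotic regime (r=4), the gap roughly doubles per step until the
# trajectories decorrelate entirely.

def logistic(x, r=4.0):
    """One step of the logistic map x -> r*x*(1-x); chaotic at r=4."""
    return r * x * (1.0 - x)

a, b = 0.4, 0.4 + 1e-9   # tiny initial difference
max_gap = 0.0
for step in range(60):
    a, b = logistic(a), logistic(b)
    max_gap = max(max_gap, abs(a - b))

print(max_gap)  # a macroscopic gap, despite the 1e-9 starting difference
```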
This may be the most polished talk I have ever given—I did multiple dry-runs with different audiences, and bothered everybody and his dog with the slides.
I am particularly proud that Bruce Schneier seemed to have liked it; this is a big thing for me because reading “Applied Cryptography” and “A self-study course in block-cipher cryptanalysis” had a pretty substantial impact on my life.
“The Wheel of Reincarnation”, (2002):
[2002?] Short technology essay based on Myer & Sutherland 1968 (!) discussing a perennial pattern in computing history dubbed the ‘Wheel of Reincarnation’, in which old approaches inevitably reincarnate as the exciting new thing: shifts between ‘local’ and ‘remote’ computing resources, exemplified by repeated cycles in graphical display technology, from dumb ‘terminals’ which display only raw pixels to smart devices which interpret more complicated inputs like text, vectors, or programming languages (e.g. PostScript). These cycles are driven by cost, latency, architectural simplicity, and available computing power.
The Wheel of Reincarnation paradigm has played out for computers as well, in shifts from local terminals attached to mainframes to PCs to smartphones to ‘cloud computing’.
“Adversarial Reprogramming of Neural Networks”, (2018-06-28):
Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as confusing a cat with a computer. Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce attacks that instead reprogram the target model to perform a task chosen by the attacker—without the attacker needing to specify or compute the desired output for each test-time input. This attack finds a single adversarial perturbation that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary—even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six ImageNet classification models, repurposing these models to perform a counting task, as well as classification tasks: classification of MNIST and CIFAR-10 examples presented as inputs to the model.
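The mechanics of the attack can be sketched as follows (a toy stand-in: the “frozen model” here is a random 2-layer network, and the perturbation `P` is left random rather than trained; in the paper, `P` is learned by gradient descent against the real frozen network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" classifier: a random 2-layer net over 8x8 inputs
# with 10 output classes, standing in for a real ImageNet model.
W1 = rng.normal(size=(64, 32))
W2 = rng.normal(size=(32, 10))

def frozen_model(x_flat):
    h = np.tanh(x_flat @ W1)
    return h @ W2  # logits over the model's *original* 10 classes

# The adversarial "program" has two parts:
# 1. a single perturbation P, added identically to every test-time input,
#    which surrounds the smaller task input embedded in the model's input;
P = rng.normal(size=(8, 8))  # would be trained by gradient descent

# 2. a fixed, hard-coded remapping from original labels to task labels.
label_map = {0: "task-class-A", 1: "task-class-B"}

def reprogrammed_classify(task_input_4x4):
    x = np.zeros((8, 8))
    x[2:6, 2:6] = task_input_4x4   # place the task input in the center
    x = x + P                      # same perturbation for every input
    logits = frozen_model(x.reshape(-1))
    # only the remapped subset of the original classes is consulted
    best = max(label_map, key=lambda c: logits[c])
    return label_map[best]

print(reprogrammed_classify(np.ones((4, 4))))
```

Note that at test time the adversary never touches the model’s weights: the entire “program” is the pair (`P`, `label_map`).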
Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network. An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models, including LSTM, bi-directional LSTM, and CNN models, for alternate classification tasks.
“Inductive Biases for Deep Learning of Higher-Level Cognition”, (2020-11-30):
A fascinating hypothesis is that human and animal intelligence could be explained by a few principles (rather than an encyclopedic list of heuristics). If that hypothesis was correct, we could more easily both understand our own intelligence and build intelligent machines. Just like in physics, the principles themselves would not be sufficient to predict the behavior of complex systems like brains, and substantial computation might be needed to simulate human-like intelligence. This hypothesis would suggest that studying the kind of inductive biases that humans and animals exploit could help both clarify these principles and provide inspiration for AI research and neuroscience theories. Deep learning already exploits several key inductive biases, and this work considers a larger list, focusing on those which concern mostly higher-level and sequential conscious processing. The objective of clarifying these particular principles is that they could potentially help us build AI systems benefiting from humans’ abilities in terms of flexible out-of-distribution and systematic generalization, which is currently an area where a large gap exists between state-of-the-art machine learning and human intelligence.
“Pretrained Transformers as Universal Computation Engines”, (2021-03-09):
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning—in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a randomly initialized transformer to a randomly initialized LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.
What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework.
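A minimal sketch of the trade-off for the simplest conjugate pair, a Bernoulli learner with a Beta(1,1) prior (my toy instance; the paper handles general exponential families and more general loss/effort models): the teacher searches over aggregate sufficient statistics, then unpacks them into actual examples.

```python
# Toy optimal-teaching problem: a Bayesian learner with a Beta(1,1) prior
# over a coin's bias. The teacher wants the learner's posterior mean near
# a target, with as few teaching examples as possible.

TARGET = 0.3           # bias the teacher wants the learner to adopt
EFFORT_WEIGHT = 1e-4   # cost per teaching example

def objective(heads, tails):
    # learner's posterior after the teaching set: Beta(1+heads, 1+tails)
    posterior_mean = (1 + heads) / (2 + heads + tails)
    future_loss = (posterior_mean - TARGET) ** 2
    effort = EFFORT_WEIGHT * (heads + tails)
    return future_loss + effort

# search over aggregate sufficient statistics (heads, tails)...
heads, tails = min(((h, t) for h in range(50) for t in range(50)),
                   key=lambda ht: objective(*ht))
# ...then "unpack" them into an actual teaching sequence
teaching_set = [1] * heads + [0] * tails

print(heads, tails, teaching_set)
```

With these settings the search settles on 1 head and 4 tails (posterior mean 2/7 ≈ 0.286) rather than the cheapest *exact* set (2 heads, 6 tails, posterior mean 0.3 exactly): the effort penalty makes a small, slightly-off teaching set optimal, which is exactly the loss-vs-effort balance the framework formalizes.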
2015-zhu.pdf: “Machine Teaching: an Inverse Problem to Machine Learning and an Approach Toward Optimal Education”, (2015-01-01):
I draw the reader’s attention to machine teaching, the problem of finding an optimal training set given a machine learning algorithm and a target model. In addition to generating fascinating mathematical questions for computer scientists to ponder, machine teaching holds the promise of enhancing education and personnel training. The Socratic dialogue style aims to stimulate critical thinking.
“Dataset Distillation”, (2018-11-27):
Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called dataset distillation: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress 60,000 MNIST training images into just 10 synthetic distilled images (one per class) and achieve close to original performance with only a few gradient descent steps, given a fixed network initialization. We evaluate our method in various initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods.
Deep neural networks require large training sets but suffer from high computational cost and long training times. Training on much smaller training sets while maintaining nearly the same accuracy would be very beneficial. In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class. One-shot learning is an extreme form of few-shot learning where the model must learn a new class from a single example. We propose the ‘less than one’-shot learning task where models must learn N new classes given only M<N examples and we show that this is achievable with the help of soft labels. We use a soft-label generalization of the k-Nearest Neighbors classifier to explore the intricate decision landscapes that can be created in the ‘less than one’-shot learning setting. We analyze these decision landscapes to derive theoretical lower bounds for separating N classes using M<N soft-label samples and investigate the robustness of the resulting systems.
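The core trick can be sketched with a distance-weighted soft-label nearest-neighbor rule (a simplified version of the paper’s classifier; the prototype positions and soft labels here are made up for illustration): two soft-label points on a line carve it into three class regions, i.e. N = 3 classes from M = 2 examples.

```python
import numpy as np

# Two prototypes on the real line, each carrying a soft label over 3 classes.
prototypes = np.array([-1.0, 1.0])
soft_labels = np.array([[0.6, 0.4, 0.0],   # mostly class 0, some class 1
                        [0.0, 0.4, 0.6]])  # mostly class 2, some class 1

def classify(x, eps=1e-9):
    # inverse-distance weights, then aggregate the prototypes' soft labels
    weights = 1.0 / (np.abs(x - prototypes) + eps)
    scores = weights @ soft_labels
    return int(np.argmax(scores))

# Near either prototype its dominant class wins; in the middle, the
# shared class-1 mass wins, creating a third region neither prototype
# "owns": class 0 for x < -1/3, class 1 in between, class 2 for x > 1/3.
print([classify(x) for x in (-0.8, 0.0, 0.8)])  # → [0, 1, 2]
```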
“On the Design of Display Processors”, (1968-06):
The flexibility and power needed in the channel for a computer display are considered. To work efficiently, such a channel must have a sufficient number of instructions that it is best understood as a small processor rather than a powerful channel. Successive improvements to display processor design were found to lie on a circular path: by making improvements, one returns to the original simple design, plus one new general-purpose computer for each trip around. The degree of physical separation between display and parent computer is a key factor in display processor design.
[Keywords: display processor design, display system, computer graphics, graphic terminal, displays, graphics, display generator, display channel, display programming, graphical interaction, remote displays, Wheel of Reincarnation]