notes/Faster (Link Bibliography)

“notes/​Faster” links:

  1. #mcsherry-et-al-2015


  3. 2020-leiserson.pdf: ⁠, Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, Tao B. Schardl (2020-06-05; cs):

    From bottom to top: The doubling of the number of transistors on a chip every 2 years, a seemingly inevitable trend that has been called Moore’s law, has contributed immensely to improvements in computer performance. However, silicon-based transistors cannot get much smaller than they are today, and other approaches should be explored to keep performance growing. Leiserson et al. review recent examples and argue that the most promising place to look is at the top of the computing stack, where improvements in software, algorithms, and hardware architecture can bring the much-needed boost…The miniaturization of semiconductor transistors has driven the growth in computer performance for more than 50 years. As miniaturization approaches its limits, bringing an end to Moore’s law, performance gains will need to come from software, algorithms, and hardware. We refer to these technologies as the “Top” of the computing stack to distinguish them from the traditional technologies at the “Bottom”: semiconductor physics and silicon-fabrication technology. In the post-Moore era, the Top will provide substantial performance gains, but these gains will be opportunistic, uneven, and sporadic, and they will suffer from the law of diminishing returns. Big system components offer a promising context for tackling the challenges of working at the Top.

    …Unfortunately, semiconductor miniaturization is running out of steam as a viable way to grow computer performance—there isn’t much more room at the “Bottom.” If growth in computing power stalls, practically all industries will face challenges to their productivity. Nevertheless, opportunities for growth in computing performance will still be available, especially at the “Top” of the computing-technology stack: software, algorithms, and hardware architecture.

    Advances: Software can be made more efficient by performance engineering: restructuring software to make it run faster. Performance engineering can remove inefficiencies in programs, known as software bloat, arising from traditional software-development strategies that aim to minimize an application’s development time rather than the time it takes to run. Performance engineering can also tailor software to the hardware on which it runs, for example, to take advantage of parallel processors and vector units.
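    As a minimal illustration of this kind of restructuring (a sketch of my own, not code from the paper), interchanging the loops of a matrix multiply changes none of the arithmetic but makes the innermost loop traverse rows contiguously, the access pattern that caches and vectorizing compilers reward:

```python
def matmul_ijk(A, B):
    """Textbook loop order: the innermost loop strides down a column
    of B, touching a new row (on real hardware, usually a new cache
    line) on every iteration."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_ikj(A, B):
    """Interchanged loop order: identical arithmetic, but the innermost
    loop now walks rows of B and C contiguously."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a_ik = A[i][k]
            row_b, row_c = B[k], C[i]
            for j in range(n):
                row_c[j] += a_ik * row_b[j]
    return C
```

    (In CPython the interpreter overhead blunts the cache effect; in a compiled language the interchange alone can be worth a large constant factor, before parallelism or vector units are even touched.)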

    Algorithms offer more-efficient ways to solve problems. Indeed, since the late 1970s, the time to solve the maximum-flow problem improved nearly as much from algorithmic advances as from hardware speedups. But progress on a given algorithmic problem occurs unevenly and sporadically and must ultimately face diminishing returns. As such, we see the biggest benefits coming from algorithms for new problem domains (eg. machine learning) and from developing new theoretical machine models that better reflect emerging hardware.

    Hardware architectures can be streamlined—for instance, through processor simplification, where a complex processing core is replaced with a simpler core that requires fewer transistors. The freed-up transistor budget can then be redeployed in other ways—for example, by increasing the number of processor cores running in parallel, which can lead to large efficiency gains for problems that can exploit parallelism. Another form of streamlining is domain specialization, where hardware is customized for a particular application domain. This type of specialization jettisons processor functionality that is not needed for the domain. It can also allow more customization to the specific characteristics of the domain, for instance, by decreasing floating-point precision for machine-learning applications.

    In the post-Moore era, performance improvements from software, algorithms, and hardware architecture will increasingly require concurrent changes across other levels of the stack. These changes will be easier to implement, from engineering-management and economic points of view, if they occur within big system components: reusable software with typically more than a million lines of code or hardware of comparable complexity. When a single organization or company controls a big component, modularity can be more easily re-engineered to obtain performance gains. Moreover, costs and benefits can be pooled so that important but costly changes in one part of the big component can be justified by benefits elsewhere in the same component.

    OUTLOOK: As miniaturization wanes, the silicon-fabrication improvements at the Bottom will no longer provide the predictable, broad-based gains in computer performance that society has enjoyed for more than 50 years. Software performance engineering, development of algorithms, and hardware streamlining at the Top can continue to make computer applications faster in the post-Moore era. Unlike the historical gains at the Bottom, however, gains at the Top will be opportunistic, uneven, and sporadic. Moreover, they will be subject to diminishing returns as specific computations become better explored.

  4. End-to-end

  5. Tool-AI


  7. 2015-mcsherry.pdf: ⁠, Frank McSherry, Michael Isard, Derek G. Murray (2015-05; cs):

    We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.

    We survey measurements of data-parallel systems [for graph processing] recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.
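    The definition is mechanical enough to write down as code; a sketch with hypothetical numbers (nothing below is from the paper):

```python
def cost(baseline_seconds, system_runtimes):
    """COST of a platform for a problem: the smallest configuration
    (here keyed by core count) at which the platform beats a competent
    single-threaded baseline; None means the COST is unbounded."""
    for cores in sorted(system_runtimes):
        if system_runtimes[cores] < baseline_seconds:
            return cores
    return None

# Hypothetical measurements (seconds per graph computation), illustrative only:
baseline = 300                                    # competent single-threaded code
platform = {1: 2100, 16: 640, 128: 290, 512: 240}  # distributed system
```

    Here the hypothetical platform has a COST of 128 cores: it scales, but it pays for substantial parallelizable overhead before it earns anything over one thread.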

  8. ⁠, Nick Craver (2013-11-22):

    I like to think of Stack Overflow as running with scale but not at scale. By that I mean we run very efficiently, but I still don’t think of us as “big”, not yet…We like to call it magic, other people call it “multiple servers with multi-core processors”—but we’ll stick with magic. Here’s what runs the Stack Exchange network in that data center:

    • 4 MS SQL Servers
    • 11 IIS Web Servers
    • 2 Redis Servers
    • 3 Tag Engine servers (anything searching by tag hits this, eg. /​​​​​​questions/​​​​​​tagged/​​​​​​c++)
    • 3 Elasticsearch servers
    • 2 Load balancers (HAProxy)
    • 2 Networks (each a Nexus 5596 + Fabric Extenders)
    • 2 Cisco 5525-X ASAs (think Firewall)
    • 2 Cisco 3945 Routers

    …Performance is a feature, a very important feature to us. The main page loaded on all of our sites is the question page, affectionately known as Question/​Show (its route name) internally. On November 12th, that page rendered in an average of 28 milliseconds. While we strive to maintain 50ms, we really try and shave every possible millisecond off your pageload experience. All of our developers are certifiably anal curmudgeons when it comes to performance, so that helps keep times low as well. Here are the other top hit pages on SO, average render time over the same 24-hour period as above:

    • Question/​​​​​Show: 28 ms (29.7 million hits)
    • User Profiles: 39 ms (1.7 million hits)
    • Question List: 78 ms (1.1 million hits)
    • Home page: 65 ms

    …It’s definitely worth noting that these servers run at very low utilization. Those web servers average between 5–15% CPU, 15.5 GB of RAM used and 20–40 Mb/​​​​s network traffic. The SQL servers average around 5–10% CPU, 365 GB of RAM used, and 100–200 Mb/​​​​s of network traffic.

    …The primary reason the utilization is so low is efficient code…Now that we know how Stack Overflow performs on its current hardware, next time we can see why we don’t run in the cloud.



  11. 2013-grace.pdf#miri: ⁠, Katja Grace (2013-12-09; ai, ai/scaling, cs, economics/experience-curve):

    We examine evidence of progress in 6 areas of algorithms research [SAT, chess+Go, factoring, physics simulations, linear programming+scheduling, machine learning], with an eye to understanding likely algorithmic trajectories after the advent of artificial general intelligence. Many of these areas appear to experience fast improvement, though the data are often noisy. For tasks in these areas, gains from algorithmic progress have been roughly 50 to 100% as large as those from hardware progress. Improvements tend to be incremental, forming a relatively smooth curve on the scale of years.

    [See also ⁠, Järvisalo et al 2012; ⁠, Fichte et al 2020]

  12. ⁠, Johannes K. Fichte, Markus Hecher, Stefan Szeider (2020-08-05; economics  /​ ​​ ​experience-curve⁠, cs⁠, ai):

    We compare the impact of hardware advancement and algorithm advancement for SAT solving over the last two decades. In particular, we compare 20-year-old SAT-solvers on new computer hardware with modern SAT-solvers on 20-year-old hardware. Our findings show that the progress on the algorithmic side has at least as much impact as the progress on the hardware side.

  13. Self-decrypting-files#2019-unlocking-of-lcs35

  14. ⁠, hippke (2020-08-05; economics  /​ ​​ ​experience-curve⁠, ai  /​ ​​ ​scaling):

    How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer “Deep Blue” to win over world champion Garry Kasparov in 1997, today’s program achieves the same ELO level on a 486-DX4-100 MHz from 1994. In contrast, the scaling of neural network chess algorithms to slower hardware is worse (and more difficult to implement) compared to classical algorithms. Similarly, future algorithms will likely be able to better leverage today’s hardware by 2–3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality.

    …We may wonder: How do modern (better) chess algorithms perform on slower hardware? I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engines. I simulated 10k matches of SF8 against slower versions of itself and a ⁠, using cutechess-cli. In these benchmarks, I varied the total number of nodes to be searched during each game. I kept the RAM constant (this may be unrealistic for very old machines, see below). By assuming a fixed thinking time per game, the experiments scale out to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS; and finally, MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter up to 2 years. I estimate the error for the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results measure in orders of magnitude, so that these errors are small in comparison (<10%).

    Results: SF8 achieves Kasparov’s 2850 ELOs running on a 486-100 MHz introduced in 1994, three years before the Kasparov-Deep Blue match. These ELOs refer to tournament conditions as in the 1997 IBM games. In other words, with today’s algorithms, computers would have beaten the world chess champion already in 1994 on a contemporary desktop computer (not a supercomputer). [Followups: “A closer look at chess scalings (into the past)”⁠/“Benchmarking an old chess engine on new hardware”: “How much more compute does Stockfish 3 require to match Stockfish 13? Answer: 32× (uncertainty: 30–35×)…Interpretation: If we accept SF as amongst the very best chess programs in the last decade, we can make a more general assessment of chess compute vs. algorithm. Compute explains 30–50% of the computer chess ELO progress; algorithm improvements explain 50–70%.”]
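    The Elo arithmetic behind such cross-correlations is compact; under the standard logistic model (a sketch, not hippke’s code), a win probability maps to a rating gap and back as follows:

```python
import math

def elo_gap(p):
    """Rating gap implied by a win probability p under the logistic
    Elo model: p = 1 / (1 + 10 ** (-gap / 400))."""
    return 400 * math.log10(p / (1 - p))

def win_prob(gap):
    """Inverse: expected score against an opponent `gap` points weaker."""
    return 1 / (1 + 10 ** (-gap / 400))
```

    A 2⁄3 win rate, for example, corresponds to a gap of about 120 Elo.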

    [Figure: Chess Elo vs. MIPS, 1990–2020]

  15. 2007-drepper.pdf: ⁠, Ulrich Drepper (2007-11-12; cs):

    As CPU cores become increasingly faster, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever-more sophisticated memory-handling and memory-acceleration techniques—such as CPU caches—but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.

    [Some parts are obsolete as of 2017.]
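    The core of Drepper’s argument can be sketched even from a high-level language (my illustration, not his; in CPython the interpreter overhead masks much of the effect, whereas in C the per-element slowdown with stride is dramatic): same arithmetic per element touched, very different memory traffic:

```python
from array import array

def sum_strided(a, stride):
    """Touch every `stride`-th element. The arithmetic per touched
    element is constant, but once the stride exceeds a 64-byte cache
    line (8 doubles), every access drags in a full line and uses only
    8 bytes of it -- memory traffic, not the adds, sets the pace."""
    total = 0.0
    for i in range(0, len(a), stride):
        total += a[i]
    return total

data = array('d', [1.0]) * (1 << 20)   # 2^20 doubles = 8 MiB, larger than a typical L2
```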

  16. ⁠, Poul-Henning Kamp (2010-06-11):

    [Discussion of an optimization Poul-Henning Kamp (PHK) made to Varnish, switching to an obscurer data structure which has better cache performance. This switch singlehandedly made a highly-optimized piece of software a good third faster.

    Varnish is a web server technology which is used to store millions or billions of web pages/​​​​files in RAM to serve them ultra-fast, instead of falling back to a regular web server which may need to pull a file off the disk or use laborious computations to create the necessary web page. This is a straightforward high-volume task which should be bottlenecked on network and CPU—copying parts of RAM to the network as fast as possible.

    PHK observed that Varnish server CPUs were not a bottleneck, and almost entirely idle. Why? Because it was spending most of its time tracking what files were in RAM, to make sure the copies weren’t too old & obsolete. This tracking was conveniently & efficiently implemented as a binary heap.

    Unfortunately, a Varnish server wants to spend all its memory on the actual files it’s there to send, forcing much of the binary heap to be stashed on disk. This is bad because the binary heap assumes that all of it is in memory and optimizes for number of queries, spreading itself across a lot of memory—except because it is getting pushed out to disk, that means that each lookup may require going out to disk. So while the binary heap requires the minimum number of queries to figure out what objects are obsolete, each query is a worst-case query, unexpectedly taking milliseconds or entire seconds!

    The solution is to take a binary heap and rearrange it into a B-heap to keep nearby entries on the same part of disk, instead of scattered across the disk. That way, when Varnish is forced to go out to disk to check obsoleteness, it at least only has to do a few queries on a single part of the disk which can be read en masse, and can get back to useful work more quickly.]
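    A back-of-the-envelope model (my numbers, not PHK’s) makes the win quantitative: count the distinct memory pages a root-to-leaf walk touches when the heap is a flat array, versus when each page holds a whole subtree, assuming 4 KiB pages of 512 heap entries:

```python
import math

ENTRIES_PER_PAGE = 512              # 4 KiB page / 8-byte heap entries

def path_to_root(leaf):
    """Indices visited walking a 1-indexed binary heap from leaf to root."""
    while leaf >= 1:
        yield leaf
        leaf //= 2

def pages_binary_heap(leaf):
    """Distinct pages touched by the walk when the heap is one flat
    array: past the first 9 levels, every level lands on its own page."""
    return len({i // ENTRIES_PER_PAGE for i in path_to_root(leaf)})

def pages_b_heap(leaf):
    """Idealized B-heap: each page holds a subtree log2(512) = 9 levels
    deep, so a walk of depth d crosses only ceil(d / 9) pages."""
    depth = leaf.bit_length()
    return math.ceil(depth / int(math.log2(ENTRIES_PER_PAGE)))
```

    For a million-entry heap that is mostly paged out, each distinct page can mean a disk read, so 13 pages versus 3 per expiry check is the entire fight.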



  19. ⁠, Michael Axtmann, Sascha Witt, Daniel Ferizovic, Peter Sanders (2020-09-28):

    We present sorting algorithms that represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. A part of the speed advantage is due to the feature to work in-place. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our comparison-based algorithm, In-place Superscalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best in-place parallel comparison-based competitor by almost a factor of three. IPS4o also outperforms the best comparison-based competitors in the in-place or not in-place, parallel or sequential settings. IPS4o even outperforms the best integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new in-place radix sorter turns out to be the best algorithm. Claims to have the, in some sense, “best” sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on extensive experiments involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.
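    The samplesort idea itself is simple; a toy recursive sketch (my own, ignoring everything that makes IPS4o fast: in-place blockwise distribution, branchless classification trees, parallelism) might look like:

```python
import bisect, random

def samplesort(a, k=8, cutoff=32):
    """Toy recursive samplesort: k-1 random splitters define k buckets,
    each element is classified by binary search, and the buckets are
    sorted recursively. (Equal-key handling here is a crude fallback;
    IPS4o instead gives heavy duplicates their own buckets.)"""
    if len(a) <= cutoff:
        return sorted(a)
    splitters = sorted(random.sample(a, k - 1))
    buckets = [[] for _ in range(k)]
    for x in a:
        buckets[bisect.bisect_right(splitters, x)].append(x)
    if max(len(b) for b in buckets) == len(a):   # degenerate split: many equal keys
        return sorted(a)
    return [x for b in buckets for x in samplesort(b, k, cutoff)]
```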

  20. Sparsity





  25. 1985-michie.pdf: “Human Window on the World”⁠, Donald Michie

  26. ⁠, Aleksei Petrenko, Zhehui Huang, Tushar Kumar, Gaurav Sukhatme, Vladlen Koltun (2020-06-21):

    Increasing the scale of experiments has allowed researchers to achieve unprecedented results in both training sophisticated agents for video games, and in sim-to-real transfer for robotics. Typically such experiments rely on large distributed systems and require expensive hardware setups, limiting wider access to this exciting area of research. In this work we aim to solve this problem by optimizing the efficiency and resource utilization of reinforcement learning algorithms instead of relying on distributed computation. We present the “Sample Factory”, a high-throughput training system optimized for a single-machine setting. Our architecture combines a highly efficient, asynchronous, -based sampler with off-policy correction techniques, allowing us to achieve throughput higher than 10⁵ environment frames/​second on non-trivial control problems in 3D without sacrificing sample efficiency. We extend Sample Factory to support self-play and population-based training and apply these techniques to train highly capable agents for a multiplayer first-person shooter game. The source code is available at https://github.com/alex-petrenko/sample-factory

  27. ⁠, Aleksei Petrenko, Erik Wijmans, Brennan Shacklett, Vladlen Koltun (2021-07-17):

    We present Megaverse, a new 3D simulation platform for reinforcement learning and embodied AI research. The efficient design of our engine enables physics-based simulation with high-dimensional egocentric observations at more than 1,000,000 actions per second on a single 8-GPU node. Megaverse is up to 70× faster than DeepMind Lab in fully-shaded 3D scenes with interactive objects. We achieve this high simulation performance by leveraging batched simulation, thereby taking full advantage of the massive parallelism of modern GPUs. We use Megaverse to build a new benchmark that consists of several single-agent and multi-agent tasks covering a variety of cognitive challenges. We evaluate model-free RL on this benchmark to provide baselines and facilitate future research. The source code is available at https://www.megaverse.info

  28. ⁠, C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, Olivier Bachem (2021-06-24):

    We present Brax, an open source library for rigid body simulation with a focus on performance and parallelism on accelerators, written in JAX. We present results on a suite of tasks inspired by the existing reinforcement learning literature, but remade in our engine. Additionally, we provide reimplementations of ⁠, ⁠, ES, and direct policy optimization in JAX that compile alongside our environments, allowing the learning algorithm and the environment processing to occur on the same device, and to scale seamlessly on accelerators. Finally, we include notebooks that facilitate training of performant policies on common Gym MuJoCo-like tasks in minutes.

  29. ⁠, Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, Gavriel State (2021-08-24):

    Isaac Gym offers a high performance learning platform to train policies for a wide variety of robotics tasks directly on GPU. Both physics simulation and the neural network policy training reside on GPU and communicate by directly passing data from physics buffers to PyTorch tensors without ever going through any CPU bottlenecks. This leads to blazing fast training times for complex robotics tasks on a single GPU with 1–2 orders of magnitude improvements compared to conventional RL training that uses a CPU based simulator and GPU for neural networks. We host the results and videos at https://sites.google.com/view/isaacgym-nvidia and Isaac Gym can be downloaded at https://developer.nvidia.com/isaac-gym⁠.

  30. ⁠, Tian Lan, Sunil Srinivasa, Stephan Zheng (2021-08-31):

    Deep reinforcement learning (RL) is a powerful framework to train decision-making models in complex dynamical environments. However, RL can be slow as it learns through repeated interaction with a simulation of the environment. Accelerating RL requires both algorithmic and engineering innovations. In particular, there are key systems engineering bottlenecks when using RL in complex environments that feature multiple agents or high-dimensional state, observation, or action spaces, for example. We present WarpDrive, a flexible, lightweight, and easy-to-use open-source RL framework that implements end-to-end multi-agent RL on a single GPU (Graphics Processing Unit), building on PyCUDA and PyTorch. Using the extreme parallelization capability of GPUs, WarpDrive enables orders-of-magnitude faster RL compared to common implementations that blend CPU simulations and GPU models. Our design runs simulations and the agents in each simulation in parallel. It eliminates data copying between CPU and GPU. It also uses a single simulation data store on the GPU that is safely updated in-place. Together, this allows the user to run thousands of concurrent multi-agent simulations and train on extremely large batches of experience. For example, WarpDrive yields 2.9 million environment steps/​​​​second with 2000 environments and 1000 agents (at least 100× higher throughput compared to a CPU implementation) in a benchmark Tag simulation. WarpDrive provides a lightweight Python interface and environment wrappers to simplify usage and promote flexibility and extensions. As such, WarpDrive provides a framework for building high-throughput RL systems.
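    The architectural idea—one big batch of environments advanced in a single pass over flat, in-place-updated buffers rather than one Python object per environment—can be caricatured in a few lines (my sketch; plain lists stand in for WarpDrive’s GPU-resident tensors):

```python
def step_all(positions, velocities, actions, dt=0.5):
    """Advance every toy point-mass environment in one pass over flat
    batch buffers, updating them in place: no per-environment objects
    and no CPU<->GPU copies. WarpDrive runs the analogous update as one
    GPU thread per agent on tensors that never leave the GPU."""
    for i in range(len(positions)):
        velocities[i] += actions[i] * dt
        positions[i] += velocities[i] * dt
    return positions, velocities
```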

  31. ⁠, Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun, Kayvon Fatahalian (2021-03-12):

    We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single 8-GPU machine.

    The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of “batch simulation”: accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput.

    To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches.

    By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a state-of-the-art system using a 64-GPU cluster over 3 days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.

  32. ⁠, Andy L. Jones (2021-04-07):

    The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well.

    By conducting a sequence of experiments using AlphaZero and the board game Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that the test-time and train-time compute available to an agent can be traded off while maintaining performance.

    Figure 5: Each training run (each faint line) of each differently-sized agent follows a sigmoid, starting at random play and progressing up to some plateau. The frontiers (dark lines) formed by taking a maximum across training runs have a similar form across board sizes (colors).
    Figure 6: The compute-performance frontier follows the same sigmoid for each board size 3 through 9, just scaled and shifted. The dotted lines give the fitted curves.
    1. Slope: The slope of the incline is 500 Elo per order of magnitude increase in compute.

      A more memorable interpretation is that if you are in the linearly-increasing regime, then you will need about 2× as much compute as your opponent to beat them 2⁄3 of the time.

    2. Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.

    3. Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.

    4. Random play: Finally, the distance between random play and perfect play increases by 500 Elo for each increment of board size.

      Unlike the other quantities mentioned previously, the distance between random and perfect play is a property of the game itself rather than of the agent.

    Train-test trade-off: So far we have focused on the compute budget during training, but another pertinent budget is the compute spent during evaluation. All the results discussed previously have used a tree search of size 64 during evaluation, the same as used during training. But there is no reason that the train-time search and test-time search have to be the same size, and so by varying the size of the test-time compute budget we can see in Figure 8 that larger tree searches at test time can substantially improve the performance of an agent.

    Knowing now that compute can be spent in 2 places, at train time and test time, the immediate question is: how do these 2 budgets trade off? This is illustrated in Figure 9, which shows that the trade-off is linear in log-compute: for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated, down to a floor of a single-node tree search…the simple relationship between compute at train time and compute at test time was originally surprising to us. Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could easily substitute for the other. On reflection however, we believe the key distinction is that an optimization at test-time needs only optimise over one sample, while train-time compute meanwhile must optimise over the entire distribution of samples.

    Figure 9: The trade-off between train-time compute and test-time compute. Each dotted line gives the minimum train-test compute required for a certain Elo on a 9 × 9 board.

    …the way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2⁄3 [as you go from 1:1, half the total numbers drawn, to 2:1, or 2/(2+1)—as if each tree search were an independent lottery ticket]. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.
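    The toy model is easy to check by simulation (a sketch, not the paper’s code):

```python
import random

def win_rate(n1, n2, trials=100_000):
    """Toy model: each player draws as many uniform random numbers as
    it has compute; the single highest draw wins. The exact win
    probability for player 1 is n1 / (n1 + n2)."""
    wins = 0
    for _ in range(trials):
        p1 = max(random.random() for _ in range(n1))
        p2 = max(random.random() for _ in range(n2))
        wins += p1 > p2
    return wins / trials
```

    With twice the compute (n1 = 2, n2 = 1), the simulated win rate converges on 2⁄3, matching the observed scaling.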

    Second, both the relation of performance to board size and the relation of performance to compute are smooth. Before embarking on this project, a key unknown was whether performance would show any ‘spikes’ with regards to compute or board size. A spike with regards to compute might indicate the model had achieved some key insight, while a spike with regards to board size might indicate a minimum complexity past which key insights are available for the model to discover. As is however, models’ performance changes smoothly and predictably with both increased compute and increased complexity.


  34. ⁠, Sam Greydanus (2020-12-01):

    …Yet in spite of its historical importance, MNIST has three notable shortcomings. First, it does a poor job of differentiating between linear, nonlinear, and translation-invariant models. For example, logistic, MLP, and CNN benchmarks obtain 94, 99+, and 99+% accuracy on it. This makes it hard to measure the contribution of a CNN’s spatial priors or to judge the relative effectiveness of different regularization schemes. Second, it is somewhat large for a toy dataset. Each input example is a 784-dimensional vector and thus it takes a non-trivial amount of computation to perform hyperparameter searches or debug a meta-learning loop. Third, MNIST is hard to hack. The ideal toy dataset should be procedurally generated so that researchers can smoothly vary parameters such as background noise, translation, and resolution.

    In order to address these shortcomings, we propose the MNIST-1D dataset. It is a minimalist, low-memory, and low-compute alternative to MNIST, designed for exploratory deep learning research where rapid iteration is a priority. Training examples are 20 times smaller but they are still better at measuring the difference between (1) linear and nonlinear classifiers and (2) models with and without spatial inductive biases (eg. translation invariance). The dataset is procedurally generated but still permits analogies to real-world digit classification…Unlike MNIST, each example is a one-dimensional sequence of points. To generate an example, we begin with a digit template and then randomly pad, translate, and transform it.
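    The generation procedure described above (template → random pad/translate → transform) fits in a few lines; this is an illustrative simplification of my own, not the actual MNIST-1D code, whose transforms are richer:

```python
import random

def make_example(template, length=40, noise=0.1):
    """One MNIST-1D-style training example: place a 1-D digit template
    at a random offset in a zero-padded sequence, then add Gaussian
    noise. (The real generator also applies smoothing and nonlinear
    warps; this is just the skeleton of the idea.)"""
    offset = random.randrange(length - len(template))
    x = [0.0] * length
    for i, v in enumerate(template):
        x[offset + i] = v
    return [v + random.gauss(0.0, noise) for v in x]
```

    Because every knob (length, noise, transform strength) is a parameter, a researcher can smoothly vary task difficulty—the property the authors find missing in MNIST.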

    Example use cases: In this section we will explore several examples of how MNIST-1D can be used to study core “science of deep learning” phenomena.

    1. Finding lottery tickets…Unlike many follow-up experiments on the lottery ticket, this one took just two days of researcher time to produce. The curious reader can also reproduce these results in their browser in a few minutes.

    2. Observing deep double descent…We see the MNIST-1D dataset as a good tool for exploring these properties. In fact, we were able to reproduce the pattern after a few hours of researcher effort. The figure below shows our results for a fully-connected network and a convolutional model.

    3. Gradient-based meta-learning…A model does this by having two levels of optimization: the first is a fast inner loop which corresponds to a traditional learning objective and the second is a slow outer loop which updates the “meta” properties of the learning process…Meta-learning is a promising topic but it is very difficult to scale. First of all, meta-learning algorithms consume enormous amounts of time and compute. Second of all, implementations tend to grow complex since there are twice as many hyperparameters (one set for each level of optimization) and most deep learning frameworks are not set up well for meta-learning. This places an especially high incentive on debugging and iterating meta-learning algorithms on small-scale datasets such as MNIST-1D. For example, it took just a few hours to implement and debug the gradient-based hyperparameter optimization of a learning rate shown below.

      • Meta-learning an activation function: Having implemented a “minimal working example” of gradient-based meta-learning, we realized that it permitted a simple and novel extension: meta-learning an activation function. With a few more hours of researcher time, we were able to parameterize our classifier’s activation function with a second neural network and then learn the weights using meta-gradients.
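
    A minimal sketch of this two-level setup (a toy illustration, not the post’s implementation: the quadratic objective, step counts, and meta-learning rate are all invented), tuning a learning rate by numerically differentiating the validation loss that the inner loop reaches:

    ```python
    def inner_loop(lr, w0=5.0, steps=20):
        """Fast inner loop: plain gradient descent on the toy loss L(w) = w^2."""
        w = w0
        for _ in range(steps):
            w -= lr * 2 * w       # dL/dw = 2w
        return w ** 2             # "validation" loss after inner training

    # Slow outer loop: update the learning rate itself, using a central
    # finite difference as a cheap stand-in for true meta-gradients.
    lr, meta_lr, eps = 0.01, 1e-4, 1e-4
    for _ in range(50):
        grad = (inner_loop(lr + eps) - inner_loop(lr - eps)) / (2 * eps)
        lr -= meta_lr * grad

    print(f"meta-learned lr: {lr:.3f}")   # larger than the initial 0.01, with lower loss
    ```
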
    4. Measuring the spatial priors of deep networks: …Principal among these priors is the translation invariance of convolution. A primary motivation for this dataset was to construct a toy problem that could effectively quantify a model’s spatial priors. The second figure in this post illustrates that this is indeed possible with MNIST-1D.

    5. Benchmarking pooling methods. Our final case study begins with a specific question: What is the relationship between pooling and sample efficiency? We had not seen evidence that pooling makes models more or less sample efficient, but this seemed an important relationship to understand. With this in mind, we trained models with different pooling methods and training set sizes and found that, while pooling tended to be effective in low-data regimes, it did not make much of a difference in high-data regimes.

    …this post argues in favor of small-scale machine learning research. Neural networks do not have problems with scaling or performance—but they do have problems with interpretability, reproducibility, and iteration speed. We see carefully-controlled, small-scale experiments as a great way to address these problems…For example, several of the findings reported in this post are at the point where they should be investigated at scale. We would like to show that large-scale lottery tickets also learn spatial inductive biases, and show evidence that they develop local connectivity. We would also like to try meta-learning an activation function on a larger model in the hopes of finding an activation that will outperform ReLU and Swish in generality. We should emphasize that we are only ready to scale these results now that we have isolated and understood them in a controlled setting. We believe that scaling a system is only a good idea once the relevant causal mechanisms have been isolated and understood. [cf scaling law papers] …Our work also bears philosophical similarities to the ⁠.

    Closing Thoughts: There is a counterintuitive possibility that in order to explore the limits of how large we can scale neural networks, we may need to explore the limits of how small we can scale them first. Scaling models and datasets downward in a way that preserves the nuances of their behaviors at scale will allow researchers to iterate quickly on fundamental and creative ideas. This fast iteration cycle is the best way of obtaining insights about how to incorporate progressively more complex inductive biases into our models. We can then transfer these inductive biases across spatial scales in order to dramatically improve the sample efficiency and generalization properties of large-scale models. We see the humble MNIST-1D dataset as a first step in that direction.

  35. ⁠, Matteo Hessel, Manuel Kroiss, Aidan Clark, Iurii Kemaev, John Quan, Thomas Keck, Fabio Viola, Hado van Hasselt (2021-04-13):

    Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive parts of training and inference in modern deep learning systems. Popular training pipelines that use these frameworks for deep learning typically focus on (un-)supervised learning. How to best train reinforcement learning (RL) agents at scale is still an active research area.

    In this report we argue that TPUs are particularly well suited for training RL agents in a scalable, efficient and reproducible way. Specifically we describe two architectures designed to make the best use of the resources available on a TPU Pod (a special configuration in a Google data center that features multiple TPU devices connected to each other by extremely low latency communication channels).

    Figure 4: (a) FPS for Anakin, as a function of the number of TPU cores, ranging from 16 (ie. 2 replicas) to 128 (ie. 16 replicas). (b) FPS for a Sebulba implementation of IMPALA’s V-trace algorithm, as a function of the actor batch size, from 32 (as in IMPALA) to 128. (c) FPS for a Sebulba implementation of MuZero, as a function of the number of TPU cores, from 16 (ie. 2 replicas) to 128 (ie. 16 replicas).

    Anakin: When using small neural networks and grid-world environments an Anakin architecture can easily perform 5 million steps per second, even on the 8-core TPU accessible for free through Google Colab.

    This can be very useful to experiment and debug research ideas in the friendly Colab environment. In Figure 4a we show how, thanks to the efficient network connecting different TPU cores in a Pod, performance scales almost linearly with the number of cores; the collective operations used to average gradients across replicas appear to cause only minimal overhead.

    In a recent paper, Anakin was used at a much larger scale to discover a general reinforcement learning update from experience of interacting with a rich set of environments implemented in JAX. In this paper, Anakin was used to learn a single shared update rule from 60K JAX environments and 1K policies running and training in parallel.

    Despite the complex nature of the system, based on the use of neural networks to meta-learn not just a policy but the entire RL update, Anakin delivered over 3 million steps per second on a 16-core TPU. Training the update rule to a good level of performance required running Anakin for approximately 24 hours; this would cost approximately $100 on GCP’s preemptible instances.
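
    The quoted throughput and price imply the following totals (a back-of-the-envelope check using only the numbers in the report):

    ```python
    steps_per_second = 3_000_000   # Anakin on a 16-core TPU, per the report
    hours, cost_usd = 24, 100      # training time and preemptible-instance cost

    total_steps = steps_per_second * 3600 * hours
    print(f"{total_steps:.2e} steps")                                  # 2.59e+11
    print(f"${cost_usd / (total_steps / 1e9):.2f} per billion steps")  # $0.39
    ```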

    Sebulba: Our second podracer architecture has also been extensively used for exploring a variety of RL ideas at scale, on environments that cannot be compiled to run on TPU (eg. Atari, DMLab and MuJoCo). As both IMPALA and Sebulba are based on a decomposition between actors and learners, agents designed for the IMPALA architecture can be easily mapped onto Sebulba; for instance a Podracer version of the V-trace agent easily reproduced the results from Espeholt et al 2018. However, we found that training an agent for 200 million frames of an Atari game could be done in just ~1 hour, by running Sebulba on an 8-core TPU. This comes at a cost of approximately $2.88, on GCP’s pre-emptible instances. This is similar in cost to training with the more complex framework, and much cheaper than training an agent for 200 million Atari frames using either IMPALA or a single-stream GPU-based system such as that traditionally used by ⁠.

    …In addition to the trajectory length, the effective batch size used to compute each update also depends on how many times we replicate the basic 8-TPU setup. Sebulba also scales effectively along this dimension: using 2048 TPU cores (an entire Pod) we were able to further scale all the way to 43 million frames per second, solving the classic Atari videogame Pong in less than 1 minute…Sebulba has also been used to train search-based agents inspired by MuZero (Schrittwieser et al 2020). The workloads associated with these agents are very different from that of model-free agents like IMPALA. The key difference is in the cost of action selection. This increases because MuZero’s policy combines search with deep neural networks (used to guide and/​​​​or truncate the search). Typically, search-based agents like MuZero have required custom C++ implementations of the search to deliver good performance. We could reproduce results from MuZero (no Reanalyse) on multiple RL environments, using Sebulba and a pure JAX implementation of ⁠. Training a MuZero agent with Sebulba for 200M Atari frames takes 9 hours on a 16-core TPU (at a cost of ~$40 on GCP’s preemptible instances).

    We found that scalability, via replication, was particularly useful in this context. Figure 4c reports the number of frames per seconds processed by Sebulba when running MuZero on Atari, as a function of the number of TPU cores. The throughput increased linearly with the number of cores.


  37. ⁠, Stuart Cheshire (2001):

    [Seminal essay explaining why the rollout of “broadband” home connections to replace 56k dialups had not improved regular WWW browsing as much as people expected: while broadband had greater throughput, it had similar (or worse) latency.

    Because much of the wallclock time of any Internet connection is spent setting up and negotiating with the other end, and not that much is spent on the raw transfer of large numbers of bytes, the speedup is far smaller than one would expect by dividing the respective bandwidths.

    Further, while bandwidth/​throughput is easy to improve by adding more or higher-quality connections, and can be patched elsewhere in the system by adding parallelism, upgrading parts, or investing in data compression, the latency-afflicted steps are stubbornly serial: any time lost is physically impossible to retrieve, and many steps are inherently limited by the speed of light. More capacious connections therefore quickly reach the regime where the difficult-to-improve serial latency-bound steps dominate the overall task. As Cheshire summarizes it:]

    1. Fact One: Making more bandwidth is easy.
    2. Fact Two: Once you have bad latency you’re stuck with it.
    3. Fact Three: Current consumer devices have appallingly bad latency.
    4. Fact Four: Making limited bandwidth go further is easy.

    …That’s the problem with communications devices today. Manufacturers say “speed” when they mean “capacity”. The other problem is that as far as the end-user is concerned, the thing they want to do is transfer large files quicker. It may seem to make sense that a high-capacity slow link might be the best thing for the job. What the end-user doesn’t see is that in order to manage that file transfer, their computer is sending dozens of little control messages back and forth. The thing that makes computer communication different from television is interactivity, and interactivity depends on all those little back-and-forth messages.

    The phrase “high-capacity slow link” that I used above probably looked very odd to you. Even to me it looks odd. We’ve been used to wrong thinking for so long that correct thinking looks odd now. How can a high-capacity link be a slow link? High-capacity means fast, right? It’s odd how that’s not true in other areas. If someone talks about a “high-capacity” oil tanker, do you immediately assume it’s a very fast ship? I doubt it. If someone talks about a “large-capacity” truck, do you immediately assume it’s faster than a small sports car?

    We have to start making that distinction again in communications. When someone tells us that a modem has a speed of 28.8 kbit/​​​​sec we have to remember that 28.8 kbit/​​​​sec is its capacity, not its speed. Speed is a measure of distance divided by time, and ‘bits’ is not a measure of distance.

    I don’t know how communications came to be this way. Everyone knows that when you buy a hard disk you should check what its seek time is. The maximum transfer rate is something you might also be concerned with, but the seek time is definitely more important. Why does no one think to ask what a modem’s ‘seek time’ is? The latency is exactly the same thing. It’s the minimum time between asking for a piece of data and getting it, just like the seek time of a disk, and it’s just as important.
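
    The asymmetry between capacity and latency can be made concrete with a simple transfer-time model (a sketch; the page size, round-trip count, and link speeds are illustrative, not taken from the essay):

    ```python
    def transfer_time(size_bytes, bandwidth_bps, rtt_s, round_trips):
        """Wallclock time = serial round-trip setup cost + raw transfer time."""
        return round_trips * rtt_s + size_bytes * 8 / bandwidth_bps

    # A 500kB page needing 40 little back-and-forth control messages:
    dialup    = transfer_time(500_000, 56_000,    rtt_s=0.15, round_trips=40)
    broadband = transfer_time(500_000, 5_000_000, rtt_s=0.15, round_trips=40)
    print(f"dialup: {dialup:.1f}s, broadband: {broadband:.1f}s")
    # Capacity improved ~90x, yet the untouched serial latency term caps the speedup.
    ```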

  38. ⁠, Dan Luu (2017-12):

    I’ve had this nagging feeling that the computers I use today feel slower than the computers I used as a kid. As a rule, I don’t trust this kind of feeling because human perception has been shown to be unreliable in empirical studies, so I carried around a high-speed camera and measured the response latency of devices I’ve run into in the past few months. These are tests of the latency between a keypress and the display of a character in a terminal (see appendix for more details)…If we look at overall results, the fastest machines are ancient. Newer machines are all over the place. Fancy gaming rigs with unusually high refresh-rate displays are almost competitive with machines from the late 70s and early 80s, but “normal” modern computers can’t compete with thirty to forty year old machines.

    …Almost every computer and mobile device that people buy today is slower than common models of computers from the 70s and 80s. Low-latency gaming desktops and the iPad Pro can get into the same range as quick machines from thirty to forty years ago, but most off-the-shelf devices aren’t even close.

    If we had to pick one root cause of latency bloat, we might say that it’s because of “complexity”. Of course, we all know that complexity is bad. If you’ve been to a non-academic non-enterprise tech conference in the past decade, there’s a good chance that there was at least one talk on how complexity is the root of all evil and we should aspire to reduce complexity.

    Unfortunately, it’s a lot harder to remove complexity than to give a talk saying that we should remove complexity. A lot of the complexity buys us something, either directly or indirectly. When we looked at the input of a fancy modern keyboard vs. the Apple 2 keyboard, we saw that using a relatively powerful and expensive general purpose processor to handle keyboard inputs can be slower than dedicated logic for the keyboard, which would both be simpler and cheaper. However, using the processor gives people the ability to easily customize the keyboard, and also pushes the problem of “programming” the keyboard from hardware into software, which reduces the cost of making the keyboard. The more expensive chip increases the manufacturing cost, but considering how much of the cost of these small-batch artisanal keyboards is the design cost, it seems like a net win to trade manufacturing cost for ease of programming.

  39. ⁠, Dan Luu (2017-10-16; technology⁠, cs⁠, design):

    [Dan Luu continues his investigation of why computers feel so laggy and have such high latency compared to old computers (⁠, ⁠, ⁠, cf text editor analysis).

    He measures 21 keyboard latencies using a logic analyzer, finding a range of 15–60ms (!), representing a waste of a large fraction of the available ~100–200ms latency budget before a user notices and is irritated (“the median keyboard today adds as much latency as the entire end-to-end pipeline as a fast machine from the 70s.”). The latency estimates are surprising, and do not correlate with advertised traits. They simply have to be measured empirically.]

    We can see that, even with the limited set of keyboards tested, there can be as much as a 45ms difference in latency between keyboards. Moreover, a modern computer with one of the slower keyboards attached can’t possibly be as responsive as a quick machine from the 70s or 80s because the keyboard alone is slower than the entire response pipeline of some older computers. That establishes the fact that modern keyboards contribute to the latency bloat we’ve seen over the past forty years…Most keyboards add enough latency to make the user experience noticeably worse, and keyboards that advertise speed aren’t necessarily faster. The two gaming keyboards we measured weren’t faster than non-gaming keyboards, and the fastest keyboard measured was a minimalist keyboard from Apple that’s marketed more on design than speed.

  40. ⁠, Dan Luu (2017-07-18):

    These graphs show the distribution of latencies for each terminal. The y-axis has the latency in milliseconds. The x-axis is the percentile (eg. 50 represents the 50th-percentile keypress, ie. the median keypress). Measurements are with macOS unless otherwise stated. The graph on the left is when the machine is idle, and the graph on the right is under load. If we just look at median latencies, some setups don’t look too bad— and emacs-eshell are at roughly 5ms unloaded, small enough that many people wouldn’t notice. But most terminals (st, alacritty, hyper, and iterm2) are in the range where you might expect people to notice the additional latency even when the machine is idle. If we look at the tail when the machine is idle, say the 99.9%-ile latency, every terminal gets into the range where the additional latency ought to be perceptible, according to studies on user interaction. For reference, the internally generated keypress to GPU memory trip for some terminals is slower than the time it takes to send a packet from Boston to Seattle and back, about 70ms.

    …Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, benchmark the speed of sinking stdout or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink stdout many orders of magnitude faster than humans can keep up with, so further optimizing stdout throughput has a relatively small impact on actual user experience for most users. Likewise for reducing memory usage when an idle terminal uses 0.01% of the memory on my old and now quite low-end laptop. If you work on a terminal, perhaps consider relatively more latency and interactivity (eg., responsiveness to ^C) optimization and relatively less throughput and idle memory usage optimization.
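
    Tail statistics like these are easy to compute from raw samples (a sketch on synthetic latencies, not Luu’s measurement data; the distribution parameters are invented):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic keypress-to-display latencies (ms): a ~10ms typical case
    # plus a 1% slow tail, mimicking the median-vs-tail gap described above.
    samples = np.concatenate([rng.normal(10, 2, 9_900), rng.normal(80, 10, 100)])

    median, p999 = np.percentile(samples, [50, 99.9])
    print(f"median: {median:.1f}ms, 99.9%-ile: {p999:.1f}ms")
    ```

    The median looks fine; only the high percentiles reveal the latency spikes users actually notice.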

  41. ⁠, Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, Mark McGranaghan (Ink & Switch) (2019-04):

    [PDF version]

    Cloud apps like Google Docs and Trello are popular because they enable real-time collaboration with colleagues, and they make it easy for us to access our work from all of our devices. However, by centralizing data storage on servers, cloud apps also take away ownership and agency from users. If a service shuts down, the software stops functioning, and data created with that software is lost.

    In this article we propose “local-first software”: a set of principles for software that enables both collaboration and ownership for users. Local-first ideals include the ability to work offline and collaborate across multiple devices, while also improving the security, privacy, long-term preservation, and user control of data.

    We survey existing approaches to data storage and sharing, ranging from email attachments to web apps to Firebase-backed mobile apps, and we examine the trade-offs of each. We look at Conflict-free Replicated Data Types (CRDTs): data structures that are multi-user from the ground up while also being fundamentally local and private. CRDTs have the potential to be a foundational technology for realizing local-first software.

    We share some of our findings from developing local-first software prototypes at Ink & Switch over the course of several years. These experiments test the viability of CRDTs in practice, and explore the user interface challenges for this new data model. Lastly, we suggest some next steps for moving towards local-first software: for researchers, for app developers, and a startup opportunity for entrepreneurs.

    …in the cloud, ownership of data is vested in the servers, not the users, and so we became borrowers of our own data. The documents created in cloud apps are destined to disappear when the creators of those services cease to maintain them. Cloud services defy long-term preservation. No Wayback Machine can restore a sunsetted web application. The Internet Archive cannot preserve your Google Docs.

    In this article we explored a new way forward for software of the future. We have shown that it is possible for users to retain ownership and control of their data, while also benefiting from the features we associate with the cloud: seamless collaboration and access from anywhere. It is possible to get the best of both worlds.

    But more work is needed to realize the local-first approach in practice. Application developers can take incremental steps, such as improving offline support and making better use of on-device storage. Researchers can continue improving the algorithms, programming models, and user interfaces for local-first software. Entrepreneurs can develop foundational technologies such as CRDTs and peer-to-peer networking into mature products able to power the next generation of applications.
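
    The CRDT idea can be illustrated with its textbook minimal example, a grow-only counter (a generic sketch, not code from the article): each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order in which updates arrive.

    ```python
    class GCounter:
        """Grow-only counter CRDT: one slot per replica; merge = element-wise max."""
        def __init__(self, replica_id, n_replicas):
            self.id = replica_id
            self.slots = [0] * n_replicas

        def increment(self):
            self.slots[self.id] += 1   # each replica touches only its own slot

        def merge(self, other):
            # Commutative, associative, and idempotent, so sync order is irrelevant.
            self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

        def value(self):
            return sum(self.slots)

    # Two devices edit offline, then sync in both directions:
    laptop, phone = GCounter(0, 2), GCounter(1, 2)
    laptop.increment(); laptop.increment()
    phone.increment()
    laptop.merge(phone); phone.merge(laptop)
    assert laptop.value() == phone.value() == 3
    ```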

    • Motivation: collaboration and ownership

    • Seven ideals for local-first software

      • No spinners: your work at your fingertips
      • Your work is not trapped on one device
      • The network is optional
      • Seamless collaboration with your colleagues
      • The Long Now
      • Security and privacy by default
      • You retain ultimate ownership and control
    • Existing data storage and sharing models

      • How application architecture affects user experience
        • Files and email attachments
        • Web apps: Google Docs, Trello, Figma
        • Dropbox, Google Drive, Box, OneDrive, etc.
        • Git and GitHub
      • Developer infrastructure for building apps
        • Web app (thin client)
        • Mobile app with local storage (thick client)
        • Backend-as-a-Service: Firebase, CloudKit, Realm
        • CouchDB
    • Towards a better future

      • CRDTs as a foundational technology
      • Ink & Switch prototypes
        • Trello clone
        • Collaborative drawing
        • Media canvas
        • Findings
      • How you can help
        • For distributed systems and programming languages researchers
        • For Human-Computer Interaction (HCI) researchers
        • For practitioners
        • Call for startups
    • Conclusions

  42. 2019-kleppmann.pdf: ⁠, Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, Mark McGranaghan (Ink & Switch) (2019-10-23; cs):

    Cloud apps like Google Docs and Trello are popular because they enable real-time collaboration with colleagues, and they make it easy for us to access our work from all of our devices. However, by centralizing data storage on servers, cloud apps also take away ownership and agency from users. If a service shuts down, the software stops functioning, and data created with that software is lost.

    In this article we propose local-first software, a set of principles for software that enables both collaboration and ownership for users. Local-first ideals include the ability to work offline and collaborate across multiple devices, while also improving the security, privacy, long-term preservation, and user control of data.

    We survey existing approaches to data storage and sharing, ranging from email attachments to web apps to Firebase-backed mobile apps, and we examine the trade-offs of each. We look at Conflict-free Replicated Data Types (CRDTs): data structures that are multi-user from the ground up while also being fundamentally local and private. CRDTs have the potential to be a foundational technology for realizing local-first software.

    We share some of our findings from developing local-first software prototypes at the Ink & Switch research lab over the course of several years. These experiments test the viability of CRDTs in practice, and explore the user interface challenges for this new data model. Lastly, we suggest some next steps for moving towards local-first software: for researchers, for app developers, and a startup opportunity for entrepreneurs.

    [Keywords: collaboration software, mobile computing, data ownership, CRDTs, peer-to-peer communication]

  43. ⁠, Dan Luu (2017-02-08):

    A couple years ago, I took a road trip from Wisconsin to Washington and mostly stayed in rural hotels on the way. I expected the internet in rural areas too sparse to have cable internet to be slow, but I was still surprised that a large fraction of the web was inaccessible. Some blogs with lightweight styling were readable, as were pages by academics who hadn’t updated the styling on their website since 1995. But very few commercial websites were usable (other than Google). When I measured my connection, I found that the bandwidth was roughly comparable to what I got with a 56k modem in the 90s. The latency and packet loss were substantially worse than the average day on dialup: latency varied between 500ms and 1000ms and packet loss varied between 1% and 10%. Those numbers are comparable to what I’d see on dialup on a bad day.

    Despite my connection being only a bit worse than it was in the 90s, the vast majority of the web wouldn’t load…When Microsoft looked at actual measured connection speeds, they found that half of Americans don’t have broadband speed. Heck, AOL had 2 million dial-up subscribers in 2015, just AOL alone. Outside of the U.S., there are even more people with slow connections. I recently chatted with Ben Kuhn, who spends a fair amount of time in Africa, about his internet connection:

    I’ve seen ping latencies as bad as ~45 sec and packet loss as bad as 50% on a mobile hotspot in the evenings from Jijiga, Ethiopia. (I’m here now and currently I have 150ms ping with no packet loss but it’s 10am). There are some periods of the day where it ~never gets better than 10 sec and ~10% loss. The internet has gotten a lot better in the past ~year; it used to be that bad all the time except in the early mornings.

    …Let’s load some websites that programmers might frequent with a variety of simulated connections to get data on page load times…The timeout for tests was 6 minutes; anything slower than that is listed as FAIL. Pages that failed to load are also listed as FAIL. A few things that jump out from the table are:

    1. A large fraction of the web is unusable on a bad connection. Even on a good (0% packet loss, no ping spike) dialup connection, some sites won’t load…If you were to look at the 90%-ile results, you’d see that most pages fail to load on dialup and the “Bad” and “😱” connections are hopeless for almost all sites.
    2. Some sites will use a lot of data!

    …The flaw in the “page weight doesn’t matter because average speed is fast” [claim] is that if you average the connection of someone in my apartment building (which is wired for 1Gbps internet) and someone on 56k dialup, you get an average speed of 500 Mbps. That doesn’t mean the person on dialup is actually going to be able to load a 5MB website. The average speed of 3.9 Mbps comes from a 2014 Akamai report, but it’s just an average. If you look at Akamai’s 2016 report, you can find entire countries where more than 90% of IP addresses are slower than that! …“Use bcrypt” has become the mantra for a reasonable default if you’re not sure what to do when storing passwords. The web would be a nicer place if “use webpagetest” caught on in the same way. It’s not always the best tool for the job, but it sure beats the current defaults.
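
    Luu’s point about averages is easy to check with his own numbers (illustrative arithmetic only):

    ```python
    gigabit_mbps = 1000.0   # apartment building wired for 1Gbps
    dialup_mbps = 0.056     # 56k modem

    mean = (gigabit_mbps + dialup_mbps) / 2
    print(f"average speed: {mean:.0f} Mbps")   # ~500 Mbps, as the post says

    # But the dialup user's actual load time for a 5MB page is set by their link:
    seconds = 5 * 8 / dialup_mbps
    print(f"5MB page on dialup: {seconds / 60:.0f} minutes")   # ~12 minutes
    ```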