“You can have a second computer once you’ve shown you know how to use the first one.”
- “Welcome to the Jungle” (CPU design and Moore’s law; cf. “How Many Computers In Your Computer?”)
- “There’s plenty of room at the Top: What will drive computer performance after Moore’s law?”, Leiserson et al 2020 (gains from End-To-End Principle-based systems design)
- “Scalability! But at what COST?”, McSherry et al 2015 (when laptops > clusters: the importance of good baselines; most “Big Data” fits on a laptop)
- “What it takes to run Stack Overflow”, Nick Craver
- “Parsing Gigabytes of JSON per Second”, Langdale & Lemire 2019
- 55 GiB/s FizzBuzz; “Modern storage is plenty fast. It is the APIs that are bad.”; “10M IOPS, one physical core. #io_uring #linux”; “Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation” (see the io_uring sketch after this list)
- Ultra-low bit-rate audio for tiny podcasts with Codec2 and WaveNet decoders (1 hour = 1MB)
- “The computers are fast, but you don’t know it” (replacing slow pandas with 6,000× faster parallelized C++; see the parallel-reduction sketch after this list)
- “Scaling SQLite to 4M QPS on a Single Server (EC2 vs Bare Metal)”, Expensify
- “8088 Domination”: “how I created 8088 Domination, which is a program that displays fairly decent full-motion video on a 1981 IBM PC” / part 2
- “A Mind Is Born” (256 bytes; for comparison, this line of Markdown linking the demo is 194 bytes long, and the video several million bytes)
- Casey Muratori: refterm v2, Witness pathing (summary)
- “Algorithmic Progress in 6 Domains”, Grace 2013
- “A Time Leap Challenge for SAT Solving”, Fichte et al 2020 (hardware overhang: ~2.5× performance increase in SAT solving since 2000, about equally due to software & hardware gains, although slightly more software—new software on old hardware beats old on new.)
- 2019 Unlocking of the LCS35 Challenge
- “Measuring hardware overhang”, hippke (“with today’s algorithms, computers would have beat the world chess champion already in 1994 on a contemporary desk computer”)
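The storage links above all make one point: the bottleneck is the one-blocking-syscall-per-operation API, not the device, and io_uring addresses it by batching submissions and completions through shared rings. Below is a minimal sketch of that submission/completion pattern using liburing; the file path, queue depth, and block size are arbitrary placeholders, and the linked 10M-IOPS setups layer further optimizations (such as registered buffers and polled I/O) on top of this basic shape.

```cpp
// Batched asynchronous reads with io_uring via liburing (Linux 5.1+).
// Build: g++ -O2 -std=c++17 uring_read.cpp -luring
// Stages QD reads at different offsets, submits the whole batch with one
// syscall, then reaps the completions.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    const unsigned QD  = 32;       // queue depth: requests in flight at once
    const size_t   BLK = 4096;     // bytes per read
    // Hypothetical test file; pass any file of at least QD*BLK bytes as argv[1].
    const char *path = argc > 1 ? argv[1] : "/tmp/data.bin";

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    io_uring ring;
    if (io_uring_queue_init(QD, &ring, 0) < 0) {
        std::fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    std::vector<std::vector<char>> bufs(QD, std::vector<char>(BLK));

    // Stage QD read requests in the submission queue; no syscalls happen yet.
    for (unsigned i = 0; i < QD; i++) {
        io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i].data(), BLK, (off_t)i * BLK);
        io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i);  // tag to match completions
    }
    io_uring_submit(&ring);          // one syscall submits all QD requests

    // Reap completions; they may arrive in any order.
    for (unsigned i = 0; i < QD; i++) {
        io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        std::printf("request %u: %d bytes\n",
                    (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```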
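Likewise, the 6,000× pandas speedup above comes from the usual move: replace an interpreted per-row loop with one compiled, parallel pass over contiguous data. A minimal sketch of that technique with C++17 parallel algorithms; the sum-of-squares workload here is invented for illustration, not the linked post’s actual computation.

```cpp
// Parallel reduction with C++17 execution policies: the moral equivalent of
// a row-wise pandas apply + sum, done as one compiled, multi-threaded pass
// over contiguous memory.
// Build: g++ -O2 -std=c++17 reduce.cpp -ltbb   (libstdc++ uses TBB as backend)
#include <execution>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> xs(10'000'000, 1.5);    // toy data

    // Sum of x^2 over 10M elements, spread across all cores.
    const double sum_sq = std::transform_reduce(
        std::execution::par_unseq,              // parallel + vectorized
        xs.begin(), xs.end(),
        0.0,                                    // initial value
        std::plus<>(),                          // how partial results combine
        [](double x) { return x * x; });        // per-element transform

    std::printf("sum of squares = %.1f\n", sum_sq);
    return 0;
}
```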
Memory Locality:
- “What Every Programmer Should Know About Memory”, Drepper 2007
- “You’re Doing It Wrong: Think you’ve mastered the art of server performance? Think again.”, Poul-Henning Kamp 2010 (on using cache-oblivious B-heaps to optimize Varnish performance 10×)
- “The LMAX Architecture”, Fowler 2011
- “5000× faster CRDTs: An Adventure in Optimization”
- “In-place Superscalar Samplesort (IPS4o): Engineering In-place (Shared-memory) Sorting Algorithms”, Axtmann et al 2020
- “Fast approximate string matching with large edit distances in Big Data (2015)”
- “ParPaRaw: Massively Parallel Parsing of [CSV] Delimiter-Separated Raw Data”, Stehle & Jacobsen 2019
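A minimal illustration of why the links above obsess over layout: the two loops below do identical arithmetic on the same array, but the unit-stride traversal uses every byte of each cache line it fetches, while the strided one wastes most of them, so the second typically runs several times slower once the matrix outgrows cache. The matrix size and timing harness here are arbitrary choices; the cache-oblivious B-heap and IPS⁴o links apply roughly the same principle to heaps and to sorting.

```cpp
// Same data, same arithmetic, different traversal order.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t N = 4096;                         // 4096 x 4096 doubles = 128 MB
    std::vector<double> m(N * N, 1.0);             // row-major: element (i,j) is m[i*N + j]

    auto time_sum = [&](bool row_major, const char *label) {
        const auto t0 = std::chrono::steady_clock::now();
        double sum = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                sum += row_major ? m[i * N + j]    // stride 1: sequential, cache-friendly
                                 : m[j * N + i];   // stride N: jumps a whole row per step
        const auto t1 = std::chrono::steady_clock::now();
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%.0f, %lld ms\n", label, sum, (long long)ms);
    };

    time_sum(true,  "row-major (unit stride)");
    time_sum(false, "column-major (stride N)");
    return 0;
}
```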
DL/Deep Reinforcement Learning:
- Training A3C to solve Atari Pong in <4 minutes on the ARCHER supercomputer through brute parallelism
- “megastep helps you build 1-million FPS reinforcement learning environments on a single GPU”, Jones
- “The Mathematics of 2048: Optimal Play with Markov Decision Processes” (see the value-iteration sketch after this list)
- Bruteforcing Breakout (optimal play by depth-first search of an approximate MDP in an optimized C++ simulator for 6 CPU-years; crazy gameplay—like chess endgame tables, a glimpse of superintelligence)
- “Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS with Asynchronous Reinforcement Learning”, Petrenko et al 2020; “Megaverse: Simulating Embodied Agents at One Million Experiences per Second”, Petrenko et al 2021; “Brax: A Differentiable Physics Engine for Large Scale Rigid Body Simulation”, Freeman et al 2021; “Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”, Makoviychuk et al 2021 (Rudin et al 2021); WarpDrive
- “Large Batch Simulation for Deep Reinforcement Learning”, Shacklett et al 2021
- “Scaling Scaling Laws with Board Games”, Jones 2021
- “Mastering Real-Time Strategy Games with Deep Reinforcement Learning: Mere Mortal Edition”, Winter
- “Scaling down Deep Learning”, Greydanus 2020
- “Podracer architectures for scalable Reinforcement Learning”, Hessel et al 2021 (highly-efficient TPU pod use: eg. solving Pong in <1min at 43 million FPS on a TPU-2048); “Q-DAX: Accelerated Quality-Diversity for Robotics through Massive Parallelism”, Lim et al 2022; “EvoJAX: Hardware-Accelerated Neuroevolution”, Tang et al 2022
- “Training the Ant simulating at 1 million steps per second on CUDA (NVIDIA RTX 2080), using Tiny Differentiable Simulator and C++ Augmented Random Search.” (Erwin Coumans 2021)
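The 2048 and Breakout items above both reduce “optimal play” to solving a Markov decision process outright rather than learning a policy. Value iteration is one standard way to solve a small MDP exactly; here is a minimal sketch on an invented 3-state, 2-action toy (not the state space of either game, which needs the compressed or approximate representations and optimized simulators described in those links).

```cpp
// Value iteration on a toy MDP:
//   V(s) <- max_a  sum_{s'} P(s'|s,a) * ( R(s,a,s') + gamma * V(s') )
// Repeat until V stops changing; V is then the optimal value function.
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdio>

constexpr int    S     = 3;     // number of states
constexpr int    A     = 2;     // number of actions
constexpr double GAMMA = 0.9;   // discount factor

// Invented transition probabilities P[s][a][s'] and rewards R[s][a][s'].
const double P[S][A][S] = {
    {{0.8, 0.2, 0.0}, {0.1, 0.9, 0.0}},
    {{0.0, 0.5, 0.5}, {0.0, 0.1, 0.9}},
    {{1.0, 0.0, 0.0}, {0.0, 0.0, 1.0}},
};
const double R[S][A][S] = {
    {{0.0, 0.0, 0.0}, {0.0, 1.0, 0.0}},
    {{0.0, 0.0, 5.0}, {0.0, 0.0, 10.0}},
    {{0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}},
};

int main() {
    std::array<double, S> V{};                    // value estimates, start at 0
    for (int iter = 0; iter < 10'000; iter++) {
        std::array<double, S> Vnew{};
        double delta = 0.0;
        for (int s = 0; s < S; s++) {
            double best = -1e100;
            for (int a = 0; a < A; a++) {
                double q = 0.0;                   // expected return of action a in state s
                for (int s2 = 0; s2 < S; s2++)
                    q += P[s][a][s2] * (R[s][a][s2] + GAMMA * V[s2]);
                best = std::max(best, q);
            }
            Vnew[s] = best;
            delta = std::max(delta, std::fabs(best - V[s]));
        }
        V = Vnew;
        if (delta < 1e-9) break;                  // converged
    }
    for (int s = 0; s < S; s++)
        std::printf("V(%d) = %.3f\n", s, V[s]);
    return 0;
}
```

The same recursion applied to game-sized state spaces is where the engineering in those write-ups comes in (the Breakout item cites 6 CPU-years in an optimized C++ simulator).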
“A novice was trying to fix a broken Lisp machine by turning the power off and on. Knight, seeing what the student was doing, spoke sternly: ‘You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.’ Knight turned the machine off and on. The machine worked.”
AI scaling can continue even if semiconductors do not, particularly with self-optimizing stacks; John Carmack notes, apropos of Seymour Cray, that “Hyperscale data centers and even national supercomputers are loosely coupled things today, but if challenges demanded it, there is a world with a zetta[flops] scale, tightly integrated, low latency matrix dissipating a gigawatt in a swimming pool of circulating fluorinert.”