---
title: "Computer Optimization: Your Computer Is Faster Than You Think"
created: 2021-04-24
modified: 2021-04-24
status: finished
confidence: highly-likely
importance: 6
cssExtension: dropcaps-yinit
...
> You can have a second computer once you’ve shown you know how to use the first one.
>
> [Paul Barham](#mcsherry-et-al-2015)
- [Tacit knowledge](!W)/[mechanical sympathy](https://dzone.com/articles/mechanical-sympathy)
- ["Welcome to the Jungle"](https://herbsutter.com/welcome-to-the-jungle/) (CPU design and Moore’s law; cf. ["How Many Computers In Your Computer?"](/computers "'How Many Computers Are In Your Computer?', Branwen 2010"))
- ["There’s plenty of room at the Top: What will drive computer performance after Moore’s law?"](/doc/cs/hardware/2020-leiserson.pdf), Leiserson et al 2020 (gains from [**End-To-End Principle**](/doc/cs/end-to-end-principle/index)-based systems design^[AI scaling can continue even if semiconductors do not, particularly with [self-optimizing stacks](/tool-ai "'Why Tool AIs Want to Be Agent AIs', Branwen 2016"); [John Carmack](https://twitter.com/ID_AA_Carmack/status/1300280139071270914 "Seymour Cray was famous for packing, powering, and cooling circuits incredibly densely. Classic Crays were made obsolete by microprocessors, but we may yet do similar things at a larger scale. Hyperscale data centers and even national supercomputers are loosely coupled things today, but if challenges demanded it, there is a world with a zetta scale, tightly integrated, low latency matrix dissipating a gigawatt in a swimming pool of circulating fluorinert.") notes, apropos of [Seymour Cray](https://en.wikipedia.org/wiki/Seymour_Cray), that "Hyperscale data centers and even national supercomputers are loosely coupled things today, but if challenges demanded it, there is a world with a zetta[flops] scale, tightly integrated, low latency matrix dissipating a gigawatt in a swimming pool of circulating [fluorinert](https://en.wikipedia.org/wiki/Fluorinert)."])
- ["Scalability! But at what COST?"](/doc/cs/algorithm/2015-mcsherry.pdf), McSherry et al 2015 (when laptops > clusters: the importance of good baselines; ["Big Data" doesn't fit in a laptop](https://shkspr.mobi/blog/2021/04/whats-the-origin-of-the-phrase-big-data-doesnt-fit-in-excel/)); ["Command-line Tools can be 235× Faster than your Hadoop Cluster"](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html "Adam Drake 2014-01-18")
- ["What it takes to run Stack Overflow"](https://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow/), Nick Craver; ["We stand to save $7m over 5 years from our cloud exit"](https://world.hey.com/dhh/we-stand-to-save-7m-over-five-years-from-our-cloud-exit-53996caa), [David Heinemeier Hansson](!W); ["Production Twitter on One Machine? 100Gbps NICs and NVMe are fast"](https://thume.ca/2023/01/02/one-machine-twitter/)
- ["Anatomy of a hack: How crackers ransack passwords like `qeadzcwrsfxv1331`; For Ars, 3 crackers have at 16,000+ hashed passcodes---with 90% success"](https://arstechnica.com/information-technology/2013/05/how-crackers-make-minced-meat-out-of-your-passwords/)
- ["The Science of Brute Force"](https://cacm.acm.org/research/the-science-of-brute-force/)
- ["Bracket pair colorization 10,000× faster"](https://code.visualstudio.com/blogs/2021/09/29/bracket-pair-colorization)
- ["Parsing Gigabytes of JSON per Second"](https://arxiv.org/abs/1902.08318), Langdale & Lemire 2019
- [59 GB/s FizzBuzz](https://codegolf.stackexchange.com/questions/215216/high-throughput-fizz-buzz/236630#236630); ["Modern storage is plenty fast. It is the APIs that are bad."](https://itnext.io/modern-storage-is-plenty-fast-it-is-the-apis-that-are-bad-6a68319fbc1a); ["10M IOPS, one physical core. `#io_uring #linux`"](https://twitter.com/axboe/status/1452689372395053062); ["Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation"](https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/)
- ["The C10K problem"](http://www.kegel.com/c10k.html), Kegel 2001; ["Serving Netflix Video at 50 gigabyte/s on FreeBSD"](https://papers.freebsd.org/2021/eurobsdcon/gallatin-netflix-freebsd-400gbps/), Gallatin 2021/["Serving Netflix Video Traffic at 100 gigabyte/s and Beyond"](https://nabstreamingsummit.com/wp-content/uploads/2022/05/2022-Streaming-Summit-Netflix.pdf), Gallatin 2022
- ["True Stories of Algorithmic Improvement"](https://www.lesswrong.com/posts/RoyG2GwYAwoka47Ht/true-stories-of-algorithmic-improvement)
- [Ultra-low bit-rate audio for tiny podcasts with Codec2 and WaveNet decoders (1 hour = 1MB)](https://auphonic.com/blog/2018/06/01/codec2-podcast-on-floppy-disk/ "Codec2: a whole Podcast on a Floppy Disk")
- ["The computers are fast, but you don't know it"](https://shvbsle.in/computers-are-fast-but-you-dont-know-it-p1/) (replacing slow [pandas](!W "pandas (software)") with 6,000× faster parallelized C++)
- ["Scaling SQLite to 4M QPS on a Single Server (EC2 vs Bare Metal)"](https://use.expensify.com/blog/scaling-sqlite-to-4m-qps-on-a-single-server), Expensify; ["50% faster than 3.7.17"](https://sqlite-users.sqlite.narkive.com/CVRvSKBs/50-faster-than-3-7-17) ("... We have achieved this by incorporating hundreds of micro-optimizations. Each micro-optimization might improve the performance by as little as 0.05%.")
- ["Optimization story—quantum mechanics simulation speedup"](https://tinkering.xyz/fmo-optimization-story/); ["14,000× Speedup a.k.a. Computer Science For the Win"](http://james.hiebert.name/blog/work/2015/09/14/CS-FTW.html "James Hiebert 2015-09-14")
- ["Cutting The Pipe: Achieving Sub-Second Iteration Times"](/doc/cs/algorithm/2012-frykholm.pdf), Frykholm 2012 (optimizing game development)
- ["The Billion Row Challenge (1BRC): Step-by-step in Java from 71s to 1.7s"](https://questdb.io/blog/billion-row-challenge-step-by-step/ "Marko Topolnik 2024-02-20") (and [4s in Go](https://benhoyt.com/writings/go-1brc/) & [0.5s in C](https://github.com/gunnarmorling/1brc/discussions/710 "'Fastest known solution: 0.577s (8 core Zen2); C with heavy AVX2', Austin Donisan 2024-01-31"))
- ["GCC Compiler vs Human---119× Faster Assembly 💻🆚🧑‍💻"](https://ashvardanian.com/posts/gcc-12-vs-avx512fp16/ "Ash Vardanian 2023-10-23"); ["Python, C, Assembly---2,500× Faster Cosine Similarity 📐"](https://ashvardanian.com/posts/python-c-assembly-comparison/ "Ash Vardanian 2023-10-30")
- **[Demoscene](!W)**:
    - ["8088 Domination"](https://www.youtube.com/watch?v=MWdG413nNkI "8088 Domination: Video capture from an IBM PC 5160"): ["how I created 8088 Domination, which is a program that displays fairly decent full-motion video on a 1981 IBM PC"](https://trixter.oldskool.org/2014/06/19/8088-domination-post-mortem-part-1/)/[part 2](https://trixter.oldskool.org/2014/06/20/8088-domination-post-mortem-conclusion/ "8088 Domination Post-Mortem, Conclusion")
    - ["A Mind Is Born"](https://linusakesson.net/scene/a-mind-is-born/) (256 bytes; for comparison, this line of Markdown linking the demo is 194 bytes long, and the video several million bytes)
    - Casey Muratori: [refterm v2](https://github.com/cmuratori/refterm), [_Witness_ pathing](https://www.youtube.com/watch?v=Ge3aKEmZcqY "Simple Code, High Performance") ([summary](https://www.reddit.com/r/programming/comments/q76ea8/simple_code_high_performance_lecture_by_casey/hghky4o/))
- [**Algorithmic Progress**](/doc/economics/experience-curve/index):
    - ["Algorithmic Progress in 6 Domains"](/doc/ai/scaling/2013-grace.pdf#miri "'Algorithmic Progress in Six Domains', Grace 2013"), Grace 2013
    - ["A Time Leap Challenge for SAT Solving"](https://arxiv.org/abs/2008.02215), Fichte et al 2020 (hardware overhang: ~2.5× performance increase in [SAT solving](https://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT) since 2000, about equally due to software & hardware gains, although slightly more software---new software on old hardware beats old software on new hardware)
    - [2019 Unlocking of the LCS35 Challenge](/self-decrypting#unlocking-of-lcs35)
    - ["Measuring hardware overhang"](https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang), hippke ("with today's algorithms, computers would have beat the world chess champion already in 1994 on a contemporary desk computer")
- **Memory Locality**:
    - ["What Every Programmer Should Know About Memory"](/doc/cs/hardware/2007-drepper.pdf), Drepper 2007
    - ["You’re Doing It Wrong: Think you’ve mastered the art of server performance? Think again."](https://queue.acm.org/detail.cfm?id=1814327), [Poul-Henning Kamp](https://en.wikipedia.org/wiki/Poul-Henning_Kamp) 2010 (on using [cache-oblivious](https://en.wikipedia.org/wiki/Cache-oblivious_algorithm) [B-heaps](https://en.wikipedia.org/wiki/B-heap) to optimize Varnish performance 10×)
    - ["The LMAX Architecture"](https://martinfowler.com/articles/lmax.html), Fowler 2011 ([TigerBeetle](https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/DESIGN.md#architecture))
    - ["5000× faster CRDTs: An Adventure in Optimization"](https://josephg.com/blog/crdts-go-brrr/)
    - ["In-place Superscalar Samplesort (IPS^4^o): Engineering In-place (Shared-memory) Sorting Algorithms"](https://arxiv.org/abs/2009.13569 "'Engineering In-place (Shared-memory) Sorting Algorithms', Axtmann et al 2020"), Axtmann et al 2020; ["Vectorized and performance-portable Quicksort"](https://arxiv.org/abs/2205.05982#google), Blacher et al 2022
    - ["Fast approximate string matching with large edit distances in Big Data (2015)"](https://wolfgarbe.medium.com/fast-approximate-string-matching-with-large-edit-distances-in-big-data-2015-9174a0968c0b)
    - ["ParPaRaw: Massively Parallel Parsing of \[CSV\] Delimiter-Separated Raw Data"](https://arxiv.org/abs/1905.13415 "'ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data', Stehle & Jacobsen 2019"), Stehle & Jacobsen 2019
    - ["Scaling our Spreadsheet Engine from Thousands to Billions of Cells: From Maps To Arrays"](https://www.causal.app/blog/scaling), Lukas Köbis 2022
    - ["Persistent RNNs: Stashing Recurrent Weights On-Chip"](/doc/ai/nn/rnn/2016-diamos.pdf#baidu), Diamos et al 2016; ["FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"](https://arxiv.org/abs/2205.14135), Dao et al 2022
    - [_Algorithms for Modern Hardware_](https://en.algorithmica.org/hpc/)
- [**DL**](/doc/ai/nn/sparsity/index){#gwern-notes-sparsity}/**Deep Reinforcement Learning**:
    - [Training A3C to solve Atari Pong in <4 minutes on the ARCHER supercomputer through brute parallelism](https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/deep-learning-episode-4-supercomputer-vs-pong-ii)
    - ["`megastep` helps you build 1-million FPS reinforcement learning environments *on a single GPU*"](https://andyljones.com/megastep/), Jones
    - ["The Mathematics of _2048_: Optimal Play with Markov Decision Processes"](https://jdlm.info/articles/2018/03/18/markov-decision-process-2048.html)
    - [Bruteforcing _Breakout_](https://tasvideos.org/6347S "Submission #6347: Chef Stef's NES Arkanoid 'warpless' in 11:11.18") (optimal play by depth-first search of an approximate MDP in an optimized C++ simulator for 6 CPU-years; crazy gameplay---like [chess endgame tables](/doc/ai/1985-michie.pdf "'Human Window on the World', Michie 1985"), a glimpse of superintelligence)
    - ["Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS with Asynchronous Reinforcement Learning"](https://arxiv.org/abs/2006.11751#intel), Petrenko et al 2020; ["Megaverse: Simulating Embodied Agents at One Million Experiences per Second"](https://arxiv.org/abs/2107.08170#intel), Petrenko et al 2021; ["Brax: A Differentiable Physics Engine for Large Scale Rigid Body Simulation"](https://arxiv.org/abs/2106.13281#google "'Brax—A Differentiable Physics Engine for Large Scale Rigid Body Simulation', Freeman et al 2021"), Freeman et al 2021; ["Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning"](https://arxiv.org/abs/2108.10470#nvidia), Makoviychuk et al 2021 ([Rudin et al 2021](https://arxiv.org/abs/2109.11978#nvidia "Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning")); [WarpDrive](https://arxiv.org/abs/2108.13976#salesforce "'WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU', Lan et al 2021")
    - ["Large Batch Simulation for Deep Reinforcement Learning"](https://arxiv.org/abs/2103.07013), Shacklett et al 2021
    - ["Scaling Scaling Laws with Board Games"](https://arxiv.org/abs/2104.03113), Jones 2021
    - ["Mastering Real-Time Strategy Games with Deep Reinforcement Learning: Mere Mortal Edition"](https://clemenswinter.com/2021/03/24/mastering-real-time-strategy-games-with-deep-reinforcement-learning-mere-mortal-edition/), Winter
    - ["Scaling down Deep Learning"](https://greydanus.github.io/2020/12/01/scaling-down/), Greydanus 2020
    - ["Podracer architectures for scalable Reinforcement Learning"](https://arxiv.org/abs/2104.06272#deepmind), Hessel et al 2021 (highly-efficient TPU pod use: eg. solving Pong in <1min at 43 million FPS on a TPU-2048); ["Q-DAX: Accelerated Quality-Diversity for Robotics through Massive Parallelism"](https://arxiv.org/abs/2202.01258 "'Accelerated Quality-Diversity for Robotics through Massive Parallelism', Lim et al 2022"), Lim et al 2022; ["EvoJAX: Hardware-Accelerated Neuroevolution"](https://arxiv.org/abs/2202.05008#google), Tang et al 2022; ["JaxMARL: Multi-Agent RL Environments in JAX"](https://arxiv.org/abs/2311.10090), Rutherford et al 2023
    - ["Training the Ant simulating at 1 million steps per second on CUDA (NVIDIA RTX 2080), using Tiny Differentiable Simulator and C++ Augmented Random Search."](https://www.youtube.com/watch?v=-MfUXSAehnw "Training a CUDA TDS Ant using C++ ARS Linear policy: The video is real-time, after a few minutes (in the 30 million steps) the training curve is flat (I trained until a billion steps). Note that this Ant is PD control, and not identical to either MuJoCo or PyBullet Ant, so the training curves are not comparable yet. Will fix that.") (Erwin Coumans 2021)
    - ["TextWorldExpress: Simulating Text Games at One Million Steps Per Second"](https://arxiv.org/abs/2208.01174), Jansen & Côté 2022
- **Latency/UI**:
    - ["It’s the Latency, Stupid."](http://www.stuartcheshire.org/rants/latency.html)
    - ["Transatlantic ping faster than sending a pixel to the screen?"](https://superuser.com/questions/419070/transatlantic-ping-faster-than-sending-a-pixel-to-the-screen)
    - ["Computer latency: 1977--2017"](https://danluu.com/input-lag/)
    - ["Keyboard Latency"](https://danluu.com/keyboard-latency/ "'Keyboard latency', Luu 2017")
    - ["Terminal Latency"](https://danluu.com/term-latency/)
    - ["Local-first software: You own your data, in spite of the cloud"](https://www.inkandswitch.com/local-first.html "'Local-first software: You own your data, in spite of the cloud [web]', Kleppmann et al 2019")/[Kleppmann et al 2019](/doc/cs/algorithm/2019-kleppmann.pdf "Local-First Software: You Own Your Data, in spite of the Cloud")
    - ["Web Bloat"](https://danluu.com/web-bloat/)
    - ["Project Starline: A high-fidelity telepresence system"](https://gwern.net/doc/sociology/technology/2021-lawrence.pdf#google), Lawrence et al 2021
    - ["Use GPT-3 incorrectly: reduce costs 40× and increase speed by 5×"](https://www.buildt.ai/blog/incorrectusage), Pullen 2023
    - [HFT](!W "High-frequency trading#Low-latency strategies"): ["The accidental HFT firm"](https://web.archive.org/web/20210205014443/https://meanderful.blogspot.com/2018/01/the-accidental-hft-firm.html)
    - [Shenzhen](https://en.wikipedia.org/wiki/Shenzhen#Industry)
- **Organizations**:
    - [WhatsApp](!W) (2013: "50 staff members" for 200m users; 2015: [50 "engineers", 900m users](https://www.wired.com/2015/09/whatsapp-serves-900-million-users-50-engineers/ "‘Why WhatsApp Only Needs 50 Engineers for Its 900M Users: One of the (many) intriguing parts of the WhatsApp story is that it has achieved such enormous scale with such a tiny team’, Metz 2015"))
    - [Craigslist](!W) (2017: ~50 employees)
    - [Instagram](!W) (2012 FB acquisition: 13 employees; [architecture](https://read.engineerscodex.com/p/how-instagram-scaled-to-14-million))
    - [YouTube](!W) (2006 Google acquisition: ~10 employees)
    - [Reddit](!W) (2011: [4 engineers](https://www.reddit.com/r/announcements/comments/efpn0/comment/c17qsmr/); 2021: ~700 total staff, >52m users)
    - [Wikimedia Foundation](!W) (["Internet hosting" in 2021](https://upload.wikimedia.org/wikipedia/foundation/1/1e/Wikimedia_Foundation_FY2020-2021_Audit_Report.pdf#page=5 "WIKIMEDIA FOUNDATION, INC. Consolidated Statements of Activities: Years ended June 30, 2021 and 2020"): [$2.4]($2021)m)
    - Twitter: the Musk acquisition cut up to 90% of employees & contractors; the widespread, confident predictions by ex-Twitter staff that the site would suffer crashes (possibly unrecoverable ones) within a month or two turned out to be false.
- [See Also]{.smallcaps}: ["Some examples of people quickly accomplishing ambitious things together."](https://patrickcollison.com/fast)
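The SQLite mailing-list post quoted above claims a ~50% cumulative speedup from "hundreds of micro-optimizations", each worth "as little as 0.05%". A back-of-envelope sketch (mine, not the SQLite developers') checks that arithmetic: multiplicative gains compound, so roughly 800 such tiny wins suffice:

```python
import math

def compound_speedup(per_opt_gain: float, n_opts: int) -> float:
    """Cumulative speedup factor from n_opts multiplicative micro-optimizations."""
    return (1 + per_opt_gain) ** n_opts

def opts_needed(per_opt_gain: float, target: float) -> int:
    """Number of micro-optimizations of a given size needed to hit a target speedup."""
    return math.ceil(math.log(target) / math.log(1 + per_opt_gain))

# ~812 optimizations of 0.05% each compound to a 50% overall speedup:
n = opts_needed(0.0005, 1.5)
print(n)                                   # → 812
print(round(compound_speedup(0.0005, n), 2))  # → 1.5
```

The moral matches the post: no single change needs to be dramatic; a disciplined accumulation of sub-0.1% improvements, none of which would survive a cost-benefit analysis on its own, is how a mature codebase gets 50% faster.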
> A novice was trying to fix a broken [Lisp machine](!W) by turning the power off and on.
>
> Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”
>
> Knight turned the machine off and on.
>
> The machine worked.
>
> ["Tom Knight and the Lisp Machine"](http://www.catb.org/jargon/html/koans.html), _AI Koans_