Sidenotes (Link Bibliography)

“Sidenotes” links:





  5. Red






  11. ⁠, Dave Liepmann (Tufte-CSS):

    One of the most distinctive features of Tufte’s style is his extensive use of sidenotes. Sidenotes are like footnotes, except they don’t force the reader to jump their eye to the bottom of the page, but instead display off to the side in the margin. Perhaps you have noticed their use in this document already. You are very astute.

    Sidenotes are a great example of the web not being like print. On sufficiently large viewports, Tufte CSS uses the margin for sidenotes, margin notes, and small figures. On smaller viewports, elements that would go in the margin are hidden until the user toggles them into view. The goal is to present related but not necessary information such as asides or citations as close as possible to the text that references them. At the same time, this secondary information should stay out of the way of the eye, not interfering with the progression of ideas in the main text.

    …If you want a sidenote without footnote-style numberings, then you want a margin note. Notice there isn’t a number preceding the note. On large screens, a margin note is just a sidenote that omits the reference number. This lessens the distracting effect taking away from the flow of the main text, but can increase the cognitive load of matching a margin note to its referent text.






















  33. ⁠, Andy Matuschak, Michael Nielsen (2019-10):

    [Long writeup by Andy Matuschak and Michael Nielsen on an experiment in integrating spaced repetition systems with a tutorial on quantum computing, Quantum Country: Quantum Computing For The Very Curious. By combining explanation with spaced testing, a notoriously thorny subject may be learned more easily and then actually remembered; such a system demonstrates a possible ‘tool for thought’. Early results indicate users do indeed remember the quiz answers, and feedback has been positive.]

    Part I: Memory systems

    • Introducing the mnemonic medium
    • The early impact of the prototype mnemonic medium
    • Expanding the scope of memory systems: what types of understanding can they be used for?
    • Improving the mnemonic medium: making better cards
    • Two cheers for mnemonic techniques
    • How important is memory, anyway?
    • How to invent Hindu-Arabic numerals?

    Part II: Exploring tools for thought more broadly

    • Mnemonic video

    • Why isn’t there more work on tools for thought today?

    • Questioning our basic premises

      • What if the best tools for thought have already been discovered?
      • Isn’t this what the tech industry does? Isn’t there a lot of ongoing progress on tools for thought?
      • Why not work on AGI or BCI instead?
    • Executable books

      • Serious work and the aspiration to canonical content
      • Stronger emotional connection through an inverted writing structure

    Summary and Conclusion

    … in Quantum Country an expert writes the cards, an expert who is skilled not only in the subject matter of the essay, but also in strategies which can be used to encode abstract, conceptual knowledge. And so Quantum Country provides a much more scalable approach to using memory systems to do abstract, conceptual learning. In some sense, Quantum Country aims to expand the range of subjects users can comprehend at all. In that, it has very different aspirations to all memory systems.

    More generally, we believe memory systems are a far richer space than has previously been realized. Existing memory systems barely scratch the surface of what is possible. We’ve taken to thinking of Quantum Country as a memory laboratory. That is, it’s a system which can be used both to better understand how memory works, and also to develop new kinds of memory system. We’d like to answer questions such as:

    • What are new ways memory systems can be applied, beyond the simple, declarative knowledge of past systems?
    • How deep can the understanding developed through a memory system be? What patterns will help users deepen their understanding as much as possible?
    • How far can we raise the human capacity for memory? And with how much ease? What are the benefits and drawbacks?
    • Might it be that one day most human beings will have a regular memory practice, as part of their everyday lives? Can we make it so memory becomes a choice; is it possible to in some sense solve the problem of memory?



  37. ⁠, Alex Bäuerle, James Wexler (PAIR) (2020-01-13):

    BERT, a neural network published by Google in 2018, excels in natural language understanding. It can be used for multiple different tasks, such as sentiment analysis or next sentence prediction, and has recently been integrated into Google Search. This novel model has brought a big change to language modeling as it outperformed all its predecessors on multiple different tasks. Whenever such breakthroughs in deep learning happen, people wonder how the network manages to achieve such impressive results, and what it actually learned. A common way of looking into neural networks is feature visualization. The ideas of feature visualization are borrowed from Deep Dream, where we can obtain inputs that excite the network by maximizing the activation of neurons, channels, or layers of the network. This way, we get an idea about which part of the network is looking for what kind of input.

    In Deep Dream, inputs are changed through gradient descent to maximize activation values. This can be thought of as similar to the initial training process, where through many iterations, we try to optimize a mathematical equation. But instead of updating network parameters, Deep Dream updates the input sample. What this leads to is somewhat psychedelic but very interesting images, that can reveal to what kind of input these neurons react.

    [Figure caption:] Examples of Deep Dream processes with images from the original Deep Dream blogpost. Here, they take a randomly initialized image and use Deep Dream to transform the image by maximizing the activation of the corresponding output neuron. This can show what a network has learned about different classes or for individual neurons.
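
    [Illustration, not part of the quoted post: the idea of updating the input rather than the parameters can be sketched with a toy single "neuron". The `tanh` unit, the weights, the step size, and the function names below are assumptions chosen for readability; nothing here is BERT or the original Deep Dream code.]

```python
import math

def activation(weights, x):
    """Toy 'neuron': tanh of a weighted sum of the input."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))

def dream(weights, x, steps=50, lr=0.1):
    """Gradient ascent on the input x; the weights stay frozen."""
    x = list(x)
    for _ in range(steps):
        a = activation(weights, x)
        # d tanh(w.x) / dx_i = (1 - a^2) * w_i
        grad = [(1.0 - a * a) * w for w in weights]
        # Move the input in the direction that raises the activation.
        x = [xi + lr * g for xi, g in zip(x, grad)]
    return x

weights = [0.5, -1.0, 2.0]   # hypothetical frozen parameters
start = [0.0, 0.0, 0.0]      # stand-in for a randomly initialized input
dreamed = dream(weights, start)
print(activation(weights, start), activation(weights, dreamed))
```

    [The parameters are never modified; only the input drifts toward whatever the unit responds to most strongly.]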

    Feature visualization works well for image-based models, but has not yet been widely explored for language models. This blogpost will guide you through experiments we conducted with feature visualization for BERT. We show how we tried to get BERT to dream of highly activating inputs, provide visual insights of why this did not work out as well as we hoped, and publish tools to explore this research direction further.

    When dreaming for images, the input to the model is gradually changed. Language, however, is made of discrete structures, i.e., tokens, which represent words or word-pieces. Thus, there is no such gradual change to be made…Looking at a single pixel in an input image, such a change could be gradually going from green to red. The green value would slowly go down, while the red value would increase. In language, however, we cannot slowly go from the word “green” to the word “red”, as everything in between does not make sense.

    To still be able to use Deep Dream, we have to utilize the so-called Gumbel-Softmax trick, which has already been employed in a paper by ⁠. This trick was introduced by Jang et al. and Maddison et al. It allows us to soften the requirement for discrete inputs, and instead use a linear combination of tokens as input to the model. To ensure that we do not end up with something crazy, it uses two mechanisms. First, it constrains this linear combination so that the linear weights sum up to one. This, however, still leaves the problem that we can end up with any linear combination of such tokens, including ones that are not close to real tokens in the embedding space. Therefore, we also make use of a temperature parameter, which controls the sparsity of this linear combination. By slowly decreasing this temperature value, we can make the model first explore different linear combinations of tokens, before deciding on one token.
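
    [Illustration, not part of the quoted post: a minimal sketch of the two mechanisms described above, weights that sum to one and a temperature that controls sparsity. The logits, the 2-dimensional embeddings, and the function names are hypothetical; the authors' actual implementation is not shown in this excerpt.]

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Sample soft one-hot weights over tokens using Gumbel noise and a softmax."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) for u uniform in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax: weights are positive and sum to one.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(weights, embeddings):
    """Linear combination of token embeddings, fed to the model instead of one token."""
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]

rng = random.Random(0)
logits = [1.0, 0.5, -0.5]                            # hypothetical scores for 3 tokens
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # hypothetical embeddings
for temperature in (5.0, 1.0, 0.1):
    weights = gumbel_softmax(logits, temperature, rng)
    print(temperature, [round(w, 3) for w in weights], soft_embedding(weights, embeddings))
```

    [At a high temperature the weights are spread across tokens; as the temperature falls, the same noisy scores concentrate on a single token, which is the "deciding on one token" behavior the passage describes.]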

    …The lack of success in dreaming words to highly activate specific neurons was surprising to us. This method uses gradient descent and seemed to work for other models (see Poerner et al 2018). However, BERT is a complex model, arguably much more complex than the models that have been previously investigated with this method.






  43. ⁠, James Yu, GPT-3 (Sudowrite) (2020-08-20):

    [Fiction writing exercise by James Yu⁠, using OpenAI’s GPT-3 as a coauthor and interlocutor, to write an SF story about AIs and the ⁠. Rather than edit GPT-3 output, Yu writes most passages and alternates with GPT-3 completions. Particularly striking for the use of meta-fictional discussion, presented in sidenotes, where Yu and GPT-3 debate the events of the story: “I allowed GPT-3 to write crucial passages, and each time, I chatted with it ‘in character’, prompting it to role-play.”]

    In each of these stories, colored text indicates a passage written by GPT-3. I used the Sudowrite app to generate a set of possibilities, primed with the story’s premise and a few paragraphs.

    I chatted with GPT-3 about the passage, prompting it to roleplay as the superintelligent AI character in each story. I question the AI’s intent, leading to a meta-exchange where we both discover and create the fictional narrative in parallel. This kind of interaction—where an author can spontaneously talk to their characters—can be an effective tool for creative writing. And at times, it can be quite unsettling.

    Can GPT-3 hold beliefs? Probably not, since it is simply a pile of word vectors. However, these transcripts could easily fool me into believing that it does.








  51. sidenotes.js


  53. 2002-scholz-radiance

  54. popups.js: ⁠, Said Achmiz (2019-08-21; wikipedia):

    popups.js: standalone JavaScript library for creating ‘popups’ which display link metadata (typically, title/author/date/summary), for extremely convenient reference/abstract reading, with mobile and YouTube support. Whenever any such link is moused over by the user, popups.js will pop up a large tooltip-like square with the contents of the attributes. This is particularly intended for references, where it is extremely convenient to autopopulate links such as to Pubmed/PLOS/Wikipedia with the link’s title/author/date/abstract, so the reader can see it instantly.

    popups.js parses an HTML document and looks for <a> links which have the docMetadata attribute class, and the attributes data-popup-title, data-popup-author, data-popup-date, data-popup-doi, data-popup-abstract. (These attributes are expected to be populated already by the HTML document’s compiler; however, they can also be populated dynamically. See for an example of a library which does Wikipedia-only dynamically on page loads.)
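
    [Illustration: popups.js itself is JavaScript running in the browser. The scanning step described above can be sketched offline with Python's standard `html.parser`; the class name `DocMetadataLinks`, the sample page, and its URLs are hypothetical, and only the attribute names come from the description.]

```python
from html.parser import HTMLParser

# Suffixes of the data-popup-* attributes named in the description.
POPUP_FIELDS = ("title", "author", "date", "doi", "abstract")

class DocMetadataLinks(HTMLParser):
    """Collect <a class="docMetadata"> links and their data-popup-* attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "docMetadata" not in classes:
            return  # ordinary link: no popup metadata expected
        record = {"href": attributes.get("href")}
        for field in POPUP_FIELDS:
            # Missing attributes are recorded as None rather than guessed.
            record[field] = attributes.get("data-popup-" + field)
        self.links.append(record)

# Hypothetical page fragment; the URLs and metadata are made up.
page = (
    '<p><a class="docMetadata" href="https://example.com/paper" '
    'data-popup-title="An Example Paper" data-popup-author="A. Author" '
    'data-popup-date="2019-01-01" data-popup-abstract="Short summary.">paper</a> '
    'and <a href="https://example.com/plain">a plain link</a>.</p>'
)
parser = DocMetadataLinks()
parser.feed(page)
for link in parser.links:
    print(link)
```

    [Only the first link is collected; the plain link lacks the docMetadata class and is skipped.]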

    For an example of a Hakyll library which generates annotations for Wikipedia/Biorxiv/PDFs/arbitrarily-defined links, see LinkMetadata.hs.



  57. 2005-wallace.pdf: “Host: Deep into the mercenary world of take-no-prisoners political talk radio”⁠, David Foster Wallace

  58. 2005-wallace-redesign.pdf: “Host: Deep into the mercenary world of take-no-prisoners political talk radio [footnote redesign]”⁠, Said Achmiz

  59. ⁠, Alvaro de Menard (2020-01-17):

    [Summary of the debate that gripped Western classical literary scholarship for centuries: who wrote the Iliad/Odyssey, when, and how? They appear in Greek history out of nowhere: 2 enormously lengthy, sophisticated, beautiful, canonical, unified works that would dominate Western literature for millennia, and yet appeared to draw on no earlier tradition, nor did Homer have any earlier (non-spurious) works. How was this possible?

    The iconoclastic Analysts proposed it was a fraud, and the works were pieced together later out of scraps from many earlier poets. The Unitarians pointed to the overall quality; the complex (apparently planned) structure; the disagreements of Analysts on what parts were what pieces; and the Analysts’ inability to explain many anomalies in Homer: there are passages splicing together Greek dialects, passages which were metrical only given long-obsolete Greek letters/pronunciations, and even individual words which mixed up Greek dialects! (Not that these anomalies were all that much easier to explain by the Unitarian hypothesis of a single author).

    The eventual resolution relied on an old hypothesis: that Homer was in fact the product of a lost oral tradition. There was, unfortunately, no particular evidence for it, and so it never made any headway against the Analysts or Unitarians, until Milman Parry found a living oral tradition of epic poetry in the Balkans, and discovered in it all the signs of the Homeric poems, from repetitive epithets to a patchwork of dialects, and thus empirical examples of how long oral traditions could produce a work like Homer if one of them happened to get written down at some point.]



  62. ⁠, Sam McCandlish, Jared Kaplan, Dario Amodei (2018-12-14):

    We’ve discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks. Since complex tasks tend to have noisier gradients, increasingly large batch sizes are likely to become useful in the future, removing one potential limit to further growth of AI systems. More broadly, these results show that neural network training need not be considered a mysterious art, but can be rigorized and systematized.

    In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However, the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.

    The gradient noise scale (appropriately averaged over training) explains the vast majority (R² = 80%) of the variation in critical batch size over a range of tasks spanning six orders of magnitude. Batch sizes are measured in either number of images, tokens (for language models), or observations (for games).
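
    [Illustration, not part of the quoted text: the excerpt does not give a formula, so this sketch assumes the paper's "simple" noise scale, the trace of the per-example gradient covariance divided by the squared norm of the mean gradient. It uses a naive estimate from a handful of made-up per-example gradients; the function name and data are hypothetical, and the paper's own estimator is not reproduced here.]

```python
def simple_noise_scale(per_example_grads):
    """Naive estimate of tr(Sigma) / |G|^2 from per-example gradient vectors."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    # G: the mean gradient across examples.
    mean = [sum(g[d] for g in per_example_grads) / n for d in range(dim)]
    # tr(Sigma): the sample variance summed over coordinates.
    trace_cov = sum(
        sum((g[d] - mean[d]) ** 2 for g in per_example_grads) / (n - 1)
        for d in range(dim)
    )
    grad_norm_sq = sum(m * m for m in mean)
    return trace_cov / grad_norm_sq

# Two made-up sets of per-example gradients with the same mean, [1, 1].
consistent = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.0]]
noisy = [[3.0, -1.0], [-1.0, 3.0], [2.0, 2.0], [0.0, 0.0]]
print(simple_noise_scale(consistent), simple_noise_scale(noisy))
```

    [When examples agree, the ratio is small and extra examples per batch are largely redundant; when they disagree, the ratio is large and bigger batches still add information.]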

    …We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients, we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data…We’ve found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn’t lead to faster training. There is a “bend” in the curve in the middle, and the gradient noise scale predicts where that bend occurs.
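
    [Illustration, not part of the quoted text: the "bend" can be reproduced with a simple model, assumed here rather than taken from the excerpt, in which the number of serial steps is S = S_min · (1 + B_crit / B) and the examples processed are E = B · S. The numbers and the function name are hypothetical.]

```python
def training_cost(batch_size, critical_batch, min_steps):
    """Return (optimizer steps, examples processed) under the assumed tradeoff model."""
    # Serial steps shrink with batch size but never drop below min_steps.
    steps = min_steps * (1.0 + critical_batch / batch_size)
    # Examples processed, a proxy for total compute.
    examples = batch_size * steps
    return steps, examples

critical_batch = 1000.0   # hypothetical critical batch size, in examples
min_steps = 10_000.0      # hypothetical minimum number of serial steps
for batch_size in (10, 20, 1000, 100_000, 200_000):
    steps, examples = training_cost(batch_size, critical_batch, min_steps)
    print(f"B={batch_size:>7}  steps={steps:>12.0f}  examples={examples:>14.0f}")
```

    [Under this model, going from a batch of 10 to 20 nearly halves the steps at almost no extra compute, while going from 100,000 to 200,000 nearly doubles the compute for almost no reduction in steps; the bend sits near the critical batch size.]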

    Increasing parallelism makes it possible to train more complex models in a reasonable amount of time. We find that a Pareto frontier chart is the most intuitive way to visualize comparisons between algorithms and scales.

    …more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there’s some evidence that the increasing noise scale over training isn’t just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable. Second, tasks that are subjectively more difficult are also more amenable to parallelization…we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute.






  68. ⁠, Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, Mark McGranaghan (Ink & Switch) (2019-04):

    [PDF version]

    Cloud apps like Google Docs and Trello are popular because they enable real-time collaboration with colleagues, and they make it easy for us to access our work from all of our devices. However, by centralizing data storage on servers, cloud apps also take away ownership and agency from users. If a service shuts down, the software stops functioning, and data created with that software is lost.

    In this article we propose “local-first software”: a set of principles for software that enables both collaboration and ownership for users. Local-first ideals include the ability to work offline and collaborate across multiple devices, while also improving the security, privacy, long-term preservation, and user control of data.

    We survey existing approaches to data storage and sharing, ranging from email attachments to web apps to Firebase-backed mobile apps, and we examine the trade-offs of each. We look at Conflict-free Replicated Data Types (CRDTs): data structures that are multi-user from the ground up while also being fundamentally local and private. CRDTs have the potential to be a foundational technology for realizing local-first software.
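
    [Illustration, not part of the quoted article: one of the simplest CRDTs is a grow-only counter, in which each replica edits only its own slot and replicas merge by taking the per-slot maximum. The class below is a generic textbook sketch with hypothetical replica names; it is not the library or data model used in the Ink & Switch prototypes.]

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        # A replica only ever changes its own slot, so local edits never conflict.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so merges can happen
        # in any order, any number of times, and replicas still converge.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self):
        return sum(self.counts.values())

laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(2)   # edits made offline on one device
phone.increment(3)    # edits made offline on another
laptop.merge(phone)
phone.merge(laptop)
laptop.merge(phone)   # repeating a merge changes nothing
print(laptop.value(), phone.value())
```

    [Each device works on its own copy without a server, and exchanging state in either direction brings both to the same total.]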

    We share some of our findings from developing local-first software prototypes at Ink & Switch over the course of several years. These experiments test the viability of CRDTs in practice, and explore the user interface challenges for this new data model. Lastly, we suggest some next steps for moving towards local-first software: for researchers, for app developers, and a startup opportunity for entrepreneurs.

    …in the cloud, ownership of data is vested in the servers, not the users, and so we became borrowers of our own data. The documents created in cloud apps are destined to disappear when the creators of those services cease to maintain them. Cloud services defy long-term preservation. No Wayback Machine can restore a sunsetted web application. The Internet Archive cannot preserve your Google Docs.

    In this article we explored a new way forward for software of the future. We have shown that it is possible for users to retain ownership and control of their data, while also benefiting from the features we associate with the cloud: seamless collaboration and access from anywhere. It is possible to get the best of both worlds.

    But more work is needed to realize the local-first approach in practice. Application developers can take incremental steps, such as improving offline support and making better use of on-device storage. Researchers can continue improving the algorithms, programming models, and user interfaces for local-first software. Entrepreneurs can develop foundational technologies such as CRDTs and peer-to-peer networking into mature products able to power the next generation of applications.

    • Motivation: collaboration and ownership

    • Seven ideals for local-first software

      • No spinners: your work at your fingertips
      • Your work is not trapped on one device
      • The network is optional
      • Seamless collaboration with your colleagues
      • The Long Now
      • Security and privacy by default
      • You retain ultimate ownership and control
    • Existing data storage and sharing models

      • How application architecture affects user experience
        • Files and email attachments
        • Web apps: Google Docs, Trello, Figma
        • Dropbox, Google Drive, Box, OneDrive, etc.
        • Git and GitHub
      • Developer infrastructure for building apps
        • Web app (thin client)
        • Mobile app with local storage (thick client)
        • Backend-as-a-Service: Firebase, CloudKit, Realm
        • CouchDB
    • Towards a better future

      • CRDTs as a foundational technology
      • Ink & Switch prototypes
        • Trello clone
        • Collaborative drawing
        • Media canvas
        • Findings
      • How you can help
        • For distributed systems and programming languages researchers
        • For Human-Computer Interaction (HCI) researchers
        • For practitioners
        • Call for startups
    • Conclusions