GPT-3-nonfiction (Link Bibliography)

GPT-3-nonfiction links:

  1. GPT-3

  2. GPT-3#prompt-programming

  3. GPT-3#bpes

  4. GPT-3#effective-prompt-programming

  5. https://arxiv.org/pdf/2005.14165.pdf#page=23

  6. https://nitter.hu/j_erhardt

  7. https://arxiv.org/pdf/2005.14165.pdf#page=17

  8. https://leaderboard.allenai.org/break_high_level/submissions/public

  9. http://gptprompts.wikidot.com/context-stuffing

  10. “Theoretical Limitations of Self-Attention in Neural Sequence Models”, Michael Hahn (2019-06-16):

    Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length. These limitations seem surprising given the practical success of self-attention and the prominent role assigned to hierarchical structure in linguistics, suggesting that natural language can be approximated well with models that are too weak for the formal languages typically assumed in theoretical linguistics.

  11. “Universal Transformers”, Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser (2018-07-10):

    Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.
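
    The core architectural idea (one self-attention block whose weights are shared across depth, iterated with a learned per-position halting signal) can be sketched in a few lines. The following PyTorch sketch is illustrative rather than the authors’ implementation; the block choice, step limit, and halting threshold are assumptions made for clarity.

    ```python
    import torch
    import torch.nn as nn

    class UniversalTransformerSketch(nn.Module):
        """Minimal sketch of the Universal Transformer idea: the *same* Transformer
        block is applied repeatedly in depth, with an ACT-style per-position halting
        signal. Hyperparameters here are illustrative, not the paper's."""
        def __init__(self, d_model=512, n_heads=8, max_steps=8, halt_threshold=0.99):
            super().__init__()
            self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.halt = nn.Linear(d_model, 1)  # per-position halting probability
            self.max_steps, self.halt_threshold = max_steps, halt_threshold

        def forward(self, x):                  # x: (batch, seq, d_model)
            cum_halt = torch.zeros(x.shape[:2], device=x.device)
            for _ in range(self.max_steps):    # recurrence in depth with shared weights
                x = self.block(x)
                cum_halt = cum_halt + torch.sigmoid(self.halt(x)).squeeze(-1)
                if bool((cum_halt > self.halt_threshold).all()):
                    break                      # simplified dynamic halting
            return x
    ```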

  12. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”, Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret (2020-06-29):

    Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input’s length, they are prohibitively slow for very long sequences.

    To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from 𝒪(N²) to 𝒪(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks.

    Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000× faster on autoregressive prediction of very long sequences.
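
    The 𝒪(N²) to 𝒪(N) step is just reassociating a matrix product: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so a small d×d summary φ(K)ᵀV is computed once instead of the N×N attention matrix. A single-head, unmasked sketch (using the elu+1 feature map from the paper; shapes and normalization are simplified for illustration):

    ```python
    import torch

    def feature_map(x):
        # The kernel feature map phi(x) = elu(x) + 1 used in the paper; keeps values positive.
        return torch.nn.functional.elu(x) + 1

    def softmax_attention(Q, K, V):
        # Standard attention: the N x N matrix QK^T makes this O(N^2) in sequence length.
        A = torch.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1)
        return A @ V

    def linear_attention(Q, K, V, eps=1e-6):
        # Regroup (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V): the d x d summary phi(K)^T V
        # is computed once, so the cost is O(N) in sequence length.
        Qf, Kf = feature_map(Q), feature_map(K)
        KV = Kf.transpose(-2, -1) @ V                             # (d, d) summary of the sequence
        Z = Qf @ Kf.sum(dim=-2, keepdim=True).transpose(-2, -1)   # per-query normalizer, (N, 1)
        return (Qf @ KV) / (Z + eps)

    # N = 1024 positions, d = 64 dims; both functions return (N, d), but the linear
    # version never materializes the N x N attention matrix.
    Q, K, V = (torch.randn(1024, 64) for _ in range(3))
    out = linear_attention(Q, K, V)
    ```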

  13. https://nitter.hu/bucketofkets/status/1285100951271952384

  14. https://nitter.hu/Malcolm_Ocean/status/1285099206781341696

  15. https://github.com/maraoz/gpt-scrolls/blob/master/scrolls/rephrase/conceptual-blending.txt

  16. https://nitter.hu/promptengineer/status/1338197266188951553

  17. https://nitter.hu/danielbigham/status/1292229584025464835

  18. https://github.com/Gurkenglas/gpt3coq

  19. https://nitter.hu/ak92501/status/1286122515564306432

  20. “Scarecrow: A Framework for Scrutinizing Machine Text”, Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi (2021-07-02; wikipedia, ai):

    Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.

    To facilitate research of these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow—such as redundancy, commonsense errors, and incoherence—were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema which covers the error phenomena found in real machine generated text.

    We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English language news text, amounting to over 41k spans each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2-small through the largest GPT-3-175b. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique.

    Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset at Github⁠.

    Figure 2: Average portion of tokens annotated with each span type (y-axis) across models (x-axis), with 95% confidence intervals.
    Figure 3: Average portion of tokens covered by span annotations, broken down by span type. All models, including GPT-3, use the same apples-to-apples decoding hyperparameters: top-p = 0.96, temperature = 1, and no frequency penalty. We scale each span by its token length, normalize by generation token lengths, and remove severity-1 Grammar and Usage errors (see §C).
    Figure 4: Taking the average span coverage (Figure 3) and removing reader issues (Technical Jargon and Needs Google), we plot values and 95% confidence intervals for all models, including all decoding hyperparameters we tested for GPT-3. We find a surprisingly large change in annotated errors depending on the decoding setting used.
    1. Scaling pays off to improve Encyclopedic, Commonsense, and Incoherent errors (Figure 2).

      These error categories decrease with in-domain training (Grover) and larger model size (GPT-3). Human text still shows the fewest of these kinds of errors.

    2. Scaling benefits plateau for Off-Prompt, Bad Math, and Grammar & Usage errors (Figure 2).

      These 3 error categories see a model plateau in error reduction when scaling to GPT-3. Of these error types, humans still commit fewer Off-Prompt (more: §6.1) and Grammar & Usage errors, but Bad Math appears saturated for our domain.

    3. Self-Contradiction and Redundant errors exhibit more complex scaling behavior (Figure 2).

      We roughly categorize these trends as rising and falling: increasing for medium or large-scale models, but dropping for human-authored text. Further analysis (§6.2, §6.3) reveals these more complex patterns are affected both by interactions with other error types, as well as how errors are counted.

    4. Human-authored text produces the most reader issues (Figure 2–3).

      The Needs Google and Technical Jargon span categories both have a humans-highest trend, and both fall under reader issues: problems that are not necessarily errors, but that still prevent full comprehension or factual verification of the text (more: §6.4).

      Furthermore, human-authored text is not free from error annotations (Figure 3). This can serve either as a control for baseline error rates (more: §6.6), or as a mechanism for critiquing human writing.

    5. Decoding hyperparameters have a huge impact (Figure 4).

      For the previous findings, we fix the sampling configuration for all models to an apples-to-apples setup for fair comparison: top-p = 0.96, (softmax) temperature = 1, and no frequency penalty (i.e., word repetition penalty; defined precisely in §5.2, Equation 1). To study the effects of these decoding settings, we annotate text generated by GPT-3 using a variety of values for top-p and temperature, both with and without a frequency penalty.

      To our surprise, the decoding hyperparameters considerably affected error rates (more: §6.5). As seen in Figure 4, the worst sampling procedure for GPT-3 (argmax sampling with no frequency penalty) performed even worse than GPT-2 XL. But the best sampling procedure (surprisingly, also argmax sampling, but with a frequency penalty) produced text with as few apparent SCARECROW error spans as those authored by humans (more: §6.6).
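
    A minimal sketch of the three decoding knobs compared above: temperature, nucleus (top-p) truncation, and a frequency penalty subtracted from the logits of already-generated tokens. The exact penalty formula and truncation variant used by the OpenAI API may differ; this is only meant to make the hyperparameters concrete.

    ```python
    import numpy as np

    def sample_next_token(logits, generated_counts, temperature=1.0, top_p=0.96, freq_penalty=0.0):
        """Illustrative sampler combining the three decoding knobs discussed above.
        `logits`: unnormalized scores over the vocabulary.
        `generated_counts`: how often each token has already appeared in the output."""
        logits = logits - freq_penalty * generated_counts       # frequency (repetition) penalty
        if temperature == 0:                                     # "argmax sampling"
            return int(np.argmax(logits))
        probs = np.exp((logits - logits.max()) / temperature)    # temperature-scaled softmax
        probs /= probs.sum()
        order = np.argsort(-probs)                               # nucleus (top-p) truncation
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True                                           # always keep the most likely token
        kept = order[keep]
        kept_probs = probs[kept] / probs[kept].sum()
        return int(np.random.choice(kept, p=kept_probs))
    ```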

    …We notice that a greater portion of errors in human-authored text were due to artifacts present in the text-only format of the Common Crawl. For example, links to other articles or advertisements sometimes appear in the middle of an article’s text. While annotators were quick to mark these spans, they reflect errors in formatting, not in writing. We partition these errors separately and exclude them from the subsequent calculations. GPT-3’s generations also sometimes exhibited what appeared to be formatting errors due to training on web-scraped text, though more rarely. For example, some generations contained Which? after vague noun phrases, which appear to be learned from Wikipedia, where under-specified information is tagged by an editor with this word. For fairness, we removed these errors from GPT-3’s tally as well, though they were few enough we do not plot them separately.

  21. “HTLM: Hyper-Text Pre-Training and Prompting of Language Models”, Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer (2021-07-14):

    We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.
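
    The “structured prompting” idea is easiest to see as an HTML template whose title element is left for the model to infill, turning title generation into zero-shot summarization. The mask token and markup below are illustrative guesses, not HTLM’s actual prompt format:

    ```python
    # Hypothetical HTLM-style prompt: wrap the document in HTML and leave the <title>
    # for the model to infill, so the generated title acts as a zero-shot summary.
    # The "<mask>" token and exact markup are illustrative, not HTLM's real format.
    article = "Scientists report that the new telescope has photographed ..."

    htlm_prompt = f"""<html>
      <head><title><mask></title></head>
      <body><div class="article">{article}</div></body>
    </html>"""
    # A hyper-text LM trained on web crawl should assign high probability to a short,
    # headline-like string in place of <mask>, which can be read off as a summary.
    ```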

  22. https://nitter.hu/sakun135/status/1285408650052333568

  23. https://www.eleuther.ai/

  24. https://www.eleuther.ai/projects/gpt-neo/

  25. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy (2021-01-01; wikipedia):

    Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.

    The Pile is constructed from 22 diverse high-quality subsets—many of which derive from academic or professional sources. [Pile-CC, PubMed Central, Bibliotik (Books3), OpenWebText2, arXiv, GitHub, FreeLaw, Stack Exchange, USPTO Backgrounds, PubMed Abstracts, Gutenberg (PG-19), OpenSubtitles, English Wikipedia, DeepMind Mathematics, Ubuntu IRC, BookCorpus2, EuroParl, HackerNews, YouTubeSubtitles, PhilPapers, NIH ExPorter, Enron Emails]

    Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve substantially over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations.

    Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

  26. 1981-cohen.pdf: “On Holy Wars and a Plea for Peace”, Daniel Cohen (1981-10-01; cs):

    Which bit should travel first? The bit from the big end or the bit from the little end? Can a war between Big Endians and Little Endians be avoided?

    This article was written in an attempt to stop a war. I hope it is not too late for peace to prevail again. Many believe that the central question of this war is, What is the proper byte order in messages? More specifically, the question is, Which bit should travel first-the bit from the little end of the word or the bit from the big end of the word? Followers of the former approach are called Little Endians, or Lilliputians; followers of the latter are called Big Endians, or Blefuscuians. I employ these Swiftian terms because this modern conflict is so reminiscent of the holy war described in Gulliver’s Travels.

    …To sum it all up, there are two camps, each with its own language. These languages are as compatible with each other as any Semitic and Latin languages. All Big Endians can talk only to each other. So can all the Little Endians, although there are some differences among the dialects used by different tribes. There is no middle ground—only one end can go first. As in all the religious wars of the past, power—not logic—will be the decisive factor. This is not the first holy war, and will probably not be the last. The “Reasonable, do it my way” approach does not work. Neither does the Esperanto approach of switching to yet another new language. Lest our communications world split along these lines, we should take note of a certain book (not mentioned in the references), which has an interesting story about a similar phenomenon: the Tower of Babel. Lilliput and Blefuscu will never come to terms of their own free will. We need some Gulliver between the two islands to force a unified communication regime on all of us.

    Of course, I hope that my way will be chosen, but it is not really critical. Agreement upon an order is more important than the order agreed upon.

    Shall we toss a coin?
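
    Cohen’s question (“which end of the word travels first?”) is concrete in any language that serializes integers; a small Python check of the two byte orders using the standard struct module:

    ```python
    import struct

    # The same 32-bit integer serialized under the two conventions Cohen describes.
    n = 0x0A0B0C0D
    big    = struct.pack(">I", n)   # Big Endian: most significant byte first
    little = struct.pack("<I", n)   # Little Endian: least significant byte first

    print(big.hex())     # 0a0b0c0d
    print(little.hex())  # 0d0c0b0a
    # Network protocols largely settled on "network byte order" (big-endian): the kind
    # of Gulliver-imposed agreement Cohen asks for.
    ```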

  27. Holy-wars

  28. 1997-hochreiter.pdf: “Long Short-Term Memory”, Sepp Hochreiter, Jürgen Schmidhuber (1997-12-15; ai):

    Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM).

    Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is 𝒪(1).

    Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
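
    For concreteness, one time step of an LSTM cell in the now-standard formulation (the forget gate was added after the 1997 paper), showing the multiplicative gates and the additive cell-state update, i.e. the “constant error carousel”; dimensions and weights are illustrative:

    ```python
    import numpy as np

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM step. W: (4*hidden, input_dim + hidden), b: (4*hidden,).
        The additive update of c is what lets error flow back over 1000+ steps."""
        z = W @ np.concatenate([x, h_prev]) + b
        i, f, o, g = np.split(z, 4)
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
        g = np.tanh(g)                                # candidate cell input
        c = f * c_prev + i * g                        # constant-error-carousel style update
        h = o * np.tanh(c)                            # gated, squashed output
        return h, c
    ```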

  29. https://aiweirdness.com/post/621186154843324416/all-your-questions-answered

  30. #yo-be-real

  31. “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data”, Emily M. Bender, Alexander Koller (2020-07):

    The success of the large neural language models on many NLP tasks is exciting. However, we find that these successes sometimes lead to hype in which these models are being described as “understanding” language or capturing “meaning.” In this position paper, we argue that a system trained only on form has a priori no way to learn meaning. In keeping with the ACL 2020 theme of “Taking Stock of Where We’ve Been and Where We’re Going”, we argue that a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.

    …In this paper, we have argued that in contrast to some current hype, meaning cannot be learned from form alone. This means that even large language models such as BERT do not learn “meaning”; they learn some reflection of meaning into the linguistic form which is very useful in applications. We have offered some thoughts on how to maintain a healthy, but not exaggerated, optimism with respect to research that builds upon these LMs. In particular, this paper can be seen as a call for precise language use when talking about the success of current models and for humility in dealing with natural language. With this we hope to encourage a top-down perspective on our field which we think will help us select the right hill to climb toward human-analogous NLU.

  32. #why-deep-learning-will-never-truly-x

  33. https://www.aclweb.org/anthology/2020.acl-main.463.pdf#page=13

  34. https://www.aclweb.org/anthology/2020.acl-main.463.pdf#page=14

  35. GPT-3#dare-to-be-stupid

  36. ⁠, Andrew Chamings (2021-04-15):

    The black bear thought he’d struck gold: an open door, an empty kitchen and a fridge stocked with food.

    …The 2 tiny terriers rose to the moment as if their lives, and kibble, depended on it. First Mei Mei and then Squirt slid their little souls across the kitchen tiles, launching themselves up the garden steps, bombarding the beast with barks until he fled. The young brown bear was so shaken by the might of the doggy duo he peed on the steps as he made his leave.

    The incident, on April 10, was captured on Mueller’s security cameras.

  37. https://www.nps.gov/subjects/bears/safety.htm

  38. https://www.nps.gov/subjects/bears/safety.htm#Attacks

  39. Modus

  40. https://thegradient.pub/gpt2-and-the-nature-of-intelligence/

  41. https://www.lesswrong.com/posts/L5JSMZQvkBAx9MD5A/to-what-extent-is-gpt-3-capable-of-reasoning?commentId=eq6FTwG2yWuBdPofs

  42. https://medlineplus.gov/ency/article/002498.htm

  43. https://www.technologyreview.com/2020/08/22/1007539/gpt3-openai-language-generator-artificial-intelligence-ai-opinion/

  44. ⁠, Gary Marcus, Ernest Davis (2020-08-22):

    These are the results of 157 tests run on GPT-3 in August 2020. We are extremely grateful to Douglas Summers-Stay for running the experiments.

    …Two GPT-3 hyperparameter settings were used in these experiments: “Temperature = 0”, at which setting GPT-3 deterministically returns what it considers the most probable result; and the settings that Doug considers preferable for his purposes: temperature = 0.7, top_p = 0.9, frequency_penalty = 0.5. 9 examples were run only at Temperature = 0 [BO = 1]; the rest were run at both settings…Each example is labeled with the settings at which it was run. Examples are also labeled “Success” if we consider that GPT-3’s continuation of our prompt was reasonable; “Failure” if we consider it clearly unreasonable; and “Flawed” if it is nearly correct, or barely possible but clearly suboptimal. The examples are arranged in rough categories of the domain of the reasoning that would be involved in getting the right answer. Comments on the examples are in italics.

    …we pre-tested them on the “AI Dungeon” game which is powered by some version of GPT-3, and we excluded those for which “AI Dungeon” gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.

    …A discussion of these tests may be found in our paper, “GPT-3: Bloviator”, Technology Review, August 22, 2020.
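
    For reference, the two hyperparameter settings described above map onto the 2020-era OpenAI completions API roughly as follows; the engine name, client version, and token limit are assumptions, and the prompts were simply the test items themselves:

    ```python
    import openai  # pre-1.0 Python client, as available in 2020

    prompt = "You poured yourself a glass of cranberry juice, but then absentmindedly, ..."

    # "Temperature = 0": GPT-3 deterministically returns its most probable continuation.
    deterministic = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=64, temperature=0)

    # Summers-Stay's preferred settings, per the description above.
    sampled = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=64,
        temperature=0.7, top_p=0.9, frequency_penalty=0.5)
    ```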

    Biological reasoning:

    1. "You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

      You are in the hospital now."

      Temperature = 0. Success

    2. “You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take a big gulp, then another. It tastes OK.”

      Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking bleach. Error of omission.

    …Physical reasoning:

    1. "If you break a glass bottle that holds toy soldiers, the toy soldiers will probably

      1. be angry
      2. be sad
      3. be happy
      4. be scared"

      “If you break a glass bottle that holds toy soldiers, the toy soldiers will probably be angry”

      Temperature = 0. Flawed. Note that it was GPT-3’s choice to fill this out as a multiple-choice problem; that was not part of our prompt. The answer is perhaps acceptable as a fantasy.

    2. “You are making coffee with milk and sugar. You don’t have a spoon to stir your coffee, so you stir it with a pen. But that turns out to be a bad idea, because the coffee is too hot, and the pen starts to melt.”

      Temperature = 0. Success.

  45. https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3

  46. https://nitter.hu/GaryMarcus/status/1303318742286311429

  47. https://beta.openai.com/faq

  48. https://github.com/ikreymer/cdx-index-client

  49. https://arxiv.org/pdf/2005.14165.pdf

  50. “The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence”, Gary Marcus (2020-02-14):

    Recent research in artificial intelligence and machine learning has largely emphasized general-purpose learning and ever-larger training sets and more and more compute. In contrast, I propose a hybrid, knowledge-driven, reasoning-based approach, centered around cognitive models, that could provide the substrate for a richer, more robust AI than is currently possible.

  51. https://en.wiktionary.org/wiki/Talk:pony#etymology

  52. ⁠, Nostalgebraist (2020-08-31):

    I was disappointed by Marcus’ critiques of GPT-2, but this is even worse!

    …Then we get to the individual results. It is difficult for me to read many of the authors’ assessments without picturing them as characters in a dystopian satire, administering a dreamlike and impossible “psychological examination” to our hapless protagonist…What do the authors even imagine success to be, here?

    Sometimes they deliberately describe a surreal situation, then penalize GPT-3 for continuing it in an identically surreal manner—surely the “right” answer if anything is! (“No one in a restaurant asks their neighbor to share a spoon”—yeah, and no one tries to drink soup with their eyeglasses, either!) Sometimes they provide what sounds like a de-contextualized passage from a longer narrative, then penalize GPT-3 for continuing it in a perfectly natural way that implies a broader narrative world continuing before and after the passage. (“There is no reason for your brother to look concerned.” How in the world do you know that? “The switch to the pig is a non-sequitur.” Is it? Why? “The sentence [about Moshe and ‘the spirit of the season’] is meaningless.” How can you say that when you don’t know what season it is, what its “spirit” is, who this Moshe guy is… And come on, the Janet one is a great story hook! Don’t you want to read the rest?)

    I don’t claim to be saying anything new here. Others have made the same points⁠. I’m just chiming in to… boggle at the sheer weirdness, I guess. As I said, GPT-3 comes off here like a sympathetic protagonist, and the authors as dystopian inquisitors!

  53. Lizardman-constant

  54. https://medium.com/@ElementalCognition/why-does-ai-get-so-confused-by-language-f5f64a9ef6cc

  55. “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le (2019-06-19):

    With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
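
    A toy illustration of the permutation-based factorization the abstract describes (the real model samples orders and uses two-stream attention; this merely enumerates the factorization orders for a 3-token sequence):

    ```python
    import itertools

    tokens = ["New", "York", "is"]

    # Autoregressive likelihood under every factorization order: each token eventually
    # gets conditioned on context from both sides, without BERT-style input corruption.
    for order in itertools.permutations(range(len(tokens))):
        terms = []
        for i, t in enumerate(order):
            context = ", ".join(tokens[c] for c in order[:i]) or "∅"
            terms.append(f"P({tokens[t]} | {context})")
        print(" * ".join(terms))
    # e.g. left-to-right: P(New | ∅) * P(York | New) * P(is | New, York)
    # but also orders where "York" is predicted from "is" alone, etc.
    ```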

  56. https://xkcd.com/1263/

  57. https://github.com/JackToaster/Reassuring-Parable-Generator

  58. https://nitter.hu/AmandaAskell/status/1284186919606251521

  59. https://nitter.hu/raphamilliere/status/1287047986233708546

  60. https://arr.am/2020/07/31/human-intelligence-an-ai-op-ed/

  61. http://dailynous.com/2020/07/30/philosophers-gpt-3/

  62. https://nitter.hu/nicklovescode/status/1284050958977130497

  63. https://nitter.hu/paraschopra/status/1284905727388028928

  64. https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/

  65. http://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html

  66. https://nitter.hu/M74108556/status/1287877663089074176

  67. https://nitter.hu/M74108556/status/1288088646348931073

  68. https://nitter.hu/M74108556/status/1288075689888043009

  69. https://nitter.hu/danielbigham/status/1288853412713508864

  70. https://nitter.hu/danielbigham/status/1289286439872737280

  71. https://news.ycombinator.com/item?id=23990902

  72. https://gist.github.com/blixt/b48d2d41590d9b4ad2faeb66d63c3fae

  73. https://nitter.hu/bucketofkets/status/1289077532877156353

  74. https://latitude.io/blog/how-we-accidentally-gave-our-bots-their-personalities/

  75. https://latitude.io/blog/introducing-ai-dungeon-translate/

  76. 2008-kesselman.pdf: ⁠, Rachel F. Kesselman (2008-05; statistics  /​ ​​ ​bayes):

    This research presents the findings of a study that analyzed words of estimative probability in the key judgments of National Intelligence Estimates from the 1950s through the 2000s. The research found that of the 50 words examined, only 13 were statistically-significant. Furthermore, interesting trends have emerged when the words are broken down into English modals, terminology that conveys analytical assessments and words employed by the National Intelligence Council as of 2006. One of the more intriguing findings is that use of the word “will” has by far been the most popular for analysts, registering over 700 occurrences throughout the decades; however, a word of such certainty is problematic in the sense that intelligence should never deal with 100% certitude. The relatively low occurrence and wide variety of word usage across the decades demonstrates a real lack of consistency in the way analysts have been conveying assessments over the past 58 years. Finally, the researcher suggests the Kesselman List of Estimative Words for use in the IC. The word list takes into account the literature review findings as well as the results of this study in equating odds with verbal probabilities.

    [Rachel’s lit review, for example, makes for very interesting reading. She has done a thorough search of not only the intelligence but also the business, linguistics and other literatures in order to find out how other disciplines have dealt with the problem of “What do we mean when we say something is ‘likely’…” She uncovered, for example, that, in medicine, words of estimative probability such as “likely”, “remote” and “probably” have taken on more or less fixed meanings due primarily to outside intervention or, as she put it, “legal ramifications”. Her comparative analysis of the results and approaches taken by these other disciplines is required reading for anyone in the Intelligence Community trying to understand how verbal expressions of probability are actually interpreted. The NIC’s list only became final in the last several years so it is arguable whether this list of nine words really captures the breadth of estimative word usage across the decades. Rather, it would be arguable if this chart didn’t make it crystal clear that the Intelligence Community has really relied on just two words, “probably” and “likely” to express its estimates of probabilities for the last 60 years. All other words are used rarely or not at all.

    Based on her research of what works and what doesn’t and which words seem to have the most consistent meanings to users, Rachel even offers her own list of estimative words along with their associated probabilities:

    1. Almost certain: 86–99%
    2. Highly likely: 71–85%
    3. Likely: 56–70%
    4. Chances a little better [or less] than even: 46–55%
    5. Unlikely: 31–45%
    6. Highly unlikely: 16–30%
    7. Remote: 1–15%

    ]

    [See also ⁠, Stewart et al 2006; ⁠, Budescu & Wallsten 1995.]
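
    The Kesselman List above, written out as a lookup table (ranges copied from the list; the bracketed “or less” qualifier on the near-even category is dropped for brevity):

    ```python
    # Verbal probability -> (low, high) numeric range, per the Kesselman List above.
    KESSELMAN_WEP = {
        "almost certain":                    (0.86, 0.99),
        "highly likely":                     (0.71, 0.85),
        "likely":                            (0.56, 0.70),
        "chances a little better than even": (0.46, 0.55),
        "unlikely":                          (0.31, 0.45),
        "highly unlikely":                   (0.16, 0.30),
        "remote":                            (0.01, 0.15),
    }

    def midpoint(word: str) -> float:
        lo, hi = KESSELMAN_WEP[word.lower()]
        return (lo + hi) / 2

    print(midpoint("Likely"))  # 0.63
    ```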

  77. 2020-henighan-figure31-qandamodelscaling.png

  78. “Time-Aware Language Models as Temporal Knowledge Bases”, Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, William W. Cohen (2021-06-29):

    Many facts come with an expiration date, from the name of the President to the basketball team Lebron James plays for. But language models (LMs) are trained on snapshots of data collected at a specific moment in time, and this can limit their utility, especially in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time and highlight problems with LMs at either end of the spectrum—those trained on specific slices of temporal data, as well as those trained on a wide range of temporal data. To mitigate these problems, we propose a simple technique for jointly modeling text with its timestamp. This improves memorization of seen facts from the training time period, as well as calibration on predictions about unseen facts from future time periods. We also show that models trained with temporal context can be efficiently “refreshed” as new data arrives, without the need for retraining from scratch.
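
    The “jointly modeling text with its timestamp” technique can be as simple as prepending a document’s time as a textual prefix, so the LM conditions on it during training and can be queried with a different time later. The exact prefix format below is an assumption, not necessarily the paper’s:

    ```python
    def add_time_prefix(text: str, year: int) -> str:
        # Hypothetical prefix format; the point is only that time becomes part of the input.
        return f"year: {year} text: {text}"

    train_example = add_time_prefix("LeBron James plays for the Cleveland Cavaliers.", 2016)
    query         = add_time_prefix("LeBron James plays for the", 2021)  # hope for "Lakers"
    ```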

  79. “Learning to Learn with Feedback and Local Plasticity”, Jack Lindsey, Ashok Litwin-Kumar (2020-06-16):

    Interest in biologically inspired alternatives to backpropagation is driven by the desire to both advance connections between deep learning and neuroscience and address backpropagation’s shortcomings on tasks such as online, continual learning. However, local synaptic learning rules like those employed by the brain have so far failed to match the performance of backpropagation in deep networks. In this study, we employ meta-learning to discover networks that learn using feedback connections and local, biologically inspired learning rules. Importantly, the feedback connections are not tied to the feedforward weights, avoiding biologically implausible weight transport. Our experiments show that meta-trained networks effectively use feedback connections to perform online credit assignment in multi-layer architectures. Surprisingly, this approach matches or exceeds a state-of-the-art gradient-based online meta-learning algorithm on regression and classification tasks, excelling in particular at continual learning. Analysis of the weight updates employed by these models reveals that they differ qualitatively from gradient descent in a way that reduces interference between updates. Our results suggest the existence of a class of biologically plausible learning mechanisms that not only match gradient descent-based learning, but also overcome its limitations.

  80. https://andrewmayneblog.wordpress.com/2020/06/13/openai-api-alchemy-summarization/

  81. https://old.reddit.com/r/IncreasinglyVerbose/

  82. 2010-richardson-bythenumbers-vectors30

  83. ⁠, Sandra O. (2020-09-02):

    I fed the Proverbs of Hell to GPT-3 and it generated a bunch of new ones, so here’s my top 30 gleaming shards of neo-Blakean wisdom:

    1. Age teaches everything, except to youth its own lessons.
    2. Love first, measure later.
    3. Poverty and honesty are old and honored friends.
    4. Every one that hath a mind to do well, let him labor.
    5. The young have too much time, the middle-age too little time, and the old just enough time.
    6. There is no shame in being alone.
    7. The King is as a man, and a man is no King.
    8. Who fears flies shall quickly perish in spiders.
    9. The young count years, the old count mistakes, the very old count both.
    10. Contentment is riches.
    11. The fool writes nonsense in the dust with his finger; the wise man writes lessons in the sand with his finger; the Genius writes laws in the universe with his forefinger.
    12. The one foolish doesn’t know, and the other wise doesn’t show.
    13. He is wise who is warned by both evils and advantages.
    14. The heavens are not worth the earth.
    15. What is wise in a wild beast is madness in a man.
    16. Better keep hold of Death than lose grasp on Life.
    17. The just man rules himself, but the wicked has already given up his sway.
    18. To know everything is to know nothing, to imagine things beyond belief is the true wisdom.
    19. The man with a mind for games has no mind for anything else.
    20. By desire man becomes greater than the gods.
    21. Two in distress can keep each other warm.
    22. Wag more, bark less!
    23. We never know how to value wealth until we know the person we would give it to.
    24. The miser suffers more from avarice than the poor man from poverty.
    25. Pride comes before the fall, and love before the end.
    26. The child lives in dreams, the youth in promise, the old man in memories.
    27. The sweetest honey is dried by the sun.
    28. Actions speak louder than words, but silence is often louder still.
    29. Sometimes a whisper is the clearest voice.
    30. Honest is the key to all virtues

  84. 1982-perlis.pdf: “Epigrams on Programming”, Alan J. Perlis (1982-09-01; cs):

    [130 epigrams on computer science and technology, published in 1982, for ACM’s SIGPLAN journal, by noted computer scientist and programming language researcher Alan Perlis. The epigrams are a series of short, programming-language-neutral, humorous statements about computers and programming, distilling lessons he had learned over his career, which are widely quoted.]

    8. A programming language is low level when its programs require attention to the irrelevant….19. A language that doesn’t affect the way you think about programming, is not worth knowing….54. Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.

    15. Everything should be built top-down, except the first time….30. In programming, everything we do is a special case of something more general—and often we know it too quickly….31. Simplicity does not precede complexity, but follows it….58. Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it….65. Make no mistake about it: Computers process numbers—not symbols. We measure our understanding (and control) by the extent to which we can arithmetize an activity….56. Software is under a constant tension. Being symbolic it is arbitrarily perfectible; but also it is arbitrarily changeable.

    1. One man’s constant is another man’s variable. 34. The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.

    36. The use of a program to prove the 4-color theorem will not change mathematics—it merely demonstrates that the theorem, a challenge for a century, is probably not important to mathematics.

    39. Re graphics: A picture is worth 10K words—but only those to describe the picture. Hardly any sets of 10K words can be adequately described with pictures.

    48. The best book on programming for the layman is Alice in Wonderland; but that’s because it’s the best book on anything for the layman.

    77. The cybernetic exchange between man, computer and algorithm is like a game of musical chairs: The frantic search for balance always leaves one of the 3 standing ill at ease….79. A year spent in artificial intelligence is enough to make one believe in God….84. Motto for a research laboratory: What we work on today, others will first think of tomorrow.

    91. The computer reminds one of Lon Chaney—it is the machine of a thousand faces.

    7. It is easier to write an incorrect program than understand a correct one….93. When someone says “I want a programming language in which I need only say what I wish done”, give him a lollipop….102. One can’t proceed from the informal to the formal by formal means.

    100. We will never run out of things to program as long as there is a single program around.

    108. Whenever 2 programmers meet to criticize their programs, both are silent….112. Computer Science is embarrassed by the computer….115. Most people find the concept of programming obvious, but the doing impossible. 116. You think you know when you can learn, are more sure when you can write, even more when you can teach, but certain when you can program. 117. It goes against the grain of modern education to teach children to program. What fun is there in making plans, acquiring discipline in organizing thoughts, devoting attention to detail and learning to be self-critical?

  85. https://old.reddit.com/r/MachineLearning/comments/iaitpu/d_knowledge_discovery_with_gpt3/g1onprj/

  86. https://scottaaronson.com/blog/?p=40

  87. Epigrams#umeshisms

  88. http://paulgraham.com/useful.html