In this informal article, I’ll describe the “recognition method”—a simple, powerful technique for memorization and mental calculation. Compared to traditional memorization techniques, which use elaborate encoding and visualization processes, the recognition method is easy to learn and requires relatively little effort. The method works: using it, I was able to mentally multiply two random 10-digit numbers, by the usual grade-school algorithm, on my first attempt! I have a normal, untrained memory, and the task would have been impossible by a direct approach. (I can’t claim I was speedy: I worked slowly and carefully, using about 7 hours plus rest breaks. I practiced twice with 5-digit numbers beforehand.)
…It turns out that ordinary people are incredibly good at this task [recognizing whether a photograph has been seen before]. In one of the most widely-cited studies on recognition memory, Standing showed participants an epic 10,000 photographs over the course of 5 days, with 5 seconds’ exposure per image. He then tested their familiarity, essentially as described above. The participants showed an 83% success rate, suggesting that they had become familiar with about 6,600 images during their ordeal. Other volunteers, trained on a smaller collection of 1,000 images selected for vividness, had a 94% success rate.
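Standing’s inference from the 83% success rate can be reproduced with the standard guessing correction for a two-alternative forced-choice recognition test; the correction itself is my gloss (the excerpt only reports the corrected total), as a minimal sketch:

```python
# Guessing correction for a two-alternative forced-choice test: if a
# participant truly recognizes a fraction m of images and guesses at 50%
# on the rest, observed accuracy is p = m + (1 - m)/2, so m = 2p - 1.
def images_recognized(accuracy: float, n_images: int) -> int:
    recognized_fraction = 2 * accuracy - 1
    return round(recognized_fraction * n_images)

print(images_recognized(0.83, 10_000))  # → 6600, the figure quoted above
print(images_recognized(0.94, 1_000))   # → 880
```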
1973-standing.pdf: “Learning 10000 pictures”, (1973-05-01; ):
Four experiments are reported which examined memory capacity and retrieval speed for pictures and for words. Single-trial learning tasks were employed throughout, with memory performance assessed by forced-choice recognition, recall measures or choice reaction-time tasks. The main experimental findings were: (1) memory capacity, as a function of the amount of material presented, follows a general power law with a characteristic exponent for each task; (2) pictorial material obeys this power law and shows an overall superiority to verbal material. The capacity of recognition memory for pictures is almost limitless, when measured under appropriate conditions; (3) when the recognition task is made harder by using more alternatives, memory capacity stays constant and the superiority of pictures is maintained; (4) picture memory also exceeds verbal memory in terms of verbal recall; comparable recognition/recall ratios are obtained for pictures, words and nonsense syllables; (5) verbal memory shows a higher retrieval speed than picture memory, as inferred from reaction-time measures. Both types of material obey a power law, when reaction-time is measured for various sizes of learning set, and both show very rapid rates of memory search.
From a consideration of the experimental results and other data it is concluded that the superiority of the pictorial mode in recognition and free recall learning tasks is well established and cannot be attributed to methodological artifact.
[Excerpts from If a Lion Could Talk: Animal Intelligence and the Evolution of Consciousness, Budiansky 1998 (ISBN 0684837102).]
How many of us have caught ourselves gazing into the eyes of a pet, wondering what thoughts lie behind those eyes? Or fallen into an argument over which is smarter, the dog or the cat? Scientists have conducted elaborate experiments trying to ascertain whether animals from chimps to pigeons can communicate, count, reason, or even lie. So does science tell us what we assume—that animals are pretty much like us, only not as smart? Simply, no. Now, in this superb book, Stephen Budiansky poses the fundamental question: “What is intelligence?” His answer takes us on the ultimate wildlife adventure to animal consciousness. Budiansky begins by exposing our tendency to see ourselves in animals. Our anthropomorphism allows us to perceive intelligence only in behavior that mimics our own. This prejudice, he argues, betrays a lack of imagination. Each species is so specialized that most of their abilities are simply not comparable. At the mercy of our anthropomorphic tendencies, we continue to puzzle over pointless issues like whether a wing or an arm is better, or whether night vision is better than day vision, rather than discovering the real world of a winged nighthawk, a thoroughbred horse, or an African lion. Budiansky investigates the sometimes bizarre research behind animal intelligence experiments: from horses who can count or ace history quizzes, and primates who seem fluent in sign language, to rats who seem to have become self-aware, he reveals that often these animals are responding to our tiny unconscious cues. And, while critically discussing scientists’ interpretations of animal intelligence, he is able to lay out their discoveries in terms of what we know about ourselves. For instance, by putting you in the minds of dogs or bees who travel by dead reckoning, he demonstrates that this is also how you find your way down a familiar street with almost no conscious awareness of your navigation system. 
Modern cognitive science and the new science of evolutionary ecology are beginning to show that thinking in animals is tremendously complex and wonderful in its variety. A pigeon’s ability to find its way home from almost anywhere has little to do with comparative intelligence; rather it is due to the pigeon’s very different perception of the world. That’s why, as Wittgenstein said, “If a lion could talk, we would not understand him.” In this fascinating book, Budiansky frees us from the shackles of our ideas about the natural world, and opens a window to the astounding worlds of the animals that surround us.
The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome—78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.
“Genome Graphs and the Evolution of Genome Inference”, (2017-03-14):
The human reference genome is part of the foundation of modern human biology, and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph based models. Here, we survey various projects underway to build and apply these graph based structures—which we collectively refer to as genome graphs—and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
2015-polderman.pdf: “Meta-analysis of the heritability of human traits based on fifty years of twin studies”, (2015-05-18; ):
Despite a century of research on complex traits in humans, the relative importance and specific nature of the influences of genes and environment on human traits remain controversial. We report a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all published twin studies of complex traits. Estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. For a majority (69%) of traits, the observed twin correlations are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation. The data are inconsistent with substantial influences from shared environment or non-additive genetic variation. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All the results can be visualized using the MaTCH webtool.
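The abstract’s headline numbers can be related by the classical ACE variance decomposition from twin correlations (Falconer’s formula). This is an illustrative back-of-envelope sketch, not the meta-analysis’s actual estimation model; the correlations below are chosen to match the reported 49% average heritability and the pure-additive pattern (rMZ = 2·rDZ):

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Classical twin-study decomposition under an additive (ACE) model:
    MZ twins share ~100% of segregating genes, DZ twins ~50%, so
    h2 (additive genetic)    = 2 * (r_mz - r_dz)
    c2 (shared environment)  = 2 * r_dz - r_mz
    e2 (unique environment)  = 1 - r_mz"""
    return {
        "h2": 2 * (r_mz - r_dz),
        "c2": 2 * r_dz - r_mz,
        "e2": 1 - r_mz,
    }

# Illustrative correlations: with r_mz exactly twice r_dz, the shared-
# environment component vanishes, as the abstract reports for most traits.
print(falconer_ace(r_mz=0.49, r_dz=0.245))
```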
“Phenome-wide Heritability Analysis of the UK Biobank”, (2016-08-18):
Heritability estimation provides important information about the relative contribution of genetic and environmental factors to phenotypic variation, and provides an upper bound for the utility of genetic risk prediction models. Recent technological and statistical advances have enabled the estimation of additive heritability attributable to common genetic variants (SNP heritability) across a broad phenotypic spectrum. However, assessing the comparative heritability of multiple traits estimated in different cohorts may be misleading due to the population-specific nature of heritability. Here we report the SNP heritability for 551 complex traits derived from the large-scale, population-based UK Biobank, comprising both quantitative phenotypes and disease codes, and examine the moderating effect of three major demographic variables (age, sex and socioeconomic status) on the heritability estimates. Our study represents the first comprehensive phenome-wide heritability analysis in the UK Biobank, and underscores the importance of considering population characteristics in comparing and interpreting heritability.
Psychological studies have shown that personality traits are associated with book preferences. However, past findings are based on questionnaires focusing on conventional book genres and are unrepresentative of niche content. For a more comprehensive measure of book content, this study harnesses a massive archive of content labels, also known as ‘tags’, created by users of an online book catalogue, Goodreads.com. Combined with data on preferences and personality scores collected from Facebook users, the tag labels achieve high accuracy in personality prediction by psychological standards. We also group tags into broader genres, to check their validity against past findings. Our results are robust across both tag and genre levels of analyses, and consistent with existing literature. Moreover, user-generated tag labels reveal unexpected insights, such as cultural differences, book reading behaviors, and other non-content factors affecting preferences. To our knowledge, this is currently the largest study that explores the relationship between personality and book content preferences.
The proliferation of media streaming services has increased the volume of personalized video consumption, allowing marketers to reach massive audiences and deliver a range of customized content at scale. However, relatively little is known about how consumers’ psychological makeup is manifested in their media preferences. The present paper addresses this gap in a preregistered study of the relationship between movie plots, quantified via user-generated keywords, and the aggregate personality profiles of those who “like” them on social media. We find that movie plots can be used to accurately predict aggregate fans’ personalities, above and beyond the demographic characteristics of fans, and general film characteristics such as quality, popularity, and genre. Further analysis reveals various associations between the movies’ psychological themes and their fans’ personalities, indicating congruence between the two. For example, films with keywords related to anxiety are liked more among people who are high in Neuroticism and low in Extraversion. In contrast, angry and violent movies are liked more by people who are low in Agreeableness. Our findings provide a fine-grained mapping between personality dimensions and preferences for media content, and demonstrate how these links can be leveraged for assessing audience psychographics at scale.
Modern society depends on the flow of information over online social networks, and users of popular platforms generate significant behavioral data about themselves and their social ties. However, it remains unclear what fundamental limits exist when using these data to predict the activities and interests of individuals, and to what accuracy such predictions can be made using an individual’s social ties. Here we show that 95% of the potential predictive accuracy for an individual is achievable using their social ties only, without requiring that individual’s data. We use information theoretic tools to estimate the predictive information within the writings of Twitter users, providing an upper bound on the available predictive information that holds for any predictive or machine learning methods. As few as 8–9 of an individual’s contacts are sufficient to obtain predictability comparable to that of the individual alone. Distinct temporal and social effects are visible by measuring information flow along social ties, allowing us to better study the dynamics of online activity. Our results have distinct privacy implications: information is so strongly embedded in a social network that in principle one can profile an individual from their available social ties even when the individual forgoes the platform completely.
“Inferring Human Traits From Facebook Statuses”, (2018-05-22):
This paper explores the use of language models to predict 20 human traits from users’ Facebook status updates. The data was collected by the myPersonality project, and includes user statuses along with their personality, gender, political identification, religion, race, satisfaction with life, IQ, self-disclosure, fair-mindedness, and belief in astrology. A single interpretable model meets state of the art results for well-studied tasks such as predicting gender and personality; and sets the standard on other traits such as IQ, sensational interests, political identity, and satisfaction with life. Additionally, highly weighted words are published for each trait. These lists are valuable for creating hypotheses about human behavior, as well as for understanding what information a model is extracting. Using performance and extracted features we analyze models built on social media. The real world problems we explore include gendered classification bias and Cambridge Analytica’s use of psychographic models.
The understanding, quantification and evaluation of individual differences in behavior, feelings and thoughts have always been central topics in psychological science. An enormous amount of previous work on individual differences in behavior is exclusively based on data from self-report questionnaires. To date, little is known about how individuals actually differ in their objectively quantifiable behaviors and how differences in these behaviors relate to big five personality traits. Technological advances in mobile computer and sensing technology have now created the possibility to automatically record large amounts of data about humans’ natural behavior. The collection and analysis of these records makes it possible to analyze and quantify behavioral differences at unprecedented scale and efficiency. In this study, we analyzed behavioral data obtained from 743 participants in 30 consecutive days of smartphone sensing (25,347,089 logging-events). We computed variables (15,692) about individual behavior from five semantic categories (communication & social behavior, music listening behavior, app usage behavior, mobility, and general daytime & nighttime activity). Using a machine learning approach (random forest, elastic net), we show how these variables can be used to predict self-assessments of the big five personality traits at the factor and facet level. Our results reveal distinct behavioral patterns that proved to be differentially-predictive of big five personality traits. Overall, this paper shows how a combination of rich behavioral data obtained with smartphone sensing and the use of machine learning techniques can help to advance personality research and can inform both practitioners and researchers about the different behavioral patterns of personality.
2019-gladstone.pdf: “Can Psychological Traits Be Inferred From Spending? Evidence From Transaction Data”, (2019-01-01; ):
The automatic assessment of psychological traits from digital footprints allows researchers to study psychological traits at unprecedented scale and in settings of high ecological validity. In this research, we investigated whether spending records—a ubiquitous and universal form of digital footprint—can be used to infer psychological traits. We applied an ensemble machine-learning technique (random-forest modeling) to a data set combining two million spending records from bank accounts with survey responses from the account holders (n = 2,193). Our predictive accuracies were modest for the Big Five personality traits (r = 0.15, corrected ρ = 0.21) but provided higher precision for specific traits, including materialism (r = 0.33, corrected ρ = 0.42). We compared the predictive accuracy of these models with the predictive accuracy of alternative digital behaviors used in past research, including those observed on social media platforms, and we found that the predictive accuracies were relatively stable across socioeconomic groups and over time.
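The paired “r / corrected ρ” values are consistent with Spearman’s classical correction for attenuation, ρ = r / √(relₓ·rel_y). That this is the correction the paper used is my assumption, and the reliability value below is reverse-engineered purely for illustration:

```python
import math

def disattenuate(r_observed: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Spearman's correction for attenuation: the estimated correlation
    between the underlying true scores, given the reliabilities of the
    two measures, is r / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)

# With an (illustrative) reliability of ~0.51 for the trait measure,
# the observed r = 0.15 disattenuates to roughly the reported 0.21:
print(round(disattenuate(0.15, 0.51), 2))  # → 0.21
```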
1978-cover.pdf: “A Convergent Gambling Estimate Of The Entropy Of English”, (1978-07-01; ):
In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes.
Shannon’s technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If Sn denotes the subject’s capital after n bets at 27 for 1 odds, and if it is assumed that the subject knows the underlying probability distribution for the process X, then the entropy estimate is Ĥn(X) = (1 − (1⁄n) log27 Sn) log2 27 bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then Ĥn(X) is an asymptotic upper bound for the true entropy. If X is stationary, EĤn(X) → H(X), H(X) being the true entropy of the process. Moreover, if X is ergodic, then by the Shannon-McMillan-Breiman theorem, Ĥn(X) → H(X) with probability one.
Preliminary indications are that English text has an entropy of approximately 1.3 bits/symbol, which agrees well with Shannon’s estimate.
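The estimator is simple to evaluate: Ĥn(X) = (1 − (1⁄n) log27 Sn) log2 27. A minimal sketch, in which the doubling scenario is an invented illustration rather than data from the paper:

```python
import math

def gambling_entropy_estimate(capital: float, n_bets: int,
                              alphabet: int = 27) -> float:
    """Cover & King's estimator: after n sequential bets at alphabet-for-1
    odds, starting from capital 1, the entropy estimate in bits/symbol is
    (1 - (1/n) * log_alphabet(capital)) * log2(alphabet)."""
    return (1 - math.log(capital, alphabet) / n_bets) * math.log2(alphabet)

# A subject who merely keeps their capital at 1 learns nothing beyond the
# uniform bound log2(27) ≈ 4.75 bits/symbol; one who doubles their capital
# on every symbol pins the estimate at log2(27) - 1 ≈ 3.75 bits/symbol.
print(gambling_entropy_estimate(capital=1.0, n_bets=50))
print(gambling_entropy_estimate(capital=2**100, n_bets=100))
```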
2002-behr.pdf: “Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression”, (2002; ):
Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages. Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect that translated documents would compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. We suggest as an application of this finding using the size of compressed natural language texts as a means of automatically testing translation quality.
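The compression-based bound is easy to reproduce in outline. Python’s standard library has no PPM codec, so this sketch substitutes LZMA, which gives a looser but analogous upper bound on bits per character:

```python
import lzma

def entropy_upper_bound(text: str) -> float:
    """Upper-bound the entropy (bits/character) of a text by its compressed
    size; any general-purpose lossless compressor yields a valid, if loose,
    upper bound. The paper uses PPM; LZMA stands in for it here."""
    compressed = lzma.compress(text.encode("utf-8"), preset=9)
    return 8 * len(compressed) / len(text)

# Highly repetitive text should bound far below 8 bits/character; real
# English prose typically lands in the low single digits.
sample = "the quick brown fox jumps over the lazy dog " * 500
print(f"{entropy_upper_bound(sample):.2f} bits/character")
```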
We argue that Non-sequential Recursive Pair Substitution (NSRPS) as suggested by Jiménez-Montaño and Ebeling can indeed be used as a basis for an optimal data compression algorithm.
In particular, we prove for Markov sequences that NSRPS together with suitable codings of the substitutions and of the substitute series does not lead to a code length increase, in the limit of infinite sequence length. When applied to written English, NSRPS gives estimates which are very close to those obtained by other methods.
Using ca. 135 GB of input data from Project Gutenberg, we estimate the effective entropy to be ~1.82 bits/character. Extrapolating to infinitely long input, the true value of the entropy is estimated as ~0.8 bits/character.
Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.
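The paper’s central quantity is just the product of two measured properties: information rate (bits/s) = information density (bits/syllable) × speech rate (syllables/s). With illustrative round numbers (not the paper’s measured values), a fast language with information-sparse syllables and a slow language with information-dense syllables land at the same rate:

```python
# Illustrative round numbers only: the trade-off between information per
# syllable and syllables per second yields similar bits per second.
languages = {
    "fast, sparse syllables": {"bits_per_syllable": 5.0, "syllables_per_sec": 7.8},
    "slow, dense syllables":  {"bits_per_syllable": 7.8, "syllables_per_sec": 5.0},
}
for name, v in languages.items():
    rate = v["bits_per_syllable"] * v["syllables_per_sec"]
    print(f"{name}: {rate:.0f} bits/s")  # both print 39 bits/s
```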
“Universal Entropy of Word Ordering Across Linguistic Families”, (2011-04-19):
Background: The language faculty is probably the most distinctive feature of our species, and endows us with a unique ability to exchange highly structured information. In written language, information is encoded by the concatenation of basic symbols under grammatical and semantic constraints. As is also the case in other natural information carriers, the resulting symbolic sequences show a delicate balance between order and disorder. That balance is determined by the interplay between the diversity of symbols and by their specific ordering in the sequences. Here we used entropy to quantify the contribution of different organizational levels to the overall statistical structure of language.
Methodology/Principal Findings: We computed a relative entropy measure to quantify the degree of ordering in word sequences from languages belonging to several linguistic families. While a direct estimation of the overall entropy of language yielded values that varied for the different families considered, the relative entropy quantifying word ordering presented an almost constant value for all those families.
Conclusion: Our results indicate that despite the differences in the structure and vocabulary of the languages analyzed, the impact of word ordering in the structure of language is a statistical linguistic universal.
2011-pellegrino.pdf: “A Cross-Language Perspective On Speech Information Rate”, (2011-09; ):
This article is a cross-linguistic investigation of the hypothesis that the average information rate conveyed during speech communication results from a trade-off between average information density and speech rate. The study, based on seven languages, shows a negative correlation between density and rate, indicating the existence of several encoding strategies. However, these strategies do not necessarily lead to a constant information rate. These results are further investigated in relation to the notion of syllabic complexity.
The emergence of a complex language is one of the fundamental events of human evolution, and several remarkable features suggest the presence of fundamental principles of organization. These principles seem to be common to all languages. The best known is the so-called Zipf’s law, which states that the frequency of a word decays as a (universal) power function of its rank. The possible origins of this law have been controversial, and its meaningfulness is still an open question. In this article, the early hypothesis of Zipf of a principle of least effort for explaining the law is shown to be sound. Simultaneous minimization in the effort of both hearer and speaker is formalized with a simple optimization process operating on a binary matrix of signal-object associations. Zipf’s law is found in the transition between referentially useless systems and indexical reference systems. Our finding strongly suggests that Zipf’s law is a hallmark of symbolic reference and not a meaningless feature. The implications for the evolution of language are discussed. We explain how language evolution can take advantage of a communicative phase transition.
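Zipf’s law is easy to check empirically: rank the word frequencies of a sizable corpus and the product rank × frequency stays roughly constant across the top ranks. The function below is a generic check of that pattern, not the paper’s optimization model:

```python
from collections import Counter

def zipf_table(text: str, top: int = 10) -> list:
    """Return (rank, frequency, rank * frequency) for the `top` most common
    words; under Zipf's law the third column is roughly constant."""
    ranked = sorted(Counter(text.lower().split()).values(), reverse=True)
    return [(r, f, r * f) for r, f in enumerate(ranked[:top], start=1)]

# A toy string is too small for the law to show; run this on a large
# plain-text corpus to see rank * frequency flatten out.
print(zipf_table("to be or not to be"))
# → [(1, 2, 2), (2, 2, 4), (3, 1, 3), (4, 1, 4)]
```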