Skip to main content

statistics/​comparison directory


“PiRank: Learning To Rank via Differentiable Sorting”, Swezey et al 2020

“PiRank: Learning To Rank via Differentiable Sorting”⁠, Robin Swezey, Aditya Grover, Bruno Charron, Stefano Ermon (2020-12-12; ; backlinks; similar):

A key challenge with machine learning approaches for ranking is the gap between the performance metrics of interest and the surrogate loss functions that can be optimized with gradient-based methods. This gap arises because ranking metrics typically involve a sorting operation which is not differentiable w.r.t. the model parameters. Prior works have proposed surrogates that are loosely related to ranking metrics or simple smoothed versions thereof. We propose PiRank, a new class of differentiable surrogates for ranking, which employ a continuous, temperature-controlled relaxation to the sorting operator. We show that PiRank exactly recovers the desired metrics in the limit of zero temperature and scales favorably with the problem size, both in theory and practice. Empirically, we demonstrate that PiRank significantly improves over existing approaches on publicly available internet-scale learning-to-rank benchmarks.

“Rank-Smoothed Pairwise Learning In Perceptual Quality Assessment”, Talebi et al 2020

2020-talebi.pdf#google: “Rank-Smoothed Pairwise Learning In Perceptual Quality Assessment”⁠, Hossein Talebi, Ehsan Amid, Peyman Milanfar, Manfred K. Warmuth (2020-09-30; ; backlinks; similar):

Conducting pairwise comparisons is a widely used approach in curating human perceptual preference data. Typically raters are instructed to make their choices according to a specific set of rules that address certain dimensions of image quality and aesthetics. The outcome of this process is a dataset of sampled image pairs with their associated empirical preference probabilities. Training a model on these pairwise preferences is a common deep learning approach. However, optimizing by gradient descent through mini-batch learning means that the “global” ranking of the images is not explicitly taken into account. In other words, each step of the gradient descent relies only on a limited number of pairwise comparisons. In this work, we demonstrate that regularizing the pairwise empirical probabilities with aggregated rankwise probabilities leads to a more reliable training loss. We show that training a deep image quality assessment model with our rank-smoothed loss consistently improves the accuracy of predicting human preferences.

“Self-Play Learning Without a Reward Metric”, Schmidt et al 2019

“Self-Play Learning Without a Reward Metric”⁠, Dan Schmidt, Nick Moran, Jonathan S. Rosenfeld, Jonathan Rosenthal, Jonathan Yedidia (2019-12-16; ⁠, ; backlinks; similar):

The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.

“GPT-2 Preference Learning for Music Generation”, Branwen 2019

GPT-2-preference-learning: “GPT-2 Preference Learning for Music Generation”⁠, Gwern Branwen (2019-12-16; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.

Standard language generation neural network models, like GPT-2⁠, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.

A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO⁠.

OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation⁠, I experimented with GPT-2 preference learning for unconditional music and poetry generation.

I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≈ 7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n ≈ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 46%).

This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the former, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.

Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output⁠.

“How Should We Critique Research?”, Branwen 2019

Research-criticism: “How Should We Critique Research?”⁠, Gwern Branwen (2019-05-19; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.

Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of meta-science have demonstrated that much research is of low quality and often false.

But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?

I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.

This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.

If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.

“Group Testing: An Information Theory Perspective”, Aldridge et al 2019

2019-aldridge.pdf: “Group Testing: An Information Theory Perspective”⁠, Aldridge, Matthew, Johnson, Oliver, Scarlett, Jonathan (2019-01-01)

“Open Questions”, Branwen 2018

Questions: “Open Questions”⁠, Gwern Branwen (2018-10-17; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Some anomalies/​questions which are not necessarily important, but do puzzle me or where I find existing explanations to be unsatisfying.

? ? ?

A list of some questions which are not necessarily important, but do puzzle me or where I find existing ‘answers’ to be unsatisfying, categorized by subject (along the lines of Patrick Collison’s list & Alex Guzey⁠; see also my list of project ideas).

“OptionGAN: Learning Joint Reward-Policy Options Using Generative Adversarial Inverse Reinforcement Learning”, Henderson et al 2017

“OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning”⁠, Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, Doina Precup (2017-09-20; ⁠, ; backlinks; similar):

Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers an useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.

“Analogical-based Bayesian Optimization”, Le et al 2017

“Analogical-based Bayesian Optimization”⁠, Trung Le, Khanh Nguyen, Tu Dinh Nguyen, Dinh Phung (2017-09-19; ; backlinks; similar):

Some real-world problems revolve to solve the optimization problem maxx ∈ 𝓍f(x) where f(.) is a black-box function and X might be the set of non-vectorial objects (eg. distributions) where we can only define a symmetric and non-negative similarity score on it. This setting requires a novel view for the standard framework of Bayesian Optimization that generalizes the core insightful spirit of this framework.

With this spirit, in this paper, we propose Analogical-based Bayesian Optimization that can maximize black-box function over a domain where only a similarity score can be defined. Our pathway is as follows: we first base on the geometric view of Gaussian Processes (GP) to define the concept of influence level that allows us to analytically represent predictive means and variances of GP posteriors and base on that view to enable replacing kernel similarity by a more genetic similarity score. Furthermore, we also propose two strategies to find a batch of query points that can efficiently handle high dimensional data.

“Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking”, Chen et al 2017

“Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking”⁠, Yuxin Chen, Jianqing Fan, Cong Ma, Kaizheng Wang (2017-07-31; backlinks; similar):

This paper is concerned with the problem of top-K ranking from pairwise comparisons. Given a collection of n items and a few pairwise comparisons across them, one wishes to identify the set of K items that receive the highest ranks.

To tackle this problem, we adopt the logistic parametric model—the Bradley-Terry-Luce model, where each item is assigned a latent preference score, and where the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made substantial progress towards characterizing the performance (eg. the mean square error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand regarding top-K ranking remains unsettled.

We demonstrate that under a natural random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity—the number of paired comparisons needed to ensure exact top-K identification, for the fixed dynamic range regime. This is accomplished via optimal control of the entrywise error of the score estimates.

We complement our theoretical studies by numerical experiments, confirming that both methods yield low entrywise errors for estimating the underlying scores. Our theory is established via a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan sinΘ theorem for symmetric matrices. This also allows us to close the gap between the 𝓁2 error upper bound for the spectral method and the minimax lower limit.

“Deep Reinforcement Learning from Human Preferences”, Christiano et al 2017

“Deep reinforcement learning from human preferences”⁠, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei (2017-06-12; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.

“D-TS: Double Thompson Sampling for Dueling Bandits”, Wu & Liu 2016

“D-TS: Double Thompson Sampling for Dueling Bandits”⁠, Huasen Wu, Xin Liu (2016-04-25; ; backlinks; similar):

In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. As indicated by its name, D-TS selects both the first and the second candidates according to Thompson Sampling⁠.

Specifically, D-TS maintains a posterior distribution for the preference matrix, and chooses the pair of arms for comparison by sampling twice from the posterior distribution. This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as its special case.

For general Copeland dueling bandits, we show that D-TS achieves \U0001D442(K2 log T) regret⁠. For Condorcet dueling bandits, we further simplify the D-TS algorithm and show that the simplified D-TS algorithm achieves \U0001D442(K log T + K2 log log T) regret. Simulation results based on both synthetic and real-world data demonstrate the efficiency of the proposed D-TS algorithm.

“Resorting Media Ratings”, Branwen 2015

Resorter: “Resorting Media Ratings”⁠, Gwern Branwen (2015-09-07; ⁠, ; backlinks; similar):

Commandline tool providing interactive statistical pairwise ranking and sorting of items

User-created datasets using ordinal scales (such as media ratings) have tendencies to drift or ‘clump’ towards the extremes and fail to be informative as possible, falling prey to ceiling effects and making it difficult to distinguish between the mediocre & excellent.

This can be counteracted by rerating the dataset to create an uniform (and hence, informative) distribution of ratings—but such manual rerating is difficult.

I provide an anytime CLI program, resorter, written in R (should be cross-platform but only tested on Linux) which keeps track of comparisons, infers underlying ratings assuming that they are noisy in the ELO-like Bradley-Terry model⁠, and interactively & intelligently queries the user with comparisons of the media with the most uncertain current ratings, until the user ends the session and a fully rescaled set of ratings are output.

“Just Sort It! A Simple and Effective Approach to Active Preference Learning”, Maystre & Grossglauser 2015

“Just Sort It! A Simple and Effective Approach to Active Preference Learning”⁠, Lucas Maystre, Matthias Grossglauser (2015-02-19; ; backlinks; similar):

We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking?

We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.

“On the Complexity of Best Arm Identification in Multi-Armed Bandit Models”, Kaufmann et al 2014

“On the Complexity of Best Arm Identification in Multi-Armed Bandit Models”⁠, Emilie Kaufmann, Olivier Cappé, Aurélien Garivier (2014-07-16; ; backlinks; similar):

The stochastic multi-armed bandit model is a simple abstraction that has proven useful in many different contexts in statistics and machine learning. Whereas the achievable limit in terms of regret minimization is now well known, our aim is to contribute to a better understanding of the performance in terms of identifying the m best arms. We introduce generic notions of complexity for the two dominant frameworks considered in the literature: fixed-budget and fixed-confidence settings.

In the fixed-confidence setting, we provide the first known distribution-dependent lower bound on the complexity that involves information-theoretic quantities and holds when m is larger than 1 under general assumptions. In the specific case of two armed-bandits, we derive refined lower bounds in both the fixed-confidence and fixed-budget settings, along with matching algorithms for Gaussian and Bernoulli bandit models. These results show in particular that the complexity of the fixed-budget setting may be smaller than the complexity of the fixed-confidence setting, contradicting the familiar behavior observed when testing fully specified alternatives.

In addition, we also provide improved sequential stopping rules that have guaranteed error probabilities and shorter average running times. The proofs rely on two technical results that are of independent interest: a deviation lemma for self-normalized sums (Lemma 19) and a novel change of measure inequality for bandit models (Lemma 1).

“Bayesian Active Learning for Classification and Preference Learning”, Houlsby et al 2011

“Bayesian Active Learning for Classification and Preference Learning”⁠, Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, Máté Lengyel (2011-12-24; ⁠, ; backlinks; similar):

Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.

“Tea Reviews”, Branwen 2011

Tea: “Tea Reviews”⁠, Gwern Branwen (2011-04-13; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Teas I have drunk, with reviews and future purchases; focused primarily on oolongs and greens. Plus experiments on water.

Electric kettles are faster, but I was curious how much faster my electric kettle heated water to high or boiling temperatures than does my stove-top kettle. So I collected some data and compared them directly, trying out a number of statistical methods (principally: nonparametric & parametric tests of difference, linear & beta regression models, and a Bayesian measurement error model). My electric kettle is faster than the stove-top kettle (the difference is both statistically-significant p≪0.01 & the posterior probability of difference is P ≈ 1), and the modeling suggests time to boil is largely predictable from a combination of volume, end-temperature, and kettle type.

“Case Studies in Bayesian Computation Using INLA”, Martino & Rue 2010

2010-martino.pdf: “Case studies in Bayesian computation using INLA”⁠, Sara Martino, Håvard Rue (2010; backlinks; similar):

Latent Gaussian models are a common construct in statistical applications where a latent Gaussian field⁠, indirectly observed through data, is used to model, for instance, time and space dependence or the smooth effect of covariates. Many well-known statistical models, such as smoothing-spline models, space time models, semiparametric regression⁠, spatial and spatio-temporal models, log-Gaussian Cox models, and geostatistical models are latent Gaussian models.

Integrated Nested Laplace approximation (INLA) is a new approach to implement Bayesian inference for such models. It provides approximations of the posterior marginals of the latent variables which are both very accurate and extremely fast to compute. Moreover, INLA treats latent Gaussian models in a general way, thus allowing for a great deal of automation in the inferential procedure. The inla programme, bundled in the R library INLA⁠, is a prototype of such black-box for inference on latent Gaussian models which is both flexible and user-friendly. It is meant to, hopefully, make latent Gaussian models applicable, useful and appealing for a larger class of users.

[Keywords: approximate Bayesian inference, latent Gaussian model, Laplace approximations, structured additive regression models]

“Sorting from Noisy Information”, Braverman & Mossel 2009

“Sorting from Noisy Information”⁠, Mark Braverman, Elchanan Mossel (2009-10-07; backlinks; similar):

This paper studies problems of inferring order given noisy information. In these problems there is an unknown order (permutation) π on n elements denoted by 1,…,n. We assume that information is generated in a way correlated with π. The goal is to find a maximum likelihood π given the information observed. We will consider two different types of observations: noisy comparisons and noisy orders. The data in Noisy orders are permutations given from an exponential distribution correlated with π (this is also called the Mallow’s model). The data in Noisy Comparisons is a signal given for each pair of elements which is correlated with their true ordering.

In this paper we present polynomial time algorithms for solving both problems with high probability. As part of our proof we show that for both models the maximum likelihood solution π is close to the original permutation π.

Our results are of interest in applications to ranking, such as ranking in sports, or ranking of search items based on comparisons by experts.

“Can People Distinguish Pâté From Dog Food?”, Bohannon et al 2009

2009-bohannon.pdf: “Can People Distinguish Pâté From Dog Food?”⁠, John Bohannon, Robin Goldstein, Alexis Herschkowitsch (2009-04-01; ⁠, ; backlinks; similar):

Considering the similarity of its ingredients, canned dog food could be a suitable and inexpensive substitute for pâté or processed blended meat products such as Spam or liverwurst⁠. However, the social stigma associated with the human consumption of pet food makes an unbiased comparison challenging.

To prevent bias, Newman’s Own dog food was prepared with a food processor to have the texture and appearance of a liver mousse. In a double-blind test, subjects were presented with 5 unlabeled blended meat products, one of which was the prepared dog food. After ranking the samples on the basis of taste, subjects were challenged to identify which of the 5 was dog food.

Although 72% of subjects ranked the dog food as the worst of the 5 samples in terms of taste (Newell and MacFarlane multiple comparison⁠, p < 0.05), subjects were not better than random at correctly identifying the dog food.

[Popularizations: Bohannon 2009⁠, Bohannon et al 2010⁠.]

“Aggregating Inconsistent Information: Ranking and Clustering”, Ailon et al 2008

2008-ailon.pdf: “Aggregating inconsistent information: Ranking and clustering”⁠, Nir Ailon, Moses Charikar, Alantha Newman (2008-11; backlinks; similar):

We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the extent of disagreement with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.

“Pure Exploration for Multi-Armed Bandit Problems”, Bubeck et al 2008

“Pure Exploration for Multi-Armed Bandit Problems”⁠, Sébastien Bubeck, Rémi Munos, Gilles Stoltz (2008-02-19; ; backlinks; similar):

We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.

“Do More Expensive Wines Taste Better? Evidence from a Large Sample of Blind Tastings”, Goldstein et al 2008

2008-goldstein.pdf: “Do More Expensive Wines Taste Better? Evidence from a Large Sample of Blind Tastings”⁠, Robin Goldstein, Johan Almenberg, Anna Dreber, John W. Emerson, Alexis Herschkowitsch, Jacob Katz (2008; ; backlinks; similar):

Individuals who are unaware of the price do not derive more enjoyment from more expensive wine.

In a sample of more than 6,000 blind tastings, we find that the correlation between price and overall rating is small and negative, suggesting that individuals on average enjoy more expensive wines slightly less. For individuals with wine training, however, we find indications of a non-negative relationship between price and enjoyment. Our results are robust to the inclusion of individual fixed effects, and are not driven by outliers: when omitting the top and bottom deciles of the price distribution, our qualitative results are strengthened, and the statistical-significance is improved further.

These findings suggest that non-expert wine consumers should not anticipate greater enjoyment of the intrinsic qualities of a wine simply because it is expensive or is appreciated by experts.

“Noisy Sorting Without Resampling”, Braverman & Mossel 2007

“Noisy Sorting Without Resampling”⁠, Mark Braverman, Elchanan Mossel (2007-07-06; backlinks; similar):

In this paper we study noisy sorting without re-sampling. In this problem there is an unknown order aπ(1) < … < aπ(n) where π is a permutation on n elements. The input is the status of ([^*n*^~2~]{.supsub}) queries of the form q(ai, xj), where q(ai, aj) = + with probability at least 1⁄2 +γ if π(i) > π(j) for all pairs ij, where γ > 0 is a constant and q(ai, aj) = −q(aj, ai) for all i and j. It is assumed that the errors are independent. Given the status of the queries the goal is to find the maximum likelihood order. In other words, the goal is find a permutation σ that minimizes the number of pairs σ(i) > σ(j) where q(σ(i), σ(j)) = −. The problem so defined is the feedback arc set problem on distributions of inputs, each of which is a tournament obtained as a noisy perturbations of a linear order. Note that when γ < 1⁄2 and n is large, it is impossible to recover the original order π.

It is known that the weighted feedback arc set problem on tournaments is NP-hard in general. Here we present an algorithm of running time n𝒪 (γ−4)> and sampling complexity Oγ(n log n) that with high probability solves the noisy sorting without re-sampling problem. We also show that if aσ(1), aσ(2), …, aσ(n) is an optimal solution of the problem then it is “close” to the original order. More formally, with high probability it holds that ∑i|σ(i) − π(i)| = Θ(n) and maxi|σ(i) − π(i)| = Θ(log n).

Our results are of interest in applications to ranking, such as ranking in sports, or ranking of search items based on comparisons by experts.

“Sympercents: Symmetric Percentage Differences on the 100 Loge Scale Simplify the Presentation of Log Transformed Data”, Cole 2000

2000-cole.pdf: “Sympercents: symmetric percentage differences on the 100 loge scale simplify the presentation of log transformed data”⁠, T. J. Cole (2000-11-08; similar):

The results of analyses on log transformed data are usually back-transformed and interpreted on the original scale.

Yet if natural logs are used this is not necessary—the log scale can be interpreted as it stands. A difference of natural logs corresponds to a fractional difference on the original scale. The agreement is exact if the fractional difference is based on the logarithmic mean. The transform y = 100 logex leads to differences, standard deviations and regression coefficients of y that are equivalent to symmetric percentage differences, standard deviations and regression coefficients of x.

Several simple clinical examples show that the 100 loge scale is the natural scale on which to express percentage differences. The term sympercent or s% is proposed for them. Sympercents should improve the presentation of log transformed data and lead to a wider understanding of the natural log transformation.

“Born Again Group Testing: Multiaccess Communications”, Wolf 1985

1985-wolf.pdf: “Born again group testing: Multiaccess communications”⁠, J. Wolf (1985-03-01)

The Rating of Chessplayers, Past and Present (Second Edition)”, Elo 1978

1978-elo-theratingofchessplayerspastandpresent.pdf: The Rating of Chessplayers, Past and Present (Second Edition)⁠, Arpad E. Elo (1978; ⁠, ; backlinks; similar):

One of the most extraordinary books ever written about chess and chessplayers, this authoritative study goes well beyond a lucid explanation of how today’s chessmasters and tournament players are rated⁠. Twenty years’ research and practice produce a wealth of thought-provoking and hitherto unpublished material on the nature and development of high-level talent:

Just what constitutes an “exceptional performance” at the chessboard? Can you really profit from chess lessons? What is the lifetime pattern of Grandmaster development? Where are the masters born? Does your child have master potential?

The step-by-step rating system exposition should enable any reader to become an expert on it. For some it may suggest fresh approaches to performance measurement and handicapping in bowling, bridge, golf and elsewhere. 43 charts, diagrams and maps supplement the text.

How and why are chessmasters statistically remarkable? How much will your rating rise if you work with the devotion of a Steinitz? At what age should study begin? What toll does age take, and when does it begin?

Development of the performance data, covering hundreds of years and thousands of players, has revealed a fresh and exciting version of chess history. One of the many tables identifies 500 all-time chess greats, with personal data and top lifetime performance ratings.

Just what does government assistance do for chess? What is the Soviet secret? What can we learn from the Icelanders? Why did the small city of Plovdiv produce three Grandmasters in only ten years? Who are the untitled dead? Did Euwe take the championship from Alekhine on a fluke? How would Fischer fare against Morphy in a ten-wins match?

“It was inevitable that this fascinating story be written”, asserts FIDE President Max Euwe, who introduces the book and recognizes the major part played by ratings in today’s burgeoning international activity. Although this is the definitive ratings work, with statistics alone sufficient to place it in every reference library, it was written by a gentle scientist for pleasurable reading—for the enjoyment of the truths, the questions, and the opportunities it reveals.