
Links
 “PiRank: Learning To Rank via Differentiable Sorting”, Swezey et al 2020
 “Rank-Smoothed Pairwise Learning In Perceptual Quality Assessment”, Talebi et al 2020
 “Self-Play Learning Without a Reward Metric”, Schmidt et al 2019
 “GPT-2 Preference Learning for Music Generation”, Branwen 2019
 “How Should We Critique Research?”, Branwen 2019
 “Group Testing: An Information Theory Perspective”, Aldridge et al 2019
 “Open Questions”, Branwen 2018
 “OptionGAN: Learning Joint Reward-Policy Options Using Generative Adversarial Inverse Reinforcement Learning”, Henderson et al 2017
 “Analogical-based Bayesian Optimization”, Le et al 2017
 “Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking”, Chen et al 2017
 “Deep Reinforcement Learning from Human Preferences”, Christiano et al 2017
 “D-TS: Double Thompson Sampling for Dueling Bandits”, Wu & Liu 2016
 “Resorting Media Ratings”, Branwen 2015
 “Just Sort It! A Simple and Effective Approach to Active Preference Learning”, Maystre & Grossglauser 2015
 “On the Complexity of Best Arm Identification in Multi-Armed Bandit Models”, Kaufmann et al 2014
 “Bayesian Active Learning for Classification and Preference Learning”, Houlsby et al 2011
 “Tea Reviews”, Branwen 2011
 “Case Studies in Bayesian Computation Using INLA”, Martino & Rue 2010
 “Sorting from Noisy Information”, Braverman & Mossel 2009
 “Can People Distinguish Pâté From Dog Food?”, Bohannon et al 2009
 “Aggregating Inconsistent Information: Ranking and Clustering”, Ailon et al 2008
 “Pure Exploration for Multi-Armed Bandit Problems”, Bubeck et al 2008
 “Do More Expensive Wines Taste Better? Evidence from a Large Sample of Blind Tastings”, Goldstein et al 2008
 “Noisy Sorting Without Resampling”, Braverman & Mossel 2007
 “Sympercents: Symmetric Percentage Differences on the 100 Log_{e} Scale Simplify the Presentation of Log Transformed Data”, Cole 2000
 “Born Again Group Testing: Multiaccess Communications”, Wolf 1985
 “The Rating of Chessplayers, Past and Present (Second Edition)”, Elo 1978
 Miscellaneous
Links
“PiRank: Learning To Rank via Differentiable Sorting”, Swezey et al 2020
“PiRank: Learning To Rank via Differentiable Sorting” (2020-12-12):
A key challenge with machine learning approaches for ranking is the gap between the performance metrics of interest and the surrogate loss functions that can be optimized with gradient-based methods. This gap arises because ranking metrics typically involve a sorting operation which is not differentiable w.r.t. the model parameters. Prior works have proposed surrogates that are loosely related to ranking metrics or simple smoothed versions thereof. We propose PiRank, a new class of differentiable surrogates for ranking, which employ a continuous, temperature-controlled relaxation to the sorting operator. We show that PiRank exactly recovers the desired metrics in the limit of zero temperature and scales favorably with the problem size, both in theory and practice. Empirically, we demonstrate that PiRank significantly improves over existing approaches on publicly available internet-scale learning-to-rank benchmarks.
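The core idea of a temperature-controlled relaxation of sorting can be illustrated with a NeuralSort-style soft permutation matrix (a sketch of the general family of relaxations, not PiRank's exact surrogate; the function name is mine):

```python
import numpy as np

def soft_sort_matrix(s, tau=1.0):
    """NeuralSort-style row-stochastic relaxation of the sorting
    permutation matrix (descending order). As tau -> 0 the rows
    approach one-hot vectors selecting the sorted elements."""
    s = np.asarray(s, dtype=float)
    n = s.size
    A = np.abs(s[:, None] - s[None, :])            # pairwise |s_i - s_j|
    B = A.sum(axis=1)                              # A @ 1
    coeffs = n + 1 - 2 * np.arange(1, n + 1)       # n-1, n-3, ..., 1-n
    logits = (coeffs[:, None] * s[None, :] - B[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

scores = np.array([0.1, 2.0, -1.0, 0.5])
P = soft_sort_matrix(scores, tau=0.01)
soft_sorted = P @ scores   # ~ scores sorted in descending order
```

At larger τ the rows blur across neighboring ranks, which is what makes rank-dependent metrics differentiable; at τ → 0 the matrix hardens into the exact sorting permutation.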
“Rank-Smoothed Pairwise Learning In Perceptual Quality Assessment”, Talebi et al 2020
“Rank-Smoothed Pairwise Learning In Perceptual Quality Assessment” (2020-09-30):
Conducting pairwise comparisons is a widely used approach in curating human perceptual preference data. Typically raters are instructed to make their choices according to a specific set of rules that address certain dimensions of image quality and aesthetics. The outcome of this process is a dataset of sampled image pairs with their associated empirical preference probabilities. Training a model on these pairwise preferences is a common deep learning approach. However, optimizing by gradient descent through minibatch learning means that the “global” ranking of the images is not explicitly taken into account. In other words, each step of the gradient descent relies only on a limited number of pairwise comparisons. In this work, we demonstrate that regularizing the pairwise empirical probabilities with aggregated rank-wise probabilities leads to a more reliable training loss. We show that training a deep image quality assessment model with our rank-smoothed loss consistently improves the accuracy of predicting human preferences.
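A minimal sketch of the smoothing idea (a toy construction of my own: the paper's aggregated rank-wise probabilities are stood in for here by per-item mean win rates, and all data are hypothetical):

```python
import numpy as np

# Hypothetical toy data: (i, j, p) = empirical fraction of raters
# preferring image i over image j.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.6)]
n_items = 3

# A crude "rank-wise" probability: derive a global win rate per item
# and convert it back into a pairwise preference probability.
wins = np.zeros(n_items)
counts = np.zeros(n_items)
for i, j, p in pairs:
    wins[i] += p
    wins[j] += 1.0 - p
    counts[i] += 1
    counts[j] += 1
winrate = wins / counts

def rank_prob(i, j):
    """Preference probability implied by the items' global win rates."""
    return winrate[i] / (winrate[i] + winrate[j])

lam = 0.3  # smoothing weight (hyperparameter)
smoothed = [(i, j, (1 - lam) * p + lam * rank_prob(i, j)) for i, j, p in pairs]
# cross-entropy training targets would now be `smoothed` instead of `pairs`
```

The point of the blend is that each minibatch target carries a little global-ranking information, rather than relying only on the handful of raters who saw that particular pair.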
“Self-Play Learning Without a Reward Metric”, Schmidt et al 2019
“Self-Play Learning Without a Reward Metric” (2019-12-16):
The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
“GPT-2 Preference Learning for Music Generation”, Branwen 2019
“GPT-2 Preference Learning for Music Generation” (2019-12-16):
Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict the global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.
Standard language generation neural network models, like GPT-2, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic word-by-word generation, and cannot improve on the training data because the models are trained only to predict ‘realistic’ completions of the training data.
A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a black-box DRL algorithm like PPO.
OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation, I experimented with GPT-2 preference learning for unconditional music and poetry generation.
I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≈ 7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n ≈ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 46%).
This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it is the first, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.
Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the black-box reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and I propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output.
“How Should We Critique Research?”, Branwen 2019
“How Should We Critique Research?” (2019-05-19):
Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.
Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of metascience have demonstrated that much research is of low quality and often false.
But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?
I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.
This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.
If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.
“Group Testing: An Information Theory Perspective”, Aldridge et al 2019
“Group Testing: An Information Theory Perspective”, Aldridge, Matthew; Johnson, Oliver; Scarlett, Jonathan (2019-01-01)
“Open Questions”, Branwen 2018
“Open Questions” (2018-10-17):
Some anomalies/questions which are not necessarily important, but do puzzle me or where I find existing explanations to be unsatisfying.
A list of some questions which are not necessarily important, but do puzzle me or where I find existing ‘answers’ to be unsatisfying, categorized by subject (along the lines of Patrick Collison’s list & Alex Guzey; see also my list of project ideas).
“OptionGAN: Learning Joint Reward-Policy Options Using Generative Adversarial Inverse Reinforcement Learning”, Henderson et al 2017
“OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning” (2017-09-20):
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.
“Analogical-based Bayesian Optimization”, Le et al 2017
“Analogical-based Bayesian Optimization” (2017-09-19):
Some real-world problems revolve around solving the optimization problem max_{x ∈ 𝒳} f(x), where f(·) is a black-box function and 𝒳 might be a set of non-vectorial objects (eg. distributions) on which we can only define a symmetric and non-negative similarity score. This setting requires a novel view of the standard framework of Bayesian Optimization that generalizes the core spirit of this framework.
In this spirit, in this paper we propose Analogical-based Bayesian Optimization, which can maximize a black-box function over a domain where only a similarity score can be defined. Our pathway is as follows: we first build on the geometric view of Gaussian Processes (GP) to define the concept of influence level, which allows us to analytically represent the predictive means and variances of GP posteriors, and then use that view to enable replacing the kernel similarity with a more generic similarity score. Furthermore, we also propose two strategies to find a batch of query points that can efficiently handle high-dimensional data.
“Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking”, Chen et al 2017
“Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking” (2017-07-31):
This paper is concerned with the problem of top-K ranking from pairwise comparisons. Given a collection of n items and a few pairwise comparisons across them, one wishes to identify the set of K items that receive the highest ranks.
To tackle this problem, we adopt the logistic parametric model—the Bradley-Terry-Luce model, where each item is assigned a latent preference score, and where the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made substantial progress towards characterizing the performance (eg. the mean square error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand regarding top-K ranking remains unsettled.
We demonstrate that under a natural random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity—the number of paired comparisons needed to ensure exact top-K identification, for the fixed dynamic range regime. This is accomplished via optimal control of the entrywise error of the score estimates.
We complement our theoretical studies by numerical experiments, confirming that both methods yield low entrywise errors for estimating the underlying scores. Our theory is established via a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan sin Θ theorem for symmetric matrices. This also allows us to close the gap between the ℓ_{2} error upper bound for the spectral method and the minimax lower limit.
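A common concrete form of the spectral method is Rank Centrality: build a Markov chain that steps from each item toward the items that beat it, and read scores off its stationary distribution (a hedged sketch of the general technique; the paper's exact estimator and sampling model differ in details):

```python
import numpy as np

def rank_centrality(wins):
    """Spectral ranking ('Rank Centrality'): the stationary distribution
    of a Markov chain whose transitions follow empirical loss rates
    orders the items by strength.
    wins[i, j] = number of times item i beat item j."""
    n = wins.shape[0]
    total = wins + wins.T                        # comparisons per pair
    with np.errstate(divide="ignore", invalid="ignore"):
        frac = np.where(total > 0, wins / total, 0.0)  # P(i beats j)
    P = frac.T / n                               # step i -> j ∝ P(j beats i)
    np.fill_diagonal(P, 0.0)
    P[np.arange(n), np.arange(n)] = 1.0 - P.sum(axis=1)  # lazy self-loops
    pi = np.full(n, 1.0 / n)
    for _ in range(1000):                        # power iteration
        pi = pi @ P
    return pi / pi.sum()

# Toy data: item 0 mostly beats 1 and 2; item 1 mostly beats 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
scores = rank_centrality(wins)
# scores should order the items 0 > 1 > 2
```

Mass drains toward items that win their comparisons, so the stationary probabilities act as strength estimates; the top-K items are simply the K largest entries.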
“Deep Reinforcement Learning from Human Preferences”, Christiano et al 2017
“Deep reinforcement learning from human preferences” (2017-06-12):
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
“D-TS: Double Thompson Sampling for Dueling Bandits”, Wu & Liu 2016
“D-TS: Double Thompson Sampling for Dueling Bandits” (2016-04-25):
In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. As indicated by its name, D-TS selects both the first and the second candidates according to Thompson Sampling.
Specifically, D-TS maintains a posterior distribution for the preference matrix, and chooses the pair of arms for comparison by sampling twice from the posterior distribution. This simple algorithm applies to general Copeland dueling bandits, including Condorcet dueling bandits as its special case.
For general Copeland dueling bandits, we show that D-TS achieves 𝒪(K^{2} log T) regret. For Condorcet dueling bandits, we further simplify the D-TS algorithm and show that the simplified D-TS algorithm achieves 𝒪(K log T + K^{2} log log T) regret. Simulation results based on both synthetic and real-world data demonstrate the efficiency of the proposed D-TS algorithm.
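The double-sampling selection rule can be sketched as follows (a simplified variant: the full algorithm also prunes candidates with confidence bounds, which is omitted here, and the preference matrix is a hypothetical toy example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true preference matrix: p_true[i, j] = P(arm i beats arm j);
# arm 0 is the Condorcet winner.
p_true = np.array([[0.5, 0.7, 0.8],
                   [0.3, 0.5, 0.6],
                   [0.2, 0.4, 0.5]])
K = p_true.shape[0]
wins = np.ones((K, K))     # Beta(1, 1) prior on each pairwise probability
losses = np.ones((K, K))

def dts_round():
    """One round: sample the preference matrix twice from the Beta
    posteriors; pick the first arm as a sampled Copeland winner and the
    second as its sampled strongest challenger; duel and update."""
    theta1 = rng.beta(wins, losses)
    np.fill_diagonal(theta1, 0.5)
    a1 = int(np.argmax((theta1 > 0.5).sum(axis=1)))   # Copeland scores
    theta2 = rng.beta(wins, losses)
    col = theta2[:, a1].copy()
    col[a1] = -np.inf                                 # forbid a self-duel
    a2 = int(np.argmax(col))                          # best challenger vs a1
    a1_won = rng.random() < p_true[a1, a2]
    w, l = (a1, a2) if a1_won else (a2, a1)
    wins[w, l] += 1
    losses[l, w] += 1

for _ in range(2000):
    dts_round()

posterior_mean = wins / (wins + losses)
# the posterior should identify arm 0 as the (Copeland/Condorcet) winner
```

Because both candidates come from posterior samples, exploration concentrates automatically on the pairs whose preference probabilities are still uncertain.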
“Resorting Media Ratings”, Branwen 2015
“Resorting Media Ratings” (2015-09-07):
Command-line tool providing interactive statistical pairwise ranking and sorting of items.
User-created datasets using ordinal scales (such as media ratings) have tendencies to drift or ‘clump’ towards the extremes and fail to be as informative as possible, falling prey to ceiling effects and making it difficult to distinguish between the mediocre & excellent.
This can be counteracted by re-rating the dataset to create a uniform (and hence, informative) distribution of ratings—but such manual re-rating is difficult.
I provide an anytime CLI program, resorter, written in R (should be cross-platform but only tested on Linux), which keeps track of comparisons, infers underlying ratings assuming that they are noisy in the Elo-like Bradley-Terry model, and interactively & intelligently queries the user with comparisons of the media with the most uncertain current ratings, until the user ends the session and a fully rescaled set of ratings is output.
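The underlying model can be fit with the classic minorization-maximization (Zermelo) iteration (a sketch of the Bradley-Terry machinery such a tool relies on, in Python rather than R, and not resorter's actual code; the comparison tallies are a toy example):

```python
import numpy as np

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration:
    p_i <- W_i / sum_j n_ij / (p_i + p_j), where W_i is item i's total
    wins and n_ij the number of i-vs-j comparisons.
    wins[i, j] = number of times item i beat item j."""
    n = wins.shape[0]
    total = wins + wins.T
    W = wins.sum(axis=1)
    p = np.ones(n)
    for _ in range(iters):
        denom = total / (p[:, None] + p[None, :])
        p = W / denom.sum(axis=1)
        p /= p.sum()            # strengths are scale-free; fix the scale
    return p

# Toy comparison tallies among 3 media items.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)
strengths = fit_bradley_terry(wins)
p01 = strengths[0] / (strengths[0] + strengths[1])  # implied P(0 preferred to 1)
```

An interactive re-rater would then query the pair whose implied preference probability is closest to 50% (i.e. most uncertain), refit, and repeat.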
“Just Sort It! A Simple and Effective Approach to Active Preference Learning”, Maystre & Grossglauser 2015
“Just Sort It! A Simple and Effective Approach to Active Preference Learning” (2015-02-19):
We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking?
We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.
“On the Complexity of Best Arm Identification in Multi-Armed Bandit Models”, Kaufmann et al 2014
“On the Complexity of Best Arm Identification in Multi-Armed Bandit Models” (2014-07-16):
The stochastic multi-armed bandit model is a simple abstraction that has proven useful in many different contexts in statistics and machine learning. Whereas the achievable limit in terms of regret minimization is now well known, our aim is to contribute to a better understanding of the performance in terms of identifying the m best arms. We introduce generic notions of complexity for the two dominant frameworks considered in the literature: fixed-budget and fixed-confidence settings.
In the fixed-confidence setting, we provide the first known distribution-dependent lower bound on the complexity that involves information-theoretic quantities and holds when m is larger than 1 under general assumptions. In the specific case of two-armed bandits, we derive refined lower bounds in both the fixed-confidence and fixed-budget settings, along with matching algorithms for Gaussian and Bernoulli bandit models. These results show in particular that the complexity of the fixed-budget setting may be smaller than the complexity of the fixed-confidence setting, contradicting the familiar behavior observed when testing fully specified alternatives.
In addition, we also provide improved sequential stopping rules that have guaranteed error probabilities and shorter average running times. The proofs rely on two technical results that are of independent interest: a deviation lemma for self-normalized sums (Lemma 19) and a novel change of measure inequality for bandit models (Lemma 1).
“Bayesian Active Learning for Classification and Preference Learning”, Houlsby et al 2011
“Bayesian Active Learning for Classification and Preference Learning” (2011-12-24):
Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.
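The key reformulation of information gain as a difference of predictive entropies (the BALD criterion) is easy to state in code (a sketch for binary outputs with Monte Carlo posterior samples; the paper works this out with analytic approximations for the GPC):

```python
import numpy as np

def bald_scores(prob_samples):
    """BALD acquisition for binary outputs:
    I(y; w | x) = H[E_w p(y | x, w)] - E_w H[p(y | x, w)].
    prob_samples: (S, N) array of P(y = 1) for N candidate points under
    S samples from the model's posterior over parameters w."""
    def H(p):  # binary entropy in nats
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return H(prob_samples.mean(axis=0)) - H(prob_samples).mean(axis=0)

# Point 0: every posterior sample says 0.5 -- noise, not model uncertainty.
# Point 1: samples disagree (0.05 vs 0.95) -- querying it is informative.
samples = np.array([[0.5, 0.05],
                    [0.5, 0.95]])
scores = bald_scores(samples)
# scores[1] > scores[0]: prefer the point the posterior disagrees about
```

The criterion distinguishes points the model is confidently uncertain about (high noise, zero information gain) from points where the posterior over parameters disagrees with itself (high information gain).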
“Tea Reviews”, Branwen 2011
“Tea Reviews” (2011-04-13):
Teas I have drunk, with reviews and future purchases; focused primarily on oolongs and greens. Plus experiments on water.
Electric kettles are faster, but I was curious how much faster my electric kettle heated water to high or boiling temperatures than does my stovetop kettle. So I collected some data and compared them directly, trying out a number of statistical methods (principally: nonparametric & parametric tests of difference, linear & beta regression models, and a Bayesian measurement error model). My electric kettle is faster than the stovetop kettle (the difference is both statistically-significant p≪0.01 & the posterior probability of difference is P ≈ 1), and the modeling suggests time to boil is largely predictable from a combination of volume, end-temperature, and kettle type.
“Case Studies in Bayesian Computation Using INLA”, Martino & Rue 2010
“Case Studies in Bayesian Computation Using INLA” (2010):
Latent Gaussian models are a common construct in statistical applications where a latent Gaussian field, indirectly observed through data, is used to model, for instance, time and space dependence or the smooth effect of covariates. Many well-known statistical models, such as smoothing-spline models, space-time models, semiparametric regression, spatial and spatiotemporal models, log-Gaussian Cox models, and geostatistical models are latent Gaussian models.
Integrated Nested Laplace Approximation (INLA) is a new approach to implement Bayesian inference for such models. It provides approximations of the posterior marginals of the latent variables which are both very accurate and extremely fast to compute. Moreover, INLA treats latent Gaussian models in a general way, thus allowing for a great deal of automation in the inferential procedure. The inla programme, bundled in the R library INLA, is a prototype of such a black-box for inference on latent Gaussian models which is both flexible and user-friendly. It is meant, hopefully, to make latent Gaussian models applicable, useful and appealing for a larger class of users. [Keywords: approximate Bayesian inference, latent Gaussian model, Laplace approximations, structured additive regression models]
“Sorting from Noisy Information”, Braverman & Mossel 2009
“Sorting from Noisy Information” (2009-10-07):
This paper studies problems of inferring order given noisy information. In these problems there is an unknown order (permutation) π on n elements denoted by 1, …, n. We assume that information is generated in a way correlated with π. The goal is to find a maximum likelihood π^{✱} given the information observed. We consider two different types of observations: noisy comparisons and noisy orders. In the noisy-orders model, the data are permutations drawn from an exponential distribution correlated with π (this is also called the Mallows model). In the noisy-comparisons model, the data are a signal given for each pair of elements which is correlated with their true ordering.
In this paper we present polynomial time algorithms for solving both problems with high probability. As part of our proof we show that for both models the maximum likelihood solution π^{✱} is close to the original permutation π.
Our results are of interest in applications to ranking, such as ranking in sports, or ranking of search items based on comparisons by experts.
“Can People Distinguish Pâté From Dog Food?”, Bohannon et al 2009
“Can People Distinguish Pâté From Dog Food?” (2009-04-01):
Considering the similarity of its ingredients, canned dog food could be a suitable and inexpensive substitute for pâté or processed blended meat products such as Spam or liverwurst. However, the social stigma associated with the human consumption of pet food makes an unbiased comparison challenging.
To prevent bias, Newman’s Own dog food was prepared with a food processor to have the texture and appearance of a liver mousse. In a double-blind test, subjects were presented with 5 unlabeled blended meat products, one of which was the prepared dog food. After ranking the samples on the basis of taste, subjects were challenged to identify which of the 5 was dog food.
Although 72% of subjects ranked the dog food as the worst of the 5 samples in terms of taste (Newell and MacFarlane multiple comparison, p < 0.05), subjects were not better than random at correctly identifying the dog food.
[Popularizations: Bohannon 2009, Bohannon et al 2010.]
“Aggregating Inconsistent Information: Ranking and Clustering”, Ailon et al 2008
“Aggregating inconsistent information: Ranking and clustering” (2008-11):
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the extent of disagreement with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a longstanding conjecture of Bang-Jensen and Thomassen and show that unless NP ⊆ BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
“Pure Exploration for Multi-Armed Bandit Problems”, Bubeck et al 2008
“Pure Exploration for Multi-Armed Bandit Problems” (2008-02-19):
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an online exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards.
We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters.
The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
“Do More Expensive Wines Taste Better? Evidence from a Large Sample of Blind Tastings”, Goldstein et al 2008
“Do More Expensive Wines Taste Better? Evidence from a Large Sample of Blind Tastings” (2008):
Individuals who are unaware of the price do not derive more enjoyment from more expensive wine.
In a sample of more than 6,000 blind tastings, we find that the correlation between price and overall rating is small and negative, suggesting that individuals on average enjoy more expensive wines slightly less. For individuals with wine training, however, we find indications of a non-negative relationship between price and enjoyment. Our results are robust to the inclusion of individual fixed effects, and are not driven by outliers: when omitting the top and bottom deciles of the price distribution, our qualitative results are strengthened, and the statistical-significance is improved further.
These findings suggest that non-expert wine consumers should not anticipate greater enjoyment of the intrinsic qualities of a wine simply because it is expensive or is appreciated by experts.
“Noisy Sorting Without Resampling”, Braverman & Mossel 2007
“Noisy Sorting Without Resampling” (2007-07-06):
In this paper we study noisy sorting without resampling. In this problem there is an unknown order a_{π(1)} < … < a_{π(n)} where π is a permutation on n elements. The input is the status of all (n choose 2) queries of the form q(a_{i}, a_{j}), where q(a_{i}, a_{j}) = + with probability at least 1⁄2 + γ if π(i) > π(j), for all pairs i ≠ j, where γ > 0 is a constant, and q(a_{i}, a_{j}) = −q(a_{j}, a_{i}) for all i and j. It is assumed that the errors are independent. Given the status of the queries the goal is to find the maximum likelihood order. In other words, the goal is to find a permutation σ that minimizes the number of pairs σ(i) > σ(j) where q(σ(i), σ(j)) = −. The problem so defined is the feedback arc set problem on distributions of inputs, each of which is a tournament obtained as a noisy perturbation of a linear order. Note that when γ < 1⁄2 and n is large, it is impossible to recover the original order π.
It is known that the weighted feedback arc set problem on tournaments is NP-hard in general. Here we present an algorithm with running time n^{𝒪(γ^{−4})} and sampling complexity 𝒪_{γ}(n log n) that with high probability solves the noisy sorting without resampling problem. We also show that if a_{σ(1)}, a_{σ(2)}, …, a_{σ(n)} is an optimal solution of the problem then it is “close” to the original order. More formally, with high probability it holds that ∑_{i} |σ(i) − π(i)| = Θ(n) and max_{i} |σ(i) − π(i)| = Θ(log n).
Our results are of interest in applications to ranking, such as ranking in sports, or ranking of search items based on comparisons by experts.
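The setup above (one noisy query per pair, maximum-likelihood order = minimum feedback arc set) can be sketched in a few lines of Python. This is a toy illustration, not the paper's algorithm: it brute-forces the ML order on a tiny instance, and the names (`noisy_queries`, `violations`) are my own:

```python
import itertools
import random

def noisy_queries(rank, gamma, rng):
    """One noisy query per pair (no resampling): q[i, j] = +1 if the
    comparison claims rank[i] > rank[j]; answers are antisymmetric,
    errors independent, each correct with probability 1/2 + gamma."""
    q = {}
    for i, j in itertools.combinations(range(len(rank)), 2):
        truth = 1 if rank[i] > rank[j] else -1
        ans = truth if rng.random() < 0.5 + gamma else -truth
        q[i, j], q[j, i] = ans, -ans
    return q

def violations(order, q):
    """Count pairs that `order` (listed smallest-first) places in the
    opposite direction from their query; the maximum-likelihood order
    minimizes this count (a feedback-arc-set objective)."""
    return sum(1 for k, l in itertools.combinations(range(len(order)), 2)
               if q[order[k], order[l]] == +1)

# Brute-force the ML order for a 5-element instance with hidden ranks
# [2, 0, 3, 1, 4] and gamma = 0.4 (each query flipped with probability 0.1):
q = noisy_queries([2, 0, 3, 1, 4], gamma=0.4, rng=random.Random(0))
best = min(itertools.permutations(range(5)), key=lambda o: violations(o, q))
```

The point of the paper is that this objective, NP-hard on arbitrary tournaments, is solvable in polynomial time with high probability on tournaments generated by this noise model, with the recovered order provably close to the truth.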
“Sympercents: Symmetric Percentage Differences on the 100 Log_{e} Scale Simplify the Presentation of Log Transformed Data”, Cole 2000
2000cole.pdf
: “Sympercents: symmetric percentage differences on the 100 log_{e} scale simplify the presentation of log transformed data”, (2000-11-08; similar):
The results of analyses on log transformed data are usually backtransformed and interpreted on the original scale.
Yet if natural logs are used this is not necessary—the log scale can be interpreted as it stands. A difference of natural logs corresponds to a fractional difference on the original scale. The agreement is exact if the fractional difference is based on the logarithmic mean. The transform y = 100 log_{e}x leads to differences, standard deviations and regression coefficients of y that are equivalent to symmetric percentage differences, standard deviations and regression coefficients of x.
Several simple clinical examples show that the 100 log_{e} scale is the natural scale on which to express percentage differences. The term sympercent or s% is proposed for them. Sympercents should improve the presentation of log transformed data and lead to a wider understanding of the natural log transformation.
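The transform is simple enough to state in two lines of Python; this sketch (the function name is my own) shows the symmetry property that ordinary percentage differences lack:

```python
import math

def sympercent_diff(x1, x2):
    """Sympercent (s%) difference of x2 relative to x1: the difference
    on the 100 * log_e scale. Unlike an ordinary percent change, it
    negates exactly when x1 and x2 are swapped."""
    return 100 * (math.log(x2) - math.log(x1))

# Ordinary percentage changes are asymmetric: 100 -> 125 is +25%,
# but 125 -> 100 is -20%. On the sympercent scale both directions
# have the same magnitude:
up = sympercent_diff(100, 125)    # ~= +22.31 s%
down = sympercent_diff(125, 100)  # ~= -22.31 s%
```

Because differences of logs add, sympercent differences also compose exactly along a chain of comparisons, which ordinary percent changes do not.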
“Born Again Group Testing: Multiaccess Communications”, Wolf 1985
1985wolf.pdf
: “Born again group testing: Multiaccess communications”, J. Wolf (1985-03-01)
“The Rating of Chessplayers, Past and Present (Second Edition)”, Elo 1978
1978elotheratingofchessplayerspastandpresent.pdf
: “The Rating of Chessplayers, Past and Present (Second Edition)”, (1978; backlinks; similar):
One of the most extraordinary books ever written about chess and chessplayers, this authoritative study goes well beyond a lucid explanation of how today’s chessmasters and tournament players are rated. Twenty years’ research and practice produce a wealth of thought-provoking and hitherto unpublished material on the nature and development of high-level talent:
Just what constitutes an “exceptional performance” at the chessboard? Can you really profit from chess lessons? What is the lifetime pattern of Grandmaster development? Where are the masters born? Does your child have master potential?
The step-by-step rating system exposition should enable any reader to become an expert on it. For some it may suggest fresh approaches to performance measurement and handicapping in bowling, bridge, golf and elsewhere. 43 charts, diagrams and maps supplement the text.
How and why are chessmasters statistically remarkable? How much will your rating rise if you work with the devotion of a Steinitz? At what age should study begin? What toll does age take, and when does it begin?
Development of the performance data, covering hundreds of years and thousands of players, has revealed a fresh and exciting version of chess history. One of the many tables identifies 500 all-time chess greats, with personal data and top lifetime performance ratings.
Just what does government assistance do for chess? What is the Soviet secret? What can we learn from the Icelanders? Why did the small city of Plovdiv produce three Grandmasters in only ten years? Who are the untitled dead? Did Euwe take the championship from Alekhine on a fluke? How would Fischer fare against Morphy in a ten-wins match?
“It was inevitable that this fascinating story be written”, asserts FIDE President Max Euwe, who introduces the book and recognizes the major part played by ratings in today’s burgeoning international activity. Although this is the definitive ratings work, with statistics alone sufficient to place it in every reference library, it was written by a gentle scientist for pleasurable reading—for the enjoyment of the truths, the questions, and the opportunities it reveals.
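The blurb does not reproduce the rating formulas themselves, but the standard Elo update that the book develops can be sketched directly (a minimal version; the `k=32` factor is one conventional choice, not the book's only recommendation):

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo
    logistic model: a 400-point rating gap corresponds to 10:1 odds."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    """Post-game ratings given A's actual score (1 = win, 0.5 = draw,
    0 = loss): each player moves K times the gap between actual and
    expected score, so the two rating changes sum to zero."""
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (score_a - e_a),
            r_b + k * ((1 - score_a) - (1 - e_a)))
```

For example, two 1600-rated players have expected score 0.5 each, so a win moves the victor to 1616 and the loser to 1584 at K = 32.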
Miscellaneous

https://apps.dtic.mil/sti/pdfs/ADA417190.pdf#page=15
http://www.yisongyue.com/publications/icml2011_beat_the_mean.pdf
http://www.astro.cornell.edu/staff/loredo/bayes/bae.pdf
http://wwwcgi.cs.cmu.edu/afs/cs.cmu.edu/Web/People/harchol/Papers/SODA94ranking.pdf
http://math.bu.edu/individual/mg/research/glicko.pdf
http://johnjosephhorton.com/papers/longrun.pdf
20200403florianloitschtenkilogramsofchocolatetournamentdata.ods
2002pelc.pdf
2014kovacs.pdf