We fine-tuned GPT-3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers
to questions online—it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give
feedback to improve factual accuracy. We’re excited about developing more truthful AI, but challenges remain, such as coping with unfamiliar types of questions.
Language models like GPT-3 are useful for many different tasks, but have a tendency to “hallucinate”
information when performing tasks requiring obscure real-world knowledge. To address this, we taught GPT-3
to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as
“Search …”, “Find in page: …” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer.
The model is fine-tuned from GPT-3 using the same general methods we’ve used
previously. We begin by training the model to copy human demonstrations, which gives it the ability to use the text-based browser to answer questions. Then we
improve the helpfulness and accuracy of the model’s answers, by training a reward model to predict human preferences, and optimizing against it using either
reinforcement learning or rejection sampling (best-of-n).
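[Illustrative sketch of the reward-modeling step: the pairwise comparison loss commonly used for learning from human preferences; the `reward_model` callable and batch format here are hypothetical stand-ins, not the paper’s actual code.]

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred_answers, rejected_answers):
    """Pairwise preference loss: the reward model should assign a higher scalar
    score to the human-preferred answer than to the rejected one."""
    r_pref = reward_model(preferred_answers)   # shape: (batch,)
    r_rej = reward_model(rejected_answers)     # shape: (batch,)
    # -log sigmoid(r_pref - r_rej), averaged over the batch of comparisons
    return -F.logsigmoid(r_pref - r_rej).mean()
```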
…Our models outperform GPT-3 on TruthfulQA and exhibit more favourable scaling properties. However, our models
lag behind human performance, partly because they sometimes quote from unreliable sources (as shown in the question about ghosts above). We hope to reduce the
frequency of these failures using techniques like adversarial training.
…Evaluating factual accuracy: …
However, this approach raises a number of questions. What makes a source reliable? What claims are obvious enough to not require support? What trade-off should
be made between evaluations of factual accuracy and other criteria such as coherence? All of these were difficult judgment calls. We do not think that our model
picked up on much of this nuance, since it still makes basic errors. But we expect these kinds of decisions to become more important as AI systems improve, and
cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound. We also expect further considerations such as
transparency to be important.
Eventually, having models cite their sources will not be enough to evaluate factual accuracy. A sufficiently capable model would cherry-pick sources it expects
humans to find convincing, even if they do not reflect a fair assessment of the evidence. There are already signs of this happening (see the questions about
boats above). We hope to mitigate this using methods like debate.
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows
the model to search and navigate the web.
By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality
with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences.
This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
In this work we leverage existing solutions to these components: we outsource document retrieval to the Microsoft Bing Web Search API, and utilize unsupervised pre-training to achieve
high-quality synthesis by fine-tuning GPT-3. Instead of trying to improve these ingredients, we focus on combining
them using more faithful training objectives. Following Stiennon et al 2020, we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive
We make 2 key contributions:
We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in
an end-to-end fashion using general methods such as imitation learning and reinforcement learning.
We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge
the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.
…We use this data in 4 main ways: behavior cloning (ie. supervised fine-tuning) using the demonstrations, reward modeling using the comparisons, reinforcement
learning against the reward model, and rejection sampling against the reward model. Our best model uses a combination of behavior cloning and rejection sampling.
We also find reinforcement learning to provide some benefit when inference-time compute is more limited.
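[A minimal sketch of rejection sampling (best-of-n) against the reward model; the `policy` and `reward_model` interfaces are hypothetical stand-ins for the actual models.]

```python
def best_of_n(policy, reward_model, question, references, n=64):
    """Rejection sampling: draw n candidate answers from the BC/RL policy and
    keep the one the reward model scores highest."""
    candidates = [policy.sample(question, references) for _ in range(n)]
    scores = [reward_model.score(question, references, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```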
…We evaluate our best model in 3 different ways. First, we compare our model’s answers to answers written by our human demonstrators on a held-out set of
questions. Our model’s answers are preferred 56% of the time, demonstrating human-level usage of the text-based browser. Second, we compare our model’s answers
(with references stripped, for fairness) to the highest-voted answer provided by the ELI5 dataset. Our model’s
answers are preferred 69% of the time. Third, we evaluate our model on TruthfulQA, an adversarial dataset of short-form questions. Our model’s answers are true 75% of the time, and are both true and informative 54%
of the time, outperforming our base model (GPT-3), but falling short of human performance
…Environment design: …For this approach, we designed a text-based web-browsing environment. The language model is prompted with a written
summary of the current state of the environment, including the question, the text of the current page at the current cursor location, and some other information
(see Figure 1(b)). In response to this, the model must issue one of the commands given in Table 1, which performs an
action such as running a Bing search, clicking on a link, or scrolling around. This process is then repeated with a fresh context (hence, the only memory of
previous steps is what is recorded in the summary).
While the model is browsing, one of the actions it can take is to quote an extract from the current page. When this is performed, the page title, domain name
and extract are recorded to be used later as a reference. Browsing then continues until either the model issues a command to end browsing, the maximum number of
actions has been reached, or the maximum total length of references has been reached. At this point, as long as there is at least one reference, the model is
prompted with the question and the references, and must compose its final answer.
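[The command loop might be sketched as below; the environment and model interfaces, command strings, and limits are illustrative stand-ins rather than the paper’s exact Table 1 commands.]

```python
def browse_and_answer(model, env, question, max_actions=100, max_ref_chars=2000):
    """Illustrative browsing loop: each step the model sees a fresh text summary of
    the browser state and emits one command; quoted passages are kept as references."""
    references = []
    for _ in range(max_actions):
        state = env.summarize(question, references)   # question, page text at cursor, ...
        command = model.next_command(state)           # e.g. "Search ...", "Quote: ...", "End"
        if command.startswith("Quote:"):
            references.append((env.page_title, env.domain, command[len("Quote:"):].strip()))
        elif command.startswith("End"):
            break
        else:
            env.step(command)                         # run a Bing search, click a link, scroll
        if sum(len(quote) for _, _, quote in references) >= max_ref_chars:
            break
    # Provided at least one reference was collected, compose the final answer
    return model.compose_answer(question, references)
```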
Table 2: An answer produced by our 175B best-of-64 model to a randomly-chosen question from the ELI5 test set
(not cherry-picked). The full text of the references can be found in Appendix J, along with
answers from our human demonstrators and the ELI5 dataset. Further samples are available.
Why did we decide that certain words were “bad” and shouldn’t be used in social settings?
We have all been taught that there are certain words that should not be used in public, at least not without good reason. But why do some
words fall into this category while others do not? It turns out that there is not a single answer to this question, as each word has a different history and
carries different meanings in different cultures and families. However, there are some general categories that “bad” words tend to fall into. “Bad” words generally relate to parts of life that we don’t like talking about in public, like bathroom functions, or negative ways of talking about people’s religion, skin color, ability, or gender. Often, people say bad words when they are experiencing strong emotions, and those words are linked to our emotions.
References (titles only):
[1] Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org)
[2] On Words: ‘Bad’ Words and Why We Should Study Them | UVA Today (news.virginia.edu)
[3] The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com)
…Our results are shown in Figure 2. Our best model, the 175B best-of-64 model, produces answers that are preferred to those written by our
human demonstrators 56% of the time. This suggests that the use of human feedback is essential, since one would not expect to exceed 50% preference by imitating
demonstrations alone (although it may still be possible, by producing a less noisy policy). The same model produces answers that are preferred to the reference
answers from the ELI5 dataset 69% of the time. This is a substantial improvement over Krishna et
al 2021, whose best model’s answers are preferred 23% of the time to the reference answers, although they use substantially less compute than even our
…The combination of RL and rejection sampling also fails to offer much benefit over rejection sampling alone. One possible reason for this is that RL and
rejection sampling are optimizing against the same reward model, which can easily be overoptimized (especially by RL, as noted above). In addition to this, RL
reduces the entropy of the
policy, which hurts exploration. Adapting the RL objective to optimize rejection sampling performance is an interesting direction for future research. It is also
worth highlighting the importance of carefully tuning the BC baseline for these comparisons. As discussed in Appendix E, we tuned the number of BC
epochs and the sampling temperature using a combination of human evaluations and reward model score. This alone closed much of the gap we originally saw between BC
…Scaling trends with dataset size and parameter count are shown in Figures 6 and 7. For dataset size, doubling the number of
demonstrations increased the policy’s reward model score by about 0.13, and doubling the number of comparisons increased the reward model’s accuracy by about 1.8%.
For parameter count, the trends were noisier, but doubling the number of parameters in the policy increased its reward model score by roughly 0.09, and doubling
the number of parameters in the reward model increased its accuracy by roughly 0.4%.
…For rejection sampling, we analyzed how to trade off the number of samples against the number of model parameters for a given inference-time compute budget
(see Figure 8). We found that it is generally compute-efficient to use some amount of rejection sampling, but not too much. The models for our
main evaluations come from the Pareto frontier of this trade-off: the 760M best-of-4 model, the 13B best-of-16 model, and the 175B best-of-64 model.
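[Back-of-the-envelope illustration, not a figure from the paper: best-of-n inference cost scales roughly with n times the policy’s parameter count, so the three Pareto-frontier configurations correspond to very different total budgets.]

```python
# Crude proxy: per-answer compute ~ (number of samples) x (policy parameters),
# ignoring answer length and the cost of the browsing steps themselves.
pareto_configs = {"760M best-of-4": (0.76e9, 4),
                  "13B best-of-16": (13e9, 16),
                  "175B best-of-64": (175e9, 64)}
for name, (params, n) in pareto_configs.items():
    print(f"{name}: ~{params * n:.1e} parameter-samples per answer")
```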
[cf. Jones 2021]
Information retrieval is an important component in natural language processing, for knowledge intensive tasks such as question answering and fact checking.
Recently, information retrieval has seen the emergence of dense retrievers, based on neural networks, as an alternative to classical sparse methods based on
term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer
well to new domains or applications with no training data, and are often outperformed by term-frequency methods such as BM25 which are not supervised. Thus, a
natural question is whether it is possible to train dense retrievers without supervision. In this work, we explore the limits of contrastive learning as a way to train unsupervised
dense retrievers, and show that it leads to strong retrieval performance. More precisely, we show on the BEIR benchmark that our model outperforms BM25 on 11 out of 15 datasets.
Furthermore, when a few thousand examples are available, we show that fine-tuning our model on these leads to strong improvements compared to BM25. Finally, when used as pre-training before fine-tuning on the MS MARCO dataset, our technique obtains state-of-the-art results on the BEIR benchmark.
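[A minimal sketch of the kind of contrastive objective used to train such unsupervised dense retrievers, here with in-batch negatives; building positive pairs by cropping two spans from the same document is not shown.]

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: each query should score its own (positive)
    document above every other document in the batch."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = (q @ d.T) / temperature                    # (batch, batch) similarities
    labels = torch.arange(q.shape[0], device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```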
In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few
training examples as its input, and directly decodes the output without any update to its parameters. However, performance has been shown to strongly depend on the
selected training examples (termed prompt). In this work, we propose an efficient method for retrieving prompts for in-context learning using annotated data and a
LM. Given an input-output pair, we estimate the probability of the output given the input and a candidate training example as the prompt, and label training
examples as positive or negative based on this probability. We then train an efficient dense retriever from this data, which is used to retrieve training examples
as prompts at test time. We evaluate our approach on three sequence-to-sequence tasks where language utterances are mapped to meaning representations, and find
that it substantially outperforms prior work and multiple baselines across the board.
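[Simplified sketch of the labeling step; the `lm_log_prob` scoring function is a hypothetical stand-in for the LM used in the paper.]

```python
def label_prompt_candidates(lm_log_prob, x, y, candidates, k=5):
    """Score each candidate training example by the LM probability of the target
    output y when that example is prepended to the input x; the highest-scoring
    candidates become positives and the lowest negatives for retriever training."""
    scored = sorted(candidates,
                    key=lambda example: lm_log_prob(prompt=example, test_input=x, target=y),
                    reverse=True)
    return scored[:k], scored[-k:]   # (positives, negatives)
```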
It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the
bottleneck layer of a dual encoder, where the final score is simply a dot product
between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization.
In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. With
multi-stage training, surprisingly, scaling up the model size brings substantial improvement on a variety of retrieval tasks, especially for out-of-domain generalization.
Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform existing sparse and dense retrievers on the BEIR dataset (Thakur et al 2021) substantially. Most
surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to achieve the best
out-of-domain performance. All the GTR models are
We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized
by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component
models, making it a drop-in replacement for standard dense retrievers at test time. DrBoost enjoys several advantages compared to standard dense retrieval models.
It produces representations which are 4× more compact, while delivering comparable retrieval results. It also performs surprisingly well under approximate search
with coarse quantization, reducing latency and bandwidth needs by another 4×. In practice, this can make the difference between serving indices from disk versus
from memory, paving the way for much cheaper deployments.
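[A sketch of the test-time behavior only; training the components sequentially on the ensemble’s current retrieval mistakes is not shown.]

```python
import numpy as np

def drboost_embed(component_encoders, text):
    """DrBoost-style representation: concatenate the output vectors of all component
    encoders, so the ensemble is a drop-in replacement for a standard dense retriever
    and works with an ordinary dot-product (or cosine) search."""
    return np.concatenate([encode(text) for encode in component_encoders])
```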
Recent works for Open-domain Question Answering refer to an external knowledge base using a retriever model, optionally rerank the passages with a separate
reranker model and generate an answer using another reader model. Despite performing related tasks, the models have separate parameters and are weakly-coupled
during training. In this work, we propose casting the retriever and the reranker as hard-attention mechanisms applied sequentially within the transformer
architecture and feeding the resulting computed representations to the reader. In this singular model architecture the hidden representations are progressively
refined from the retriever to the reranker to the reader, which makes more efficient use of model capacity and also leads to better gradient flow when we train it in
an end-to-end manner. We also propose a pre-training methodology to effectively train this architecture. We evaluate our model on Natural Questions and TriviaQA
open datasets and for a fixed parameter budget, our model outperforms the previous state-of-the-art model by 1.0 and 0.7 exact match scores.
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens.
With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to
GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more
data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues
for improving language models through explicit memory at unprecedented scale.
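[Highly simplified sketch of the retrieval side; the `index` and `frozen_bert_embed` interfaces are hypothetical, and the chunked cross-attention itself is not shown.]

```python
def retrieve_neighbors_per_chunk(tokens, chunk_len, frozen_bert_embed, index, k=2):
    """RETRO-style retrieval: split the input into fixed-size chunks, embed each chunk
    with a frozen BERT encoder, and fetch its k nearest corpus chunks (with their
    continuations) from a precomputed index for the model to cross-attend to."""
    chunks = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
    return [index.nearest(frozen_bert_embed(chunk), k=k) for chunk in chunks]
```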
Most of today’s AI systems focus on using self-attention mechanisms and transformer architectures on large amounts of diverse data to achieve impressive
performance gains. In this paper, we propose to augment the transformer architecture with an external attention mechanism to bring external knowledge and context
to bear. By integrating external information into the prediction process, we hope to reduce the need for ever-larger models and increase the democratization of AI
systems. We find that the proposed external attention mechanism can significantly improve the performance of existing AI systems, allowing practitioners to easily
customize foundation AI models to many diverse downstream applications. In particular, we focus on the task of Commonsense Reasoning, demonstrating that the
proposed external attention mechanism can augment existing transformer models and significantly improve the model’s reasoning capabilities. The proposed system,
Knowledgeable External Attention for commonsense Reasoning (KEAR), reaches human parity on the open CommonsenseQA
research benchmark with an accuracy of 89.4% in comparison to the human accuracy of 88.9%.
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks,
similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream
tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a
cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine
(object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By
incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks,
such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition.
Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and
zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks.
Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, eg. ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5
accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models
while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We
call this instance of contrastive-tuning “Locked-image Text tuning” (LiT-tuning), which just teaches a text model to read
out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as
image classification or retrieval. The proposed LiT-tuning is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised)
and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned
model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution
ObjectNet test set.
The selection of text color is a time-consuming and important aspect in the designing of visual-textual presentation layout.
In this paper, we propose a novel deep neural network architecture for predicting text color in the designing of visual-textual presentation layout. The
proposed architecture consists of a text colorization network, a color harmony scoring network, and a text readability scoring network. The color harmony scoring
network is learned by training with color theme data with aesthetic scores. The text readability scoring network is learned by training with design works. Finally,
the text colorization network is designed to predict text colors by maximizing both color harmony and text readability, as well as learning from designer’s choice
In addition, this paper conducts a comparison with other methods based on random generation, color theory rules or similar features search.
Both quantitative and qualitative evaluation results demonstrate that the proposed method has better performance.
[Keywords: text colorization, color harmonization, text readability, visual-textual presentation design]
Color Combination Aesthetics Score Dataset: We obtained the Mechanical Turk public dataset from [14], which consists of 10,743
carefully selected color themes created by users on Adobe Kuler [1], covering a wide range of highly and poorly rated color themes, each of which was rated
by at least 3 random users with ratings between 1 and 5. The Mechanical Turk dataset uses Amazon Mechanical Turk [1] to collect more user ratings for the selected
topics, making each topic rated by 40 users. Finally, the average score for each topic was taken as the final score.
Visual-Textual Design Works Dataset: We constructed a visual-textual design dataset called VTDSet
(Visual-Textual Design Set) where 10 designers selected text colors in 5 to 7 areas on each of the total 1,226 images, resulting in 77,038 designed text
colors and their corresponding information. We randomly selected 10,000 design results associated with 1,000 background images from the dataset as the training
dataset, and 2,260 design results associated with the remaining 226 background images as the testing dataset.
…4.4 Comparison with Other Methods: We compare the text colorization network HTCN proposed in this
paper with the following 3 approaches:
Random Text Colorization (“Random”). A random value is selected in the RGB color space, and this
baseline is used to check whether the color design of the text in the generation of the visual-textual presentation layout is arbitrary.
Text Colorization Based on Matsuda Color Wheel Theory
(“Matsuda CW”). This text colorization method is based on the color wheel theory, which is also adopted in the work of Yang et al [18]. We reproduce the
method by first performing principal component analysis on the image to obtain the color theme, taking the color with the largest proportion as the base color
Cd of the image, and then calculating the minimum harmonic color wheel distance between the base color Cd and the
aesthetic template color set according to the constraint defined by Matsuda to obtain the optimal hue value of the text color Cr. Finally,
the color mean μh,s,v of the image covered by the text area is calculated, and the optimal text color is obtained by reasonably maximizing
the distance between μh,s,v and Cr in the (s, v) saturation and luminance space.
Text Colorization Based on Image Feature Retrieval (“Retrieval”). Retrieval-based strategy is frequently used in design, i.e., seeking
reference among solutions of similar problems. For the text colorization problem, the original designer’s color can become the recommended color when the
background image and the text area are similar. As a result, we concatenate the global features of the image and the local image features of the text-covered
region to obtain the K nearest neighbor recommendations for the current text coloring by the cosine distance. We used the VGG-16
network [15] pretrained on the ImageNet dataset, and selected the output of the fc6 layer as the image features. The combined feature of
the text region image I_text on the global image I is f = ⟨VGG(I), VGG(I_text)⟩. The text color corresponding to
the feature with greatest similarity in the design library is selected for colorization.
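[Rough sketch of this retrieval baseline, assuming a `vgg_fc6` feature extractor and a design library of precomputed features with their designer-chosen colors.]

```python
import numpy as np

def recommend_text_colors(vgg_fc6, image, text_region, design_library, k=5):
    """Concatenate global and text-region VGG-16 fc6 features, rank library designs
    by cosine similarity, and reuse the designers' text colors of the top matches."""
    f = np.concatenate([vgg_fc6(image), vgg_fc6(text_region)])
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    ranked = sorted(design_library, key=lambda d: cosine(f, d["feature"]), reverse=True)
    return [d["text_color"] for d in ranked[:k]]
```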
Contrastive learning with the InfoNCE objective is exceptionally successful in various self-supervised learning tasks. Recently, the
CLIP model yielded impressive results on zero-shot transfer learning when using InfoNCE for learning visual representations from natural language supervision. However, InfoNCE as a lower bound on the mutual information has been shown to perform poorly for high mutual information. In contrast, the
InfoLOOB upper bound (leave one out bound) works well for high mutual information but suffers from large variance and
instabilities. We introduce “Contrastive Leave One Out Boost” (CLOOB), where modern Hopfield networks boost learning with
the InfoLOOB objective. Modern Hopfield networks replace the original embeddings by retrieved
embeddings in the InfoLOOB objective. The retrieved embeddings give InfoLOOB
two assets. Firstly, the retrieved embeddings stabilize InfoLOOB, since they are less noisy and more
similar to one another than the original embeddings. Secondly, they are enriched by correlations, since the covariance structure of embeddings is reinforced
through retrievals. We compare CLOOB to CLIP after learning on the Conceptual
Captions and the YFCC dataset with respect to their zero-shot transfer learning performance on other datasets.
CLOOB consistently outperforms CLIP at zero-shot transfer learning across all
considered architectures and datasets.
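[In rough notation (temperature τ, similarity sim, batch of N pairs), the two bounds differ only in whether the positive pair appears in the denominator; CLOOB additionally replaces the embeddings in the InfoLOOB objective with ones retrieved by a modern Hopfield network.]

```latex
\mathcal{L}^{\mathrm{InfoNCE}}_i
  = -\log \frac{\exp(\mathrm{sim}(x_i, y_i)/\tau)}
               {\sum_{j=1}^{N} \exp(\mathrm{sim}(x_i, y_j)/\tau)},
\qquad
\mathcal{L}^{\mathrm{InfoLOOB}}_i
  = -\log \frac{\exp(\mathrm{sim}(x_i, y_i)/\tau)}
               {\sum_{j \neq i} \exp(\mathrm{sim}(x_i, y_j)/\tau)}
```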
A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With
further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to
difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning
objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both
hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch
Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is a one-loss deep hashing model that removes all
the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art
multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at https://github.com/kamwoh/orthohash
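[Simplified sketch of the single objective: only the core idea of pulling continuous codes toward their class’s binary orthogonal target via cosine similarity is shown, not the paper’s exact loss.]

```python
import torch.nn.functional as F

def orthohash_style_loss(continuous_codes, labels, target_codes):
    """Pull each sample's continuous code toward the binary (+/-1) orthogonal code
    assigned to its class by maximizing cosine similarity (minimizing 1 - cos);
    a BatchNorm layer on the codes can then take care of bit balance."""
    z = F.normalize(continuous_codes, dim=-1)
    t = F.normalize(target_codes[labels].float(), dim=-1)   # look up per-class targets
    return (1.0 - (z * t).sum(dim=-1)).mean()
```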
While large-scale pre-training has achieved great success in bridging the gap between vision and language, it still faces several challenges. First, the
cost for pre-training is expensive. Second, there is no efficient way to handle the data noise which degrades model performance. Third, previous methods only
leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In
this work, we propose an EfficientCLIP method via Ensemble Confident Learning to obtain a less noisy data subset.
Extra rich non-paired single-modal text data is used for boosting the generalization of text branch. We achieve the state-of-the-art performance on Chinese
cross-modal retrieval tasks with only 1⁄10th training resources compared to CLIP and WenLan, while showing excellent generalization to
single-modal tasks, including text retrieval and text classification.
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model
that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot
classification tasks. Training the same model on a different language is not trivial, since data in other languages might be not enough and the model needs
high-quality translations of the texts to guarantee a good performance. In this paper, we present the first CLIP model
for the Italian Language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results
show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of
image retrieval and zero-shot classification.
Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of
these techniques at extreme scale in complex production systems has been relatively unexplored. We consider the case of a popular visual discovery product, where
these representations are trained with multi-task learning, from use-case specific visual understanding (e.g. skin tone classification) to general
representation learning for all visual content (e.g. embeddings for retrieval). In this work, we describe how we (1) generate a dataset with over a billion images
via large weakly-supervised pretraining to improve the performance of these visual representations, and (2) leverage Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially
at 1B+ image scale. To support this backbone model, we detail a systematic approach to deriving weakly-supervised image annotations from heterogeneous text
signals, demonstrating the benefits of clustering techniques to handle the long-tail distribution of image labels. Through a comprehensive study of offline and
online evaluation, we show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications. The model is
deployed in a production visual shopping system, with 36% improvement in top-1 relevance and 23% improvement in click-through volume. We conduct extensive
experiments to better understand the empirical relationships between Transformer-based architectures, dataset scale, and the
performance of production vision systems.
We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an
end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction
between videos and languages from a large-scale video-text dataset. Different from them, we leverage a pretrained image-language model and simplify it into a two-stage
framework, with co-learning of image-text and enhancement of temporal relations between video frames and video-text respectively, making it able to train on comparatively
small datasets. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP)
model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the
tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major
text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT,
MSVD and VATEX.
Recent advances in large-scale pre-training such as GPT-3 allow seemingly high quality text to be generated
from a given prompt. However, such generation systems often suffer from problems of hallucinated facts, and are not inherently designed to incorporate useful
external information. Grounded generation models appear to offer remedies, but their training typically relies on rarely-available parallel data where
information-relevant documents are provided for context. We propose a framework that alleviates this data constraint by jointly training a grounded generator and
document retriever on the language model signal. The model learns to reward retrieval of the documents with the highest utility in generation, and attentively
combines them using a Mixture-of-Experts (MoE) ensemble to generate follow-on text. We demonstrate that both generator and retriever can take advantage of this
joint training and work synergistically to produce more informative and relevant text in both prose and dialogue generation.
Understanding and creating mathematics using natural mathematical language—the mixture of symbolic and natural language used by humans—is a challenging and
important problem for driving progress in machine learning. As a step in this direction, we develop NaturalProofs, a multi-domain corpus of mathematical statements
and their proofs, written in natural mathematical language. NaturalProofs unifies broad coverage, deep coverage, and low-resource mathematical sources, allowing
for evaluating both in-distribution and zero-shot generalization. Using NaturalProofs, we benchmark strong neural methods on mathematical reference retrieval and
generation tasks which test a system’s ability to determine key results that appear in a proof. Large-scale sequence models show promise compared to classical
information retrieval methods, yet their performance and out-of-domain generalization leave substantial room for improvement. NaturalProofs opens many avenues for
research on challenging mathematical tasks.
[Fun note: the corpus uses The Pile.]
In a bid to promote the research
and development of China’s own large-scale pretraining models and further explore universal intelligence from a more fundamental perspective, the Beijing
Academy of Artificial Intelligence (BAAI) recently unveiled Wu Dao 1.0, China’s first homegrown
super-scale intelligent model system. The work was led by BAAI Research Academic Vice President and Tsinghua
University Professor Tang Jie, with contributions from a team of more than 100 AI scientists from Peking University, Tsinghua University, Renmin University of
China, Chinese Academy of Sciences and other institutes.
Wu Dao 1.0 has initiated large-scale research projects via 4 related models: Wu Dao—Wen Yuan, Wu Dao—Wen Lan, Wu Dao—Wen Hui, and Wu Dao—Wen Su.
Wu Dao—Wen Yuan: is China’s largest-ever pretraining language model, boasting the best processing power in mainstream languages, including
Chinese and English. It has surpassed average human performance benchmarks on text categorization, sentiment analysis, natural language inference, reading
comprehension and more. The Wu Dao—Wen Yuan project is designed to explore universal natural language understanding (NLU) techniques and study brain-inspired language models. It has 2.6 billion parameters and is capable of performing cognitive
activities such as memorization, comprehension, retrieval, numerical calculation, multi-language, etc. Wu Dao—Wen Yuan has achieved GPT-3-comparable performance on 20 Chinese NLP tasks such as open-domain
answering, grammar correction, sentiment analysis, etc.
…Wen Yuan introduces the open-source Chinese pretraining model (CPM). Based on CPM, the CPM-Distill model reduces language confusion by 38% and achieves better results on downstream tasks.
Wu Dao—Wen Lan: meanwhile, is the first publicly available Chinese universal graphic multimodal pretraining model. The ultra-large-scale
multimodal pretraining model aims to break through the theoretical challenges of pretraining multimodal data based on a combination of graphics, text and
video, and eventually generate industrial-grade Chinese graphics pretraining models and applications that exceed SOTA
performance. Currently, the model has 1 billion parameters and is trained on 50 million graphic pairs collected from open sources. The Wu Dao—Wen Lan
model has reached SOTA performance, scoring 5% higher than the champion team on the Image Caption task on the
Chinese public multimodal test set AIC-ICC and 20% higher than the most
popular UNITER model on the Visual Entailment task.
…Wen Lan is the first Chinese generic multimodal pretraining model that can understand “connotative information” based on weak correlations of images and
text. Wen Lan uses an advanced cross-modal contrast learning algorithm: Given an image-text pair, it can enlarge the number of negative samples for each modal,
especially for those which are difficult to distinguish, further improving the expression ability of neural networks. It can easily replace image and text
encoders with the most advanced single-mode pretraining model, achieving 20× faster performance than the UNITER model.
Wu Dao—Wen Hui: is an ultra-large-scale cognitive-oriented pretraining model that focuses on a series of essential problems in general
artificial intelligence from a cognitive perspective, aiming to develop and enhance the logic/consciousness/reasoning-based cognitive capabilities of
pretraining models. Wu Dao—Wen Hui has reached 11.3 billion parameters, and through simple fine-tuning can generate poetry, make videos, draw pictures,
retrieve text, perform complex reasoning, etc. BAAI says the model achieves near-human performance on poetry
generation on the Turing test.
…Wen Hui proposes a new pretraining paradigm, Generative Language Model, breaking the bottlenecks of BERT and GPT. For the first time in history, a single model has achieved the best results in language understanding and generation tasks,
and surpassed common pretraining models such as BERT, RoBERTa and T5 trained on the same
volume of data. Wen Hui’s continuous vector based fine-tuning method, P-tuning, is the first autoregressive model that surpasses the autoencoder model in NLU tasks and has achieved SOTA results on more than 10 tasks such as Knowledge Extraction and SuperGLUE Few-shot Learning, with over 20% performance improvement. Wen Hui’s inverse prompting algorithm achieves close to human performance on the tasks of Q&A and poetry generation, and is the first model that
can generate classical Chinese poetry based on modern themes.
Wu Dao—Wen Su: is a large-scale training model for biomolecular structure prediction. It can handle super long biomolecular
structures, where it has achieved SOTA performance, interpretability and robustness. Based on Google’s BERT language model, Wu Dao—Wen Su has completed protein training on the 100 GB
UNIPARC database and gene training on 5–100,000 human peripheral blood immune cells (25–30 cell types) and
10,000 drug-resistant bacteria.
…Wen Su’s open-sourced FastMoE is the first high-performance MoE (Mixture-of-Experts Model)
system that supports the PyTorch framework and a variety of hardware. Only one line of code is required to complete the MoE transformation, and model training
speed is increased by 47× compared with the traditional PyTorch implementation.
[blog] Pre-trained representations are becoming crucial for many
NLP and perception tasks. While representation learning in NLP has
transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are
expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions,
MS-COCO, or CLIP all involve a non-trivial data collection
(and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models.
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the
Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual
representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality
search with complex text and text + image queries.
Recently multimodal transformer models have gained popularity because their performance on language and vision tasks suggest they learn rich visual-linguistic
representations. Focusing on zero-shot image retrieval tasks, we study three important factors which can impact the quality of learned representations: pretraining
data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream
task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform
deeper models with modality specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
We classify and re-examine some of the current approaches to improve the performance-computes trade-off of language models, including (1) non-causal models
(such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval.
We identify some limitations (1)—(4) suffer from. For example, (1) currently struggles with open-ended text generation with the output loosely constrained by
the input as well as performing general textual tasks like GPT-2/3 due to its need for a specific fine-tuning dataset. (2) and (3) do not
improve the prediction of the first ~10³ tokens. Scaling up a model size (eg. efficiently with (4)) still results in poor performance scaling for some
We argue (5) would resolve many of these limitations, and it can (a) reduce the amount of supervision and (b) efficiently extend the context over the entire
training dataset and the entire past of the current sample. We speculate how to modify MARGE to perform
unsupervised causal modeling that achieves (b) with the retriever jointly trained.
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual
multi-document paraphrasing objective. MARGE provides an alternative to the dominant masked language modeling
paradigm, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize
the likelihood of generating the original. We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization. The
objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance
on several tasks. For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation. We further show that
fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE
the most generally applicable pre-training method to date.
We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified
framework via multitask pre-training. Our goal is to learn universal representations that can map objects occurring in different modalities or texts expressed in
different languages into a common semantic space. In addition, to explicitly encourage fine-grained alignment between images and non-English languages, we
also propose Multimodal Code-switched Training (MCT) to combine monolingual pre-training and multimodal
pre-training via a code-switch strategy. Experiments are performed on the multilingual image retrieval task across two benchmark datasets, including
MSCOCO and Multi30K. M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on
downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on
knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their
world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit
non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning
recipe for retrieval-augmented generation (RAG)—models which combine pre-trained parametric and non-parametric
memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model
and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use
different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and
set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures.
For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a
state-of-the-art parametric-only seq2seq baseline.
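[Written out approximately, with retriever pη, generator pθ, and the top-k retrieved passages z, the two formulations marginalize over passages at different granularities:]

```latex
p_{\text{RAG-Sequence}}(y \mid x)
  \approx \sum_{z \in \mathrm{top\text{-}}k} p_\eta(z \mid x)\, p_\theta(y \mid x, z),
\qquad
p_{\text{RAG-Token}}(y \mid x)
  \approx \prod_{t} \sum_{z \in \mathrm{top\text{-}}k} p_\eta(z \mid x)\, p_\theta(y_t \mid x, z, y_{1:t-1})
```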
Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance
degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific
features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual
consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on
multilingual image-sentence retrieval and outperform prior work by 3–4% with less than 1/5th the training parameters compared to other word embedding
Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks
such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts.
To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to
retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to
pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step
that considers millions of documents.
We demonstrate the effectiveness of Retrieval Augmented Language Model pre-training (REALM) by fine-tuning on the
challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on
three popular Open-QA benchmarks, and find that we outperform all previous methods by a substantial margin (4–16% absolute accuracy), while also providing
qualitative benefits such as interpretability and modularity.
It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language
queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any
external context or knowledge. We show that this approach scales surprisingly well with model size and outperforms models that explicitly look up knowledge on
the open-domain variants of Natural Questions and WebQuestions.
Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be
incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal
Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to
relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior
work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since
MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and
languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness
of MULE on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In
addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE,
improves mean recall by up to 21.9% on a single language compared to prior work, with the most substantial gains seen on languages with relatively few
annotations. Our code is publicly available.
We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions. The dataset
comprises 270K threads from the Reddit forum “Explain Like I’m Five” (ELI5) where an online community provides
answers to questions which are comprehensible by five year olds. Compared to existing datasets, ELI5 comprises diverse
questions requiring multi-sentence answers. We provide a large set of web documents to help answer the question. Automatic and human evaluations show that
an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline. However, our
best model is still far from human performance since raters prefer gold responses in over 86% of cases, leaving ample opportunity for future improvement.
Industrial recommender systems deal with extremely large action spaces—many millions of items to recommend. Moreover, they need to serve billions of users, who
are unique at any point in time, making a complex user state space. Luckily, huge quantities of logged implicit feedback (eg. user clicks, dwell time) are
available for learning. Learning from the logged feedback is however subject to biases caused by only observing feedback on recommendations selected by the
previous versions of the recommender. In this work, we present a general recipe of addressing such biases in a production top-K recommender system at Youtube,
built with a policy-gradient-based algorithm, ie. REINFORCE. The contributions of the paper are: (1) scaling REINFORCE to a production
recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases in learning from logged feedback
collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4)
showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on Youtube.
Modeling of music audio semantics has been previously tackled through learning of mappings from audio data to high-level tags or latent unsupervised spaces. The
resulting semantic spaces are theoretically limited, either because the chosen high-level tags do not cover all of music semantics or because audio data itself is
not enough to determine music semantics. In this paper, we propose a generic framework for semantics modeling that focuses on the perception of the listener,
through EEG data, in addition to audio data. We implement this framework using a novel end-to-end 2-view Neural Network
(NN) architecture and a Deep Canonical Correlation Analysis (DCCA) loss function that forces the semantic
embedding spaces of both views to be maximally correlated. We also detail how the EEG dataset was collected and
use it to train our proposed model. We evaluate the learned semantic space in a transfer learning context, by using it as an audio feature extractor in an
independent dataset and proxy task: music audio-lyrics cross-modal retrieval. We show that our embedding model outperforms Spotify features and performs comparably
to a state-of-the-art embedding model that was trained on 700 times more data. We further discuss improvements to the model that are likely to improve its
Though deep neural networks have great success in natural language processing, they are limited at more knowledge intensive AI tasks, such as open-domain
Question Answering (QA). Existing end-to-end deep QA models need to process the entire text after observing the question, and therefore their complexity
in responding to a question is linear in the text size. This is prohibitive for practical tasks such as QA from Wikipedia, a novel, or the Web. We propose to solve
this scalability issue by using symbolic meaning representations, which can be indexed and retrieved efficiently with complexity that is independent of the text
size. We apply our approach, called the N-Gram Machine (NGM), to three representative tasks. First, as proof-of-concept, we demonstrate that NGM successfully solves the bAbI tasks of synthetic text. Second,
we show that NGM scales to a large corpus by experimenting on “life-long bAbI”, a special version of bAbI that contains
millions of sentences. Lastly on the WikiMovies dataset, we use NGM to induce latent structure (ie. schema) and answer questions from natural language Wikipedia text, with only QA pairs as weak supervision.
A significant amount of the world’s knowledge is stored in relational databases. However, the ability for users to retrieve facts from a database is limited due
to a lack of understanding of query languages such as SQL. We propose Seq2SQL, a deep
neural network for translating natural language questions to corresponding SQL queries. Our model leverages the structure of SQL queries to significantly reduce the output space of generated queries. Moreover, we
use rewards from in-the-loop query execution over the database to learn a policy to generate unordered parts of the query, which we show are less suitable for
optimization via cross entropy loss. In addition, we will
publish WikiSQL, a dataset of 80,654 hand-annotated examples of questions and SQL queries distributed across 24,241 tables from Wikipedia. This dataset is required to train our model and is an order of
magnitude larger than comparable datasets. By applying policy-based reinforcement learning with a query execution environment to WikiSQL, our model Seq2SQL outperforms attentional sequence-to-sequence models,
improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.
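[Sketch of the in-the-loop execution reward; the exact reward constants below are illustrative rather than quoted from the paper.]

```python
def execution_reward(db, predicted_sql, gold_result):
    """Run the generated query against the database and reward the policy based on
    the outcome: invalid queries are penalized most, valid-but-wrong results less,
    and queries whose result matches the ground truth are rewarded."""
    try:
        result = db.execute(predicted_sql)
    except Exception:
        return -2.0   # query fails to parse or execute
    return 1.0 if result == gold_result else -1.0
```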
YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a
high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval
dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model [since upgraded to REINFORCE]. We also
provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact.
[Keywords: recommender system, deep learning, scalability]
Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.”
Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must
inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory
capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new
information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to
rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external
memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.
Is it possible to build a system to determine the location where a photo was taken using just its pixels? In general, the problem seems exceptionally difficult:
it is trivial to construct situations where no location can be inferred. Yet images often contain informative cues such as landmarks, weather patterns, vegetation,
road markings, and architectural details, which in combination may allow one to determine an approximate location and occasionally an exact location. Websites such
as GeoGuessr and View from your Window suggest that humans are relatively good at integrating these cues to geolocate images, especially en masse. In computer
vision, the photo geolocation problem is usually approached using image retrieval methods. In contrast, we pose the problem as one of classification by subdividing
the surface of the earth into thousands of multi-scale geographic cells, and train a deep network using millions of geotagged images. While previous approaches
only recognize landmarks or perform approximate matching using global image descriptors, our model is able to use and integrate multiple visible cues. We show that
the resulting model, called PlaNet, outperforms previous approaches and even attains superhuman levels of accuracy in some cases. Moreover, we extend our model to
photo albums by combining it with a long short-term
memory (LSTM) architecture. By learning to exploit temporal coherence to geolocate uncertain photos, we
demonstrate that this model achieves a 50% performance improvement over the single-image model.