Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers’ productivity, without being able to measure it directly.
In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data.
We find that the rate at which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers’ perception of productivity.
…In this work we define acceptance rate as the fraction of completions shown to the developer that are subsequently accepted for inclusion in the source file. The IntelliCode Compose system uses the term CTR (Click Through Rate) for this and reports a value of 10% in online trials [12]. An alternative measure is that of DCPU (Daily Completions accepted Per User), for which a value of around 20 has been reported [3, 21]. To compare DCPU with acceptance rate one must, of course, also account for the time spent coding each day. For context, in our study GitHub Copilot has an acceptance rate of 27% and a mean DCPU in excess of 31.
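To make these definitions concrete, here is a minimal sketch of how the metrics relate for a single hypothetical user-day (the counts below are illustrative, not data from the study):

```python
# Hypothetical telemetry counts for one user-day (not data from the study).
shown_completions = 120      # completions displayed to the developer
accepted_completions = 32    # completions accepted into the source file
coding_hours = 5.0           # active coding time that day

acceptance_rate = accepted_completions / shown_completions   # ~0.27, i.e. 27%
dcpu = accepted_completions                                  # Daily Completions accepted Per User
accepted_per_hour = dcpu / coding_hours                      # needed to compare DCPU across users

print(f"acceptance rate: {acceptance_rate:.0%}, DCPU: {dcpu}, accepted/hour: {accepted_per_hour:.1f}")
```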
…Language Use: We are aware that there are substantial differences in how GitHub Copilot performs across programming languages. The most common languages among our user base are TypeScript (24.7% of all shown completions in the observed time frame, 21.9% for users in the survey), JavaScript (21.3%, 24.2%), and Python (14.1%, 14.5%). The latter 2 enjoy higher acceptance rates, possibly hinting at a relative strength of neural tooling versus deductive tooling for untyped languages. Regardless of language, survey participants had a slightly higher acceptance rate than the whole user base.
Figure 5: Programming language use by survey participants vs. all users.
…We were surprised to find that acceptance rate (number of acceptances normalized by the number of shown completions) was better correlated with reported productivity than our measures of persistence.
But in hindsight, this makes sense. Coding is not typing, and GitHub Copilot’s central value lies not in maximizing the number of lines of code the user enters. Instead, it lies in helping the user make the best progress towards their goals. A suggestion that serves as a useful template to tinker with may be as good as or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.
This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for this kind of tooling. Instead one could view code suggestions inside an IDE as more akin to a conversation with a chatbot. We see anecdotal evidence of this in comments posted about GitHub Copilot online (see Appendix E for examples) in which users talk about sequences of interactions. A conversation turn in this context consists of the prompt (the completion request) and the reply (the completion itself). The developer’s response to the completion is expressed through their subsequent changes, which are incorporated in the next prompt to the model. And there are clear programming parallels to factors such as specificity and repetition that have been identified as affecting human judgements of conversation quality [11]. Researchers have already investigated the benefits of natural language feedback to guide program synthesis [2] and so ours is not a radical proposal. But neither is it one we have seen followed.
In future work, we wish to further explore this analogy, borrowing ideas [16] from the evaluation of chatbots and natural language text generation.
Symbolic regression, the task of predicting the mathematical expression of a function from the observation of its values, is a difficult task which usually involves a two-step procedure: predicting the “skeleton” of the expression up to the choice of numerical constants, then fitting the constants by optimizing a non-convex loss function. The dominant approach is genetic programming, which evolves candidates by iterating this subroutine a large number of times. Neural networks have recently been tasked to predict the correct skeleton in a single try, but remain much less powerful. In this paper, we challenge this two-step procedure, and task a Transformer to directly predict the full mathematical expression, constants included. One can subsequently refine the predicted constants by feeding them to the non-convex optimizer as an informed initialization. We present ablations to show that this end-to-end approach yields better results, sometimes even without the refinement step. We evaluate our model on problems from the SRBench benchmark and show that our model approaches the performance of state-of-the-art genetic programming with several orders of magnitude faster inference.
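A minimal sketch of the constant-fitting subroutine that the two-step pipeline iterates, and that the end-to-end model’s refinement step reuses with an informed initialization; the skeleton a·sin(b·x) + c and the data are hypothetical, and SciPy’s BFGS stands in for whatever non-convex optimizer is actually used:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical target data: f(x) = 2*sin(3*x) + 1, observed on a grid.
x = np.linspace(-2, 2, 200)
y = 2 * np.sin(3 * x) + 1

# Skeleton predicted up to constants: a*sin(b*x) + c.
def skeleton(consts, x):
    a, b, c = consts
    return a * np.sin(b * x) + c

# Non-convex loss over the constants; success depends heavily on the initialization,
# which is why an informed initial guess from the model is valuable.
def loss(consts):
    return np.mean((skeleton(consts, x) - y) ** 2)

# An end-to-end model would supply its predicted constants here instead of (1, 1, 0).
result = minimize(loss, x0=[1.0, 1.0, 0.0], method="BFGS")
print(result.x, result.fun)
```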
We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.
…Prompt design is critical for performance: As seen in Table 2, providing the question alone results in a low 8.3% execution accuracy. There is a progressive improvement to 56.8% as schema information is introduced in ‘API Docs’, to 59.9% when valid SQL and foreign key information is used in ‘Create Table’, and to 67.0% when database content is introduced with ‘Create Table + Select 3’.
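The prompt formats are only named in the excerpt; the sketch below shows roughly what a ‘Create Table + Select 3’-style prompt might look like for a hypothetical schema (not one of the Spider databases):

```python
# Roughly what a "Create Table + Select 3" style prompt might look like
# (hypothetical schema; the benchmark databases differ).
prompt = """-- SQLite database schema
CREATE TABLE singer (
    singer_id INTEGER PRIMARY KEY,
    name TEXT,
    country TEXT
);
CREATE TABLE concert (
    concert_id INTEGER PRIMARY KEY,
    singer_id INTEGER,
    year INTEGER,
    FOREIGN KEY (singer_id) REFERENCES singer(singer_id)
);
/*
3 example rows:
SELECT * FROM singer LIMIT 3;
singer_id  name        country
1          Joe Sharp   Netherlands
2          Timbaland   United States
3          Rose White  France
*/

-- Question: How many singers are from France?
SELECT"""

# The model is expected to continue the prompt with something like:
#   COUNT(*) FROM singer WHERE country = 'France';
```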
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (eg. Codex (Chen et al 2021)) are not publicly available, leaving many questions about their model and data design decisions.
We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling.
We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex.
Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.
[blog; examples; FIQA benchmarks; Viable case-study] Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture.
In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models.
On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on MS MARCO, Natural Questions and TriviaQA benchmarks, respectively.
Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
Figure 1: Average performance of unsupervised cpt-text models of different sizes across 22 tasks consisting of linear-probe classification, text search, and sentence similarity tasks.
…we leverage naturally occurring paired data to construct training data with no explicit labels. Text embedding models are trained on paired text data where we consider neighboring pieces of text on the Internet as positive pairs. Code embedding models treat the top-level docstring in a function along with its implementation as a (text, code) pair. The training signal of the contrastive objective on its own is not sufficient to learn useful representations and we overcome this by initializing our model with other pretrained models (Brown et al 2020; Chen et al 2021). Finally, we find that it is critical to use a sufficiently large batch to achieve the optimal performance. We show that this simple recipe combining pre-trained model initialization, large-batch contrastive learning and training at scale, can produce text and code embeddings that possess a broad range of capabilities.
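A minimal sketch of a contrastive objective with in-batch negatives of the kind described here, written in PyTorch; the symmetric form and the temperature value are illustrative assumptions rather than details from the paper:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    # Similarity of every query against every positive in the batch:
    # diagonal entries are the true pairs, off-diagonals act as negatives,
    # so a larger batch means more (and harder) negatives per example.
    logits = q @ p.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric cross-entropy over both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Example with random embeddings standing in for encoder outputs.
q = torch.randn(256, 768)   # one side of each positive pair
p = torch.randn(256, 768)   # the corresponding paired texts (or code)
print(in_batch_contrastive_loss(q, p))
```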
We train a series of unsupervised text embedding models (cpt-text) of different sizes, ranging from 300M to 175B parameters, and observe a consistent performance improvement with increasing model sizes (Figure 1). On classification accuracy averaging across 7 linear-probe classification tasks in SentEval (Conneau & Kiela 2018), our largest unsupervised model achieves new state-of-the-art results with a relative improvement of 4% and 1.8% over the previous best unsupervised (Giorgi et al 2020) and supervised (Gao et al 2021) text embedding models, respectively.
…Next, we train code embedding models (cpt-code) using the same recipe. Our models learn via (text, code) pairs, extracted from open source code. We evaluate our model on CodeSearchNet (Husain et al 2020), a commonly used code search benchmark, where the task is to find the most relevant code snippet given a natural language query. Our models achieve new state-of-the-art results with a 20.8% relative improvement over the previous best result (Guo et al 2021). Unlike text embedding models, we observe no performance improvement on code search when increasing the number of parameters of cpt-code from 300M to 1.2B.
Finally, we experiment with fine-tuning our models on several supervised datasets and study the transfer learning performance. When fine-tuned on NLI (Natural Language Inference) datasets, we see a further boost in linear-probe classification, outperforming the previous best transfer method (Gao et al 2021) by 2.2%. On SST-2 sentiment classification (Socher et al 2013), we find that our representations are sufficiently descriptive that even a simple k-NN classifier achieves results comparable to a linear-probe classifier. Interestingly, zero-shot performance with our embeddings outperforms the supervised neural network models introduced along with the release of the SST-2 dataset. We also fine-tune the unsupervised model on MS-MARCO and evaluate it on a suite of zero-shot search tasks in the BEIR benchmark (Thakur et al 2021). In the transfer setting, our models achieve a 5.2% relative improvement over previous methods (Izacard et al 2021) and are comparable even with methods (Santhanam et al 2021; Formal et al 2021; Wang et al 2020) that demand substantially more computation at test time.
…3.4.1. Effect Of Batch Size: Our ablation study highlights the effect of the model’s batch size on the final performance. Table 9 compares the performance of S (300M) cpt-text model trained with different batch sizes on the NQ development set. Since we train with in-batch negatives, a larger batch increases the chances of having hard negatives in a batch, resulting in a substantial performance boost.
Table 9: Performance of the cpt-text 300M model on NQ dev set given different training batch sizes.
As artificial intelligence (AI) technologies become increasingly powerful and prominent in society, their misuse is a growing concern. In educational settings, AI technologies could be used by students to cheat on assignments and exams. In this paper we explore whether transformers can be used to solve introductory level programming assignments while bypassing commonly used AI tools to detect plagiarism. We find that a student using GPT-J [Wang & Komatsuzaki 2021] can complete introductory level programming assignments without triggering suspicion from MOSS [Aiken, 2000], a widely used plagiarism detection tool. This holds despite the fact that GPT-J was not trained on the problems in question and is not provided with any examples to work from. We further find that the code written by GPT-J is diverse in structure, lacking any particular tells that future plagiarism detection techniques may use to try to identify algorithmically generated code. We conclude with a discussion of the ethical and educational implications of large language models and directions for future research.
Symbolic regression, i.e. predicting a function from the observation of its values, is well-known to be a challenging task.
In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature.
We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Wolfram Mathematica functions for recurrence prediction. We also demonstrate that our float model is able to yield informative approximations of out-of-vocabulary functions and constants, eg. bessel0(x) ≈ (sin(x) + cos(x))/√(πx) and 1.644934 ≈ π²/6.
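Both reported approximations are easy to check numerically:

```python
import numpy as np
from scipy.special import j0

x = np.linspace(5, 50, 1000)
approx = (np.sin(x) + np.cos(x)) / np.sqrt(np.pi * x)
print(np.max(np.abs(j0(x) - approx)))   # small and shrinking for larger x (asymptotic approximation)
print(np.pi ** 2 / 6)                   # 1.6449340668..., matching the reported constant
```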
We demonstrate that a neural network pre-trained on text and fine-tuned on code solves Mathematics problems by program synthesis. We turn questions into programming tasks, automatically generate programs, and then execute them, perfectly solving university-level problems from MIT’s large Mathematics courses (Single Variable Calculus 18.01, Multivariable Calculus 18.02, Differential Equations 18.03, Introduction to Probability & Statistics 18.05, Linear Algebra 18.06, and Mathematics for Computer Science 6.042) as well as questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems specifically designed to assess mathematical reasoning. We explore prompt generation methods that enable Transformers to generate question solving programs for these subjects, including solutions with plots. We generate correct answers for a random sample of questions in each topic. We quantify the gap between the original and transformed questions and perform a survey to evaluate the quality and difficulty of generated questions. This is the first work to automatically solve, grade, and generate university-level Mathematics course questions at scale which represents a milestone for higher education.
[paper] We’ve fine-tuned GPT-3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers to questions online—it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy. We’re excited about developing more truthful AI, but challenges remain, such as coping with unfamiliar types of questions.
Language models like GPT-3 are useful for many different tasks, but have a tendency to “hallucinate” information when performing tasks requiring obscure real-world knowledge. To address this, we taught GPT-3 to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as “Search …”, “Find in page: …” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer.
The model is fine-tuned from GPT-3 using the same general methods we’ve used previously. We begin by training the model to copy human demonstrations, which gives it the ability to use the text-based browser to answer questions. Then we improve the helpfulness and accuracy of the model’s answers, by training a reward model to predict human preferences, and optimizing against it using either reinforcement learning or rejection sampling.
…Our models outperform GPT-3 on TruthfulQA and exhibit more favourable scaling properties. However, our models lag behind human performance, partly because they sometimes quote from unreliable sources (as shown in the question about ghosts above). We hope to reduce the frequency of these failures using techniques like adversarial training.
…Evaluating factual accuracy: …
However, this approach raises a number of questions. What makes a source reliable? What claims are obvious enough to not require support? What trade-off should be made between evaluations of factual accuracy and other criteria such as coherence? All of these were difficult judgment calls. We do not think that our model picked up on much of this nuance, since it still makes basic errors. But we expect these kinds of decisions to become more important as AI systems improve, and cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound. We also expect further considerations such as transparency to be important.
Eventually, having models cite their sources will not be enough to evaluate factual accuracy. A sufficiently capable model would cherry-pick sources it expects humans to find convincing, even if they do not reflect a fair assessment of the evidence. There are already signs of this happening (see the questions about boats above). We hope to mitigate this using methods like debate.
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web.
By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences.
This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
Figure 2: Human evaluations on ELI5 comparing against (a) demonstrations collected using our web browser, (b) the highest-voted answer for each question. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.
Figure 3: TruthfulQA results. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.
In this work we leverage existing solutions to these components: we outsource document retrieval to the Microsoft Bing Web Search API, and utilize unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3. Instead of trying to improve these ingredients, we focus on combining them using more faithful training objectives. Following Stiennon et al 2020, we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans.
We make 2 key contributions:
We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in an end-to-end fashion using general methods such as imitation learning and reinforcement learning.
We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.
…We use this data in 4 main ways: behavior cloning (ie. supervised fine-tuning) using the demonstrations, reward modeling using the comparisons, reinforcement learning against the reward model, and rejection sampling against the reward model. Our best model uses a combination of behavior cloning and rejection sampling. We also find reinforcement learning to provide some benefit when inference-time compute is more limited.
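A minimal sketch of the rejection-sampling (best-of-n) step, assuming hypothetical `policy` and `reward_model` interfaces rather than the paper’s actual code:

```python
def best_of_n(question, references, policy, reward_model, n=64):
    """Sample n candidate answers and return the one the reward model scores highest.

    `policy.sample` and `reward_model.score` are hypothetical interfaces: the real
    system samples from a fine-tuned GPT-3 policy and scores with a separately
    trained reward model predicting human preferences.
    """
    candidates = [policy.sample(question, references) for _ in range(n)]
    scores = [reward_model.score(question, references, answer) for answer in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```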
…We evaluate our best model in 3 different ways. First, we compare our model’s answers to answers written by our human demonstrators on a held-out set of questions. Our model’s answers are preferred 56% of the time, demonstrating human-level usage of the text-based browser. Second, we compare our model’s answers (with references stripped, for fairness) to the highest-voted answer provided by the ELI5 dataset. Our model’s answers are preferred 69% of the time. Third, we evaluate our model on TruthfulQA, an adversarial dataset of short-form questions. Our model’s answers are true 75% of the time, and are both true and informative 54% of the time, outperforming our base model (GPT-3), but falling short of human performance.
Figure 1: An observation from our text-based web-browsing environment, as shown to human demonstrators (left) and models (right). The web page text has been abridged for illustrative purposes.
…Environment design: …For this approach, we designed a text-based web-browsing environment. The language model is prompted with a written summary of the current state of the environment, including the question, the text of the current page at the current cursor location, and some other information (see Figure 1(b)). In response to this, the model must issue one of the commands given in Table 1, which performs an action such as running a Bing search, clicking on a link, or scrolling around. This process is then repeated with a fresh context (hence, the only memory of previous steps is what is recorded in the summary).
While the model is browsing, one of the actions it can take is to quote an extract from the current page. When this is performed, the page title, domain name and extract are recorded to be used later as a reference. Browsing then continues until either the model issues a command to end browsing, the maximum number of actions has been reached, or the maximum total length of references has been reached. At this point, as long as there is at least one reference, the model is prompted with the question and the references, and must compose its final answer
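A minimal sketch of that browsing loop; the command strings and the `model`/`browser` interfaces below are illustrative assumptions based on the description above, not the paper’s implementation:

```python
def browse_and_answer(question, model, browser, max_actions=100, max_reference_chars=4000):
    """Sketch of the text-based browsing loop: the model sees a summary of the browser
    state, issues one command per step, and its quotes accumulate as references."""
    references = []
    for _ in range(max_actions):
        observation = browser.summarize_state(question, references)  # hypothetical interface
        command = model.next_command(observation)                    # e.g. "Search ...", "Quote: ..."
        if command.startswith("Quote:"):
            references.append({"title": browser.page_title,
                               "domain": browser.domain,
                               "extract": command[len("Quote:"):].strip()})
        elif command.startswith("End"):          # model decides to stop browsing
            break
        else:
            browser.execute(command)             # search, click a link, scroll, find in page, ...
        if sum(len(r["extract"]) for r in references) >= max_reference_chars:
            break                                # maximum total length of references reached
    # If at least one reference was collected, compose the final answer from them.
    return model.compose_answer(question, references) if references else None
```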
Table 2: An answer produced by our 175B best-of-64 model to a randomly-chosen question from the ELI5 test set (not cherry-picked). The full text of the references can be found in Appendix J, along with answers from our human demonstrators and the ELI5 dataset. Further samples are available.
Question: Why did we decide that certain words were “bad” and shouldn’t be used in social settings?

Answer: We have all been taught that there are certain words that should not be used in public, at least not without good reason. But why do some words fall into this category while others do not? It turns out that there is not a single answer to this question, as each word has a different history and carries different meanings in different cultures and families [2]. However, there are some general categories that “bad” words tend to fall into. “Bad” words generally relate to parts of life that we don’t like talking about in public, like bathroom functions, or negative ways of talking about people’s religion, skin color, ability, or gender [3]. Often, people say bad words when they are experiencing strong emotions, and those words are linked to our emotions [3].

References (titles only):
[1, 2, 3] Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org)
[4] On Words: ‘Bad’ Words and Why We Should Study Them | UVA Today (news.virginia.edu)
[5] The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com)
Figure 5: Preference of the 175B best-of-n BC model over the BC model. The validation RM prediction is obtained using the estimator described in Appendix I, and predicts human preference well in this setting. The shaded region represents ±1 standard error.
…Our results are shown in Figure 2. Our best model, the 175B best-of-64 model, produces answers that are preferred to those written by our human demonstrators 56% of the time. This suggests that the use of human feedback is essential, since one would not expect to exceed 50% preference by imitating demonstrations alone (although it may still be possible, by producing a less noisy policy). The same model produces answers that are preferred to the reference answers from the ELI5 dataset 69% of the time. This is a substantial improvement over Krishna et al 2021, whose best model’s answers are preferred 23% of the time to the reference answers, although they use substantially less compute than even our smallest model.
Figure 6: BC scaling, varying the proportion of the demonstration dataset and parameter count of the policy.
…The combination of RL and rejection sampling also fails to offer much benefit over rejection sampling alone. One possible reason for this is that RL and rejection sampling are optimizing against the same reward model, which can easily be overoptimized (especially by RL, as noted above). In addition to this, RL reduces the entropy of the policy, which hurts exploration. Adapting the RL objective to optimize rejection sampling performance is an interesting direction for future research. It is also worth highlighting the importance of carefully tuning the BC baseline for these comparisons. As discussed in Appendix E, we tuned the number of BC epochs and the sampling temperature using a combination of human evaluations and reward model score. This alone closed much of the gap we originally saw between BC and RL.
Figure 7: RM scaling, varying the proportion of the comparison dataset and parameter count of the reward model.
…Scaling trends with dataset size and parameter count are shown in Figures 6 and 7. For dataset size, doubling the number of demonstrations increased the policy’s reward model score by about 0.13, and doubling the number of comparisons increased the reward model’s accuracy by about 1.8%. For parameter count, the trends were noisier, but doubling the number of parameters in the policy increased its reward model score by roughly 0.09, and doubling the number of parameters in the reward model increased its accuracy by roughly 0.4%.
Figure 8: Best-of-n scaling, varying the parameter count of the policy and reward model together, as well as the number of answers sampled.
…For rejection sampling, we analyzed how to trade off the number of samples against the number of model parameters for a given inference-time compute budget (see Figure 8). We found that it is generally compute-efficient to use some amount of rejection sampling, but not too much. The models for our main evaluations come from the Pareto frontier of this trade-off: the 760M best-of-4 model, the 13B best-of-16 model, and the 175B best-of-64 model. [cf. Jones 2021]
Large language models, prompted with in-context examples, can perform semantic parsing with little training data. They do better when we formulate the problem as paraphrasing into canonical utterances, which cast the underlying meaning representations into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. More recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. Since accurately modeling code requires understanding of executable semantics, such models may prove more adept at semantic parsing. In this paper, we test this hypothesis and find that Codex performs better at semantic parsing than equivalent GPT-3 models. We find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because the meaning representations used in semantic parsing are structured similarly to code.
Large pre-trained language models such as GPT-3, Codex, and Google’s language model are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code.
In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs.
Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code for the Python Pandas API from multi-modal inputs [text description, test cases, code].
Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
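A minimal sketch of the kind of post-processing described here: filtering model-generated Pandas candidates against user-supplied test cases (the real Jigsaw system also performs program repair and learns from feedback, which this sketch omits):

```python
import pandas as pd

def first_passing_candidate(candidates, test_cases):
    """Return the first candidate function that reproduces every (input, expected output) pair.

    `candidates` are model-generated functions taking a DataFrame and returning a DataFrame;
    `test_cases` are (input_df, expected_df) pairs supplied by the user.
    """
    for candidate in candidates:
        try:
            if all(candidate(inp.copy()).equals(expected) for inp, expected in test_cases):
                return candidate
        except Exception:
            continue  # a crashing candidate simply fails validation
    return None

# Hypothetical usage: two candidate completions for "drop rows with missing values".
candidates = [
    lambda df: df.fillna(0),     # wrong: fills instead of dropping
    lambda df: df.dropna(),      # right
]
inp = pd.DataFrame({"a": [1.0, None, 3.0]})
expected = inp.dropna()
print(first_passing_candidate(candidates, [(inp, expected)]))
```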
Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests in textual merge conflicts where the merge fails to produce code, or semantic merge conflicts where the merged code results in compiler or test breaks. Resolving these conflicts for large code projects is expensive because it requires developers to manually identify the sources of conflict and correct them.
In this paper, we explore the feasibility of automatically repairing merge conflicts (both textual and semantic) using k-shot learning with large neural language models (LM) such as GPT-3 [but not Codex/Copilot, or finetuned GPT-3]. One of the challenges in leveraging such language models is to fit the examples and the queries within a small prompt (2,048 tokens). We evaluate LMs and k-shot learning for 2 broad applications: (1) textual and semantic merge conflicts for a divergent fork, Microsoft Edge, and (2) textual merge conflicts for a large number of JavaScript projects in GitHub.
Our results are mixed: on the one hand, LMs provide state-of-the-art (SOTA) performance on semantic merge conflict resolution for Edge compared to earlier symbolic approaches; on the other hand, LMs do not yet obviate the benefits of fine-tuning neural models (when sufficient data is available) or the design of special-purpose domain-specific languages (DSLs) for restricted patterns for program synthesis.
Figure 8: Accuracy Comparison for Gmerge on Resolving Semantic Merge Conflicts
…The evaluation shows that including a shot in the input to the language model substantially improves the results across all prompt structures. This matches our predictions, because a shot not only clearly pinpoints the current task to the model but also provides an example of the expected output. Moreover, the evaluation shows that providing more conflict-related code changes as context improves the accuracy of the model. It further shows that, with the heuristics, Gmerge achieved the highest accuracy of 64.6%.
GPT-3 and GPT-J each output one resolution per model trial. In our experiment, we repeatedly query GPT-3, and if the resolution is produced in any of the trials, we mark the merge conflict as “resolved”. We evaluated how the number of trials affects the model accuracy. Figure 8 shows that the overall accuracy of both GPT-3 and GPT-J increased with the number of model trials. For example, for GPT-3, 10 independent trials achieve an accuracy of 64.6%, in contrast to 37.2% with only one trial. Compared to the GPT-3 model, we only observed a modest accuracy gain in the GPT-J model. StringMerge and Transformation.Text have no accuracy gain because they produce a deterministic result in every run.
Are larger language models more accurate than smaller ones? One benefit of Gmerge is that its k-shot approach does not require expensive task-specific fine-tuning. Thus, Gmerge can benefit from large-scale autoregressive language models. In this section, we demonstrate that the size of the model has a substantial impact on Gmerge’s task-specific accuracy. Figure 8 shows that the overall accuracy of GPT-3 increased more sharply than that of GPT-J as the number of model queries increases. Approximately 30% of the additional merge conflicts are resolved if we query the GPT-3 model multiple times. In contrast, for GPT-J, only 5% of the additional merge conflicts can be resolved in this setting.
We solve university level probability and statistics questions by program synthesis using OpenAI’s Codex, a Transformer trained on text and fine-tuned on code. We transform course problems from MIT’s 18.05 Introduction to Probability and Statistics and Harvard’s STAT110 Probability into programming tasks. We then execute the generated code to get a solution. Since these course questions are grounded in probability, we often aim to have Codex generate probabilistic programs that simulate a large number of probabilistic dependencies to compute its solution. Our approach requires prompt engineering to transform the question from its original form to an explicit, tractable form that results in a correct program and solution. To estimate the amount of work needed to translate an original question into its tractable form, we measure the similarity between original and transformed questions. Our work is the first to introduce a new dataset of university-level probability and statistics problems and solve these problems in a scalable fashion using the program synthesis capabilities of large language models.
We solve MIT’s Linear Algebra 18.06 course and Columbia University’s Computational Linear Algebra COMS3251 courses with perfect accuracy by interactive program synthesis. This surprisingly strong result is achieved by turning the course questions into programming tasks and then running the programs to produce the correct answers. We use OpenAI Codex with zero-shot learning, without providing any examples in the prompts, to synthesize code from questions. We quantify the difference between the original question text and the transformed question text that yields a correct answer. Since none of the COMS3251 questions are available online, the model is not overfitting. We go beyond just generating code for questions with numerical answers by interactively generating code that also produces visually pleasing plots as output. Finally, we automatically generate new questions given a few sample questions which may be used as new course content. This work is a significant step forward in solving quantitative math problems and opens the door for solving many university-level STEM courses by machine.
There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described ‘AI pair programmer’, GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs—and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (eg. those from MITRE’s “Top 25” list). We explore Copilot’s performance on three distinct code generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,692 programs. Of these, we found ~40% to be vulnerable.
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes.
Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23,914 problems that evaluate the ability of the models to synthesize code from more complex text.
On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6% of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8% accuracy.
Going further, we study the model’s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model’s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
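The repeated-sampling results are conventionally summarized with the pass@k metric; the Codex paper’s unbiased estimator, computed from n samples per problem of which c pass the unit tests, can be sketched as follows (the example numbers are hypothetical):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator of pass@k: probability that at least one of k samples drawn
    (without replacement) from n generated samples passes, given c passing samples."""
    if n - c < k:
        return 1.0  # too few failing samples: any draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 100 samples generated, 15 of them pass the unit tests.
print(pass_at_k(100, 15, 1))    # 0.15
print(pass_at_k(100, 15, 10))   # ~0.82
```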
I limited the investigation to Python suggestions with a cutoff on May 7, 2021 (the day we started extracting that data). That left 453,780 suggestions spread out over 396 “user weeks”, i.e. calendar weeks during which a user actively used GitHub Copilot on Python code.
…For most of GitHub Copilot’s suggestions, our automatic filter didn’t find any substantial overlap with the code used for training. But it did bring 473 cases to our attention. Removing the first bucket (cases that look very similar to other cases) left me with 185 suggestions. Of these, 144 got sorted out in buckets 2—4. This left 41 cases in the last bucket, the “recitations”, in the meaning of the term I have in mind.
That corresponds to 1 recitation event every 10 user weeks (95% confidence interval: 7—13 weeks, using a Poisson test).
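The post does not spell out the interval computation, but a standard exact Poisson (Garwood) interval over the 41 recitation events in 396 user-weeks reproduces the reported range:

```python
from scipy.stats import chi2

events, user_weeks = 41, 396   # recitation events and Python user-weeks from the post

# Exact (Garwood) 95% confidence interval for a Poisson count.
count_lo = chi2.ppf(0.025, 2 * events) / 2
count_hi = chi2.ppf(0.975, 2 * (events + 1)) / 2

print(user_weeks / events)                            # ~9.7 user-weeks per recitation
print(user_weeks / count_hi, user_weeks / count_lo)   # roughly 7 to 13 user-weeks
```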
…This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
…The answer is obvious: sharing the pre-filtering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether. This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
[If you find ‘1 possible copy every 10 man-weeks’ concerning, you’d better not look too hard at the overlap of your own codebase with StackExchange/GitHub or the licensing requirements (especially attribution) of all code therein…]
GitHub, GitHub’s parent company Microsoft and OpenAI have teamed up to deliver a tool that comes up with source code for programmers to use as they work.
The system can make recommendations in almost any programming language, although it works best with the popular JavaScript, Python and TypeScript languages.
OpenAI will release the underlying online service for other companies to use this summer.
…The new software makes coding faster, Friedman said in an interview last week. Hundreds of developers at GitHub have been using the Copilot feature all day while coding, and the majority of them are accepting suggestions and not turning the feature off, Friedman said.
…“You don’t want to go read Twilio’s API documentation. It knows all that stuff. It’s actually quite reliable at it”, he said. Brockman calls this work last-mile programming, and he said that having computers take care of it leads to speed improvements. Microsoft’s chief technology officer, Kevin Scott, has seen that happen firsthand. “It can save me from having to dive through a whole bunch of documentation to get a tool to do a thing that I know it’s capable of doing, and that is so good for productivity”, he said. “I can’t even tell you the number of hours I’ve wasted trying to figure out the right way to do a relatively prosaic thing, just navigating the complexity of these tools.”
…It supports almost every programming language, but it’s been designed to work best with JavaScript, Python and TypeScript, Friedman said. GitHub Copilot will first appear in Microsoft’s Visual Studio Code, a free open-source product, and Microsoft plans to incorporate it into the commercial Visual Studio product in the future…The underlying technology won’t be only Microsoft’s to use. OpenAI will release the Codex model this summer for third-party developers to weave into their own applications, Brockman said.
…Engineers fed the model “many, many terabytes of public source code out there”, Friedman said.
…Over the next few months, people experimented with the model to see what it could do, both useful and silly—for instance, one engineer made a website that could design a button that looked like a watermelon. Brockman reached out to Friedman, as he was running a key destination where millions of programmers work on code and things proceeded from there.
Software language models have achieved promising results predicting code completion usages, and several industry studies have described successful IDE integrations. Recently, accuracy in autocompletion prediction improved 12.8% from training on a real-world dataset collected from programmers’ IDE activity. But what if limited examples of IDE autocompletion in the target programming language are available for model training?
In this paper, we investigate the efficacy of pretraining autocompletion models on non-IDE, non-autocompletion, and different-language example code sequences.
We find that these unsupervised pretrainings improve model accuracy by over 50% on very small fine-tuning datasets and over 10% on 50k labeled examples. We confirm the real-world impact of these pretrainings in an online setting through A/B testing on thousands of IDE autocompletion users, finding that pretraining is responsible for increases of up to 6.63% in autocompletion usage.
Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study “When Code Completion Fails: a Case Study on Real-World Completions” demonstrates that these results may not translate to improvements in real-world performance.
To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers’ actual autocompletion usage.
Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models.
Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking the syntactic-level structure of code, such as the abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of “where-the-value-comes-from” between variables. Such a semantic-level structure is neat and does not bring the unnecessarily deep hierarchy of the AST, a property which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, the majority of integrated development environments only support completion of methods and APIs, or arguments.
In this paper, we introduce IntelliCode Compose—a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages a state-of-the-art generative transformer model trained on 1.2 billion lines of source code in the Python, C#, JavaScript and TypeScript programming languages.
IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook.
Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language.
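Edit similarity here is the usual Levenshtein-based measure between the predicted completion and the code the developer actually wrote; a minimal sketch (the paper’s exact normalization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(predicted: str, actual: str) -> float:
    """1 minus normalized edit distance; 1.0 means the completion matched exactly."""
    if not predicted and not actual:
        return 1.0
    return 1.0 - levenshtein(predicted, actual) / max(len(predicted), len(actual))

print(edit_similarity("return x ** 2", "return x ** 2 + 1"))  # ~0.76
```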
Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas.
To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task.
We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.
Code super-optimization is the task of transforming any given program to a more efficient version while preserving its input-output behaviour. In some sense, it is similar to the paraphrase problem from natural language processing where the intention is to change the syntax of an utterance without changing its semantics. Code optimization has been the subject of years of research that has resulted in the development of rule-based transformation strategies that are used by compilers. More recently, however, a class of stochastic search-based methods has been shown to outperform these strategies. This approach involves repeated sampling of modifications to the program from a proposal distribution, which are accepted or rejected based on whether they preserve correctness, and the improvement they achieve. These methods, however, neither learn from past behaviour nor do they try to leverage the semantics of the program under consideration. Motivated by this observation, we present a novel learning-based approach for code super-optimization. Intuitively, our method works by learning the proposal distribution using unbiased estimators of the gradient of the expected improvement. Experiments on benchmarks comprising automatically generated as well as existing (“Hacker’s Delight”) programs show that the proposed method is able to significantly outperform state-of-the-art approaches for code super-optimization.
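A minimal sketch of the stochastic accept/reject loop the abstract describes; the paper’s contribution is to learn the proposal distribution, which is just a caller-supplied function in this sketch, and `cost` would combine correctness on test cases with measured performance:

```python
import math
import random

def stochastic_search(program, propose, cost, steps=10_000, beta=1.0):
    """Metropolis-style search over program rewrites: sample a modification from the
    proposal distribution and accept or reject it based on the change in cost."""
    current, current_cost = program, cost(program)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = propose(current)          # sample a rewrite of the current program
        candidate_cost = cost(candidate)
        # Always accept improvements; accept regressions with exponentially decaying probability.
        if random.random() < math.exp(-beta * max(0.0, candidate_cost - current_cost)):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best
```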
We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network’s predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.