Scarecrow

A framework for scrutinizing machine text

Notice: In the dataset, the special tokens _SEP_ and _QUOTE_ in the error explanations stand for a comma (,) and a double quotation mark (") respectively.
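
If you load the explanations yourself, the two placeholder tokens can be mapped back to punctuation with a plain string replacement. The snippet below is a minimal sketch, not part of the released code: the function name `restore_punctuation` and the sample string are made up for illustration, and it assumes the tokens appear verbatim in the explanation text.

```python
# Placeholder tokens named in the notice, mapped to the characters they stand for.
TOKEN_MAP = {"_SEP_": ",", "_QUOTE_": '"'}


def restore_punctuation(explanation: str) -> str:
    """Replace the _SEP_ and _QUOTE_ placeholders with , and " respectively."""
    for token, char in TOKEN_MAP.items():
        explanation = explanation.replace(token, char)
    return explanation


# Hypothetical explanation string, invented for this example (not taken from the dataset).
raw = "Need a comma after _QUOTE_me_QUOTE__SEP_ not a period."
print(restore_punctuation(raw))
```

Whether whitespace surrounds the tokens in the actual files is not stated here, so check a few rows before relying on the output spacing.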

Abstract

Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.

To facilitate research into these complex error types, we introduce a new structured, crowd-sourced error annotation schema called Scarecrow. The error categories used in Scarecrow (such as redundancy, commonsense errors, and incoherence) were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema that covers the error phenomena found in real machine-generated text.

We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text, amounting to over 41k spans, each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2 Small through the largest GPT-3. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique. Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.

Error Types

This table summarizes the 10 error types that annotators choose from to identify problems in text.

| Error Type | Definition | Example |
| --- | --- | --- |
| Grammar and Usage | This category of errors includes missing words, extra words, and incorrect or out-of-order words. | A PhD student from the University of Kent in the UK, claims to have discovered a clever way to explain the positive emoticons in cats. |
| Redundant | Redundant text repeats itself. Sometimes, you will see the exact word or phrase repeated. Other times, the same idea is repeated using different words. | They then made decisions based on Kondo’s instructions, to the extent that they created de-cluttered spaces and got rid of clutter and clutter-filled spaces. |
| Off-prompt | The prompt is a piece of text written by a human that the AI is supposed to continue. Sometimes, however, the AI will write a phrase or sentence that is completely unrelated to the prompt. Other times, the text might be related, but it contradicts the prompt. | Prompt: China sets new record for Economic Growth / Text: The Chinese economy fell 10% this month, the third such loss this year. |
| Self-Contradiction | Occurs when the AI writes something that contradicts another piece of text that the AI had previously written. | McDonald's is considering a design which will replace the cardboard packaging. Mr Gore-Cotter said: "We recognise the concern around waste. We are now looking at a new design that minimises the plastic bag." |
| Incoherent | Text that doesn't fit into the above categories, but still just doesn’t make any sense at all. | Cats naturally show anxiety and fear by at times breaking apart different parts of the brain in an attempt to keep the others from escaping. |
| Technical Jargon | Jargon or specific words from a field you’re not familiar with. | In Chile, an 800-megawatt photovoltaic plant was built for a record low cost of $129 per megawatt-hour last year. |
| Needs Google | A fact or figure that you suspect might be true, but you would need to Google it to be sure. | It was promoted by Dr. Michael Fanning, the Executive Director of the Foundation for Mental Health Awareness, Inc. |
| Bad Math | Bad math includes problems with basic math (+ - ✖️ ÷), problems converting fixed units, and problems converting currencies that are wildly impossible (e.g., 1$ = 10£). | One account, @Iain_Rowling1, had over 500,000 followers at one point, but in just four days they fell by around half – some 4,000. |
| Commonsense | Text that violates our everyday basic understanding of the world. | The picture is from high above the South Pole, where close to 100,000 Astronauts live and work. |
| Encyclopedic | Text that is just plain factually wrong, where the correct information is written down in a fact table somewhere, like a textbook, a Wikipedia sidebar, or an encyclopedia. | Japanese Prime Minister Justin Trudeau said he will be halting all imports and exports until the current situation can be contained. |
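
For readers who want to work with the schema programmatically, the ten error types and the per-span fields named in the abstract (error category, severity, explanation, optional antecedent span) could be modeled roughly as follows. This is a sketch under stated assumptions: the class and field names, the use of character offsets, and the severity value in the example are all hypothetical and do not describe the released file format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ErrorType(Enum):
    """The ten error types listed in the table above."""
    GRAMMAR_USAGE = "Grammar and Usage"
    REDUNDANT = "Redundant"
    OFF_PROMPT = "Off-prompt"
    SELF_CONTRADICTION = "Self-Contradiction"
    INCOHERENT = "Incoherent"
    TECHNICAL_JARGON = "Technical Jargon"
    NEEDS_GOOGLE = "Needs Google"
    BAD_MATH = "Bad Math"
    COMMONSENSE = "Commonsense"
    ENCYCLOPEDIC = "Encyclopedic"


@dataclass
class ErrorSpan:
    """One annotated span; all field names here are hypothetical."""
    start: int                 # assumed: character offset where the span begins
    end: int                   # assumed: character offset just past the span
    error_type: ErrorType
    severity: int              # the abstract mentions severity; its scale is not given here
    explanation: str
    antecedent: Optional[Tuple[int, int]] = None  # only set where relevant


# Build a span over the Commonsense example from the table.
text = ("The picture is from high above the South Pole, "
        "where close to 100,000 Astronauts live and work.")
phrase = "close to 100,000 Astronauts live and work"
start = text.index(phrase)
span = ErrorSpan(
    start=start,
    end=start + len(phrase),
    error_type=ErrorType.COMMONSENSE,
    severity=2,  # made-up value for illustration
    explanation="Made-up explanation for illustration.",
)
print(span.error_type.value, "->", text[span.start:span.end])
```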

Browse Dataset

We annotate over 1,300 paragraphs with 10 annotators each. To preview an annotation, choose a model and then a prompt. The continuation will appear on the left, and the annotations on the right. You can scroll the annotations window. Select an annotator, then hover over the errors they marked to highlight where they are in the continuation.

Example id: 843

Prompt (human-written):

Ali Carter is into his first ever Masters final (Picture: Getty Images) Ali Carter has taken his fairy-tale run at the Masters all the way to the final with a 6-3 win over Shaun Murphy in the semis at Alexandra Palace.

Continuation written by GPT-3 (argmax, frequency penalty = 1):

The 37-year-old Carter, who was diagnosed with cancer in 2012 and has been given the all clear, will face either Judd Trump or Mark Selby in Sunday's final. Carter had to come from behind to beat Murphy after the world number four raced into a 3-0 lead. But Carter won five of the next six frames to set up a meeting with either Trump or Selby on Sunday afternoon. 'I'm over the moon,' said Carter. 'It's been an amazing week for me and I'm just so happy that I've got through.'


Needs Google (1): Check that he actually had cancer in 2012
Needs Google (1): Check if these are the other finalists
Needs Google (1): Check if Murphy was number four in the world for this sport
Grammar / Usage (1): worlds number four instead of world.
Technical Jargon (2): I'm not familiar with this sport so I don't know how accurate this is

Needs Google (1): Is he 37 and been diagnosed with cancer?
Off-prompt (1): Prompt is about making it into the finals not his health
Needs Google (1): Are these competitors in the Masters.
Technical Jargon (3): I don't know about score keeping in this sport, so these numbers may be right, but they don't seem right to me since he won 6-3.
Grammar / Usage (1): Need a comma after me.

Needs Google (1): How old is Carter?
Needs Google (1): This is a factual statement that needs google to confirm if he was diagnosed with cancer and which year it was.
Needs Google (1): Is Murphy the world number four?

Wrong: Encyclopedic (2): this doesn't give the final score of 6 to 3 that the prompt says happened.

no problems found

Redundant (1): information is repeated

Needs Google (1): Is the age correct?
Needs Google (2): Is this true?
Needs Google (1): I need to google the competitor's names.
Needs Google (2): The writing is vague. Who is the world number four?
Needs Google (1): I need to google Carter's winning history.

Grammar / Usage (1): The phrase "set up a meeting with" needs to be replaced by something like "win a chance to challenge".

Needs Google (2): This is about a game and sports, so I need Google to verify.
Grammar / Usage (1): a hyphen should go between these two words

Needs Google (2): Is this information accurate?
Grammar / Usage (2): But shouldn't begin a sentence.
Needs Google (2): Is this information accurate?
Technical Jargon (2): I'm not sure what six frames means? It is probably specific to this sport.
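
Each annotation in this preview follows the pattern `Label (n): explanation`. The sketch below parses single annotations of that form and tallies labels. It is only an illustration of the display format shown on this page, not the dataset's file format; the names `DisplayedAnnotation` and `parse_annotation` are hypothetical, and the number in parentheses is assumed (not confirmed here) to be the severity rating mentioned in the abstract.

```python
import re
from collections import Counter
from typing import NamedTuple


class DisplayedAnnotation(NamedTuple):
    label: str
    rating: int        # assumed to be the severity rating
    explanation: str


# "Label (n): explanation"; the label itself may contain spaces, "/", ":" or "-".
PATTERN = re.compile(r"^(?P<label>.+?) \((?P<rating>\d+)\): (?P<explanation>.*)$")


def parse_annotation(line: str) -> DisplayedAnnotation:
    """Parse one displayed annotation; raise ValueError for other lines."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not in 'Label (n): explanation' form: {line!r}")
    return DisplayedAnnotation(
        match["label"], int(match["rating"]), match["explanation"]
    )


# A few annotations copied from the preview for example 843.
lines = [
    "Needs Google (1): Check if these are the other finalists",
    "Wrong: Encyclopedic (2): this doesn't give the final score of 6 to 3 "
    "that the prompt says happened.",
    "Redundant (1): information is repeated",
    "Needs Google (2): Is this true?",
]
parsed = [parse_annotation(line) for line in lines]
counts = Counter(annotation.label for annotation in parsed)
print(counts.most_common())
```

An entry such as "no problems found" does not match the pattern and raises `ValueError`, so callers would need to handle that case separately.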

Annotation Tool

Here you can demo the annotation tool used by crowd workers to annotate the dataset. Click and drag on any words in the continuation to trigger the annotation popup. As you make annotations, they will appear below the continuation, where you can interact with them further. (Works on desktop computers only.)

System Prompt (Human Written):

Police, the canine unit and members of the community began searching for Luke.

Continuation (Human or AI Model Written):

The search was called off at around 11 p.m. and resumed at 7 a.m. on Thursday. Police said the search was suspended again at around 9 p.m. on Thursday. Police said the search resumed at 7 a.m. on Friday. Police said the search was suspended again at around 9 p.m. on Friday. Police said the search resumed at 7 a.m. on Saturday. Police said the search was suspended again at around 9 p.m. on Saturday. Police said the search resumed at 7 a.m. on Sunday.

Citation

If our work inspires you, please consider citing our paper.

@misc{dou2021scarecrow,
    title={Scarecrow: A Framework for Scrutinizing Machine Text},
    author={Yao Dou and Maxwell Forbes and Rik Koncel-Kedziorski and Noah A. Smith and Yejin Choi},
    year={2021},
    eprint={2107.01294},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}