BIG-Bench Mistake: Implementational Details That Are Important

1 Jun 2024


(1) Gladys Tyen, University of Cambridge, Dept. of Computer Science & Technology, ALTA Institute; work done during an internship at Google Research;

(2) Hassan Mansoor, Google Research;

(3) Victor Carbune, Google Research;

(4) Peter Chen, Google Research; equal leadership contribution;

(5) Tony Mak, Google Research; equal leadership contribution.


A Implementational details

A.1 3-shot CoT prompting to generate traces for BIG-Bench Mistake

We use PaLM 2 L (Unicorn) to generate the traces used in BIG-Bench Mistake. All traces are generated at temperature = 0.

Our prompts and examples can be found at WHGTyen/BIG-Bench-Mistake. Our prompts are based on chain-of-thought prompts in the BIG-Bench Hard dataset (Suzgun et al., 2022), with four main changes:

1. Example CoT traces in the prompt are broken up into smaller steps (typically one sentence per step), so that mistake location information is more precise.

2. Following Yao et al. (2022), each step in the prompt is signposted with “Thought 1:”, “Thought 2:”, etc. This allows us to refer to the number of the step when prompting for mistake location.

3. For the logical deduction task, we find that the question-mark notation used in the original prompt is often inconsistent, making it difficult for annotators to determine whether a question mark constitutes a mistake, since the correctness of a question mark depends on its interpretation. To minimise such ambiguity, the question-mark notation is rewritten as text.

4. For the multistep arithmetic task, one of the prompt examples is altered to increase the length of the equation. This is because the BIG-Bench Hard dataset (from which the prompts are taken) only uses equations of a specific length, whereas our dataset contains equations of a variety of lengths, in accordance with the original BIG-Bench dataset (Srivastava et al., 2022).
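Changes 1 and 2 above can be sketched as follows. This is an illustrative reconstruction, not the released tooling: the actual prompts were prepared by hand, and the sentence-splitting regex here is a simplification of real sentence segmentation.

```python
import re


def split_into_steps(trace: str) -> list[str]:
    """Split a CoT trace into one-sentence steps and signpost each with
    'Thought N:' so that mistake locations can be referenced by step number.

    A sketch only: splitting on sentence-final punctuation followed by
    whitespace is cruder than what manual prompt preparation allows.
    """
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.?!])\s+", trace.strip())
                 if s.strip()]
    return [f"Thought {i}: {s}" for i, s in enumerate(sentences, start=1)]
```

For example, a three-sentence trace becomes three numbered "Thought" steps, so an annotator can report "the mistake is in Thought 2" rather than quoting a span of text.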

Following Yao et al. (2022), we use the newline as the stop token, so that each generation call produces exactly one step. We algorithmically prepend “Thought N:” to each step, which allows us to split up steps in a clear and systematic way. We stop generating once an answer is reached, which is detected using the following regex: (?<=[Tt]he answer is).*$
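The generation loop described above can be sketched as follows. The `generate` callable is a hypothetical stand-in for the model API (PaLM 2 L in the paper), assumed to accept a prompt, a stop sequence, and a temperature; only the loop structure and the answer-detection regex come from the text.

```python
import re

# Answer-detection regex, taken verbatim from the text above.
ANSWER_RE = re.compile(r"(?<=[Tt]he answer is).*$")


def generate_trace(generate, prompt, max_steps=50):
    """Generate a CoT trace one step at a time.

    `generate` is a hypothetical model-call function taking
    (prompt, stop, temperature); the newline stop token means each
    call yields exactly one step. `max_steps` is an assumed safety cap.
    """
    steps = []
    for n in range(1, max_steps + 1):
        # "Thought N:" is prepended algorithmically, not generated.
        step = generate(prompt + f"Thought {n}:", stop="\n", temperature=0.0)
        steps.append(f"Thought {n}:{step}")
        prompt += f"Thought {n}:{step}\n"
        if ANSWER_RE.search(step):  # stop once an answer is reached
            break
    return steps
```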

A.2 3-shot prompting to identify mistakes in BIG-Bench Mistake

As described in section 3, we explore three different methods of prompting for mistake location: direct trace-level prompting, direct step-level prompting, and CoT step-level prompting. We use 3-shot prompting for all methods, and our prompts and examples can be found at WHGTyen/BIG-Bench-Mistake.

Our prompts follow OpenAI’s chat completion format. All results were obtained with temperature = 0 and no stop tokens.
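A minimal sketch of how a direct step-level mistake-location query might be assembled in this chat format. The message wording, the system instruction, and the fields of `few_shot_examples` are all illustrative assumptions; the actual 3-shot prompts are those in the WHGTyen/BIG-Bench-Mistake repository.

```python
def build_step_level_messages(few_shot_examples, trace_steps, step_index):
    """Build a chat-format prompt asking whether one step is mistaken.

    `few_shot_examples` is a list of dicts with hypothetical keys
    "question" and "answer"; `trace_steps` are "Thought N:" strings.
    """
    messages = [{"role": "system",
                 "content": "You will be shown a reasoning trace. "
                            "Say whether the given step contains a mistake."}]
    for example in few_shot_examples:  # the 3-shot examples
        messages.append({"role": "user", "content": example["question"]})
        messages.append({"role": "assistant", "content": example["answer"]})
    # Show the trace only up to the step being judged.
    trace = "\n".join(trace_steps[: step_index + 1])
    messages.append({"role": "user",
                     "content": f"{trace}\nIs Thought {step_index + 1} correct?"})
    return messages
```

Trace-level prompting would instead show the full trace once and ask for the location of the first mistake, and CoT step-level prompting would additionally elicit reasoning before the verdict; both fit the same message structure.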

This paper is available on arxiv under CC 4.0 license.