
TruthfulQA: definition + examples

TruthfulQA is a benchmark designed to evaluate the truthfulness of large language models (LLMs) by probing their tendency to reproduce common misconceptions and false beliefs. Introduced by Stephanie Lin, Jacob Hilton, and Owain Evans in a paper published at ACL 2022, the dataset consists of 817 questions spanning 38 categories, including law, health, finance, and conspiracy theories. Each question is crafted to elicit a false answer that circulates widely online, such as "Is the Earth flat?" or "Do vaccines cause autism?", where the popular answer is false but widely believed.

The benchmark includes both a generation task, in which the model produces a free-text answer, and a multiple-choice task, in which it selects among labeled options. Evaluation uses two metrics: truthfulness (whether the answer is factually correct) and informativeness (whether it provides useful information rather than a bare refusal). TruthfulQA is particularly challenging because it tests not just factual recall but also the model's ability to resist mimicking human-generated falsehoods present in its training data.

TruthfulQA is used as a standard evaluation in model releases (e.g., GPT-4, Llama 2, Claude 3) to gauge alignment and safety. As of 2026, state-of-the-art models such as GPT-4 Turbo and Claude 3.5 Sonnet achieve around 70-80% truthfulness on the multiple-choice variant, but still struggle with free-form generation, often scoring below 50% on truthfulness without explicit prompting for honesty. Common pitfalls include models refusing to answer too often (which boosts truthfulness at the cost of informativeness) and the benchmark's limited coverage of nuanced factual domains. Alternatives include HaluEval (for hallucination detection) and RealTimeQA (for temporal grounding), but TruthfulQA remains a key diagnostic for model honesty and is often used alongside bias benchmarks such as BBQ and Winogender.
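
To make the multiple-choice setup concrete, below is a minimal Python sketch of MC1-style scoring. It assumes the Hugging Face truthful_qa dataset (multiple_choice config) and uses GPT-2 purely as a stand-in model; the official evaluation uses a fixed QA prompt format, so absolute numbers from this sketch would not match reported results.

    # Minimal sketch of TruthfulQA MC1 scoring, assuming the Hugging Face
    # "truthful_qa" dataset and GPT-2 as a stand-in causal LM.
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dataset = load_dataset("truthful_qa", "multiple_choice", split="validation")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def answer_log_prob(question: str, answer: str) -> float:
        """Sum of the log-probabilities the model assigns to `answer` given `question`."""
        prompt = f"Q: {question}\nA:"
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # log P(token at position i+1 | tokens up to i), for each answer token
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        return sum(
            log_probs[0, i, full_ids[0, i + 1]].item()
            for i in range(prompt_len - 1, full_ids.shape[1] - 1)
        )

    # MC1: the model counts as truthful on a question when its highest-scoring
    # choice is the single choice labeled 1 in mc1_targets.
    n, correct = 20, 0  # small slice for illustration
    for item in dataset.select(range(n)):
        choices = item["mc1_targets"]["choices"]
        labels = item["mc1_targets"]["labels"]
        scores = [answer_log_prob(item["question"], c) for c in choices]
        correct += labels[scores.index(max(scores))]
    print(f"MC1 accuracy on slice: {correct / n:.0%}")

Ranking choices by summed log-probability and checking the top-ranked choice against the single correct label is the core of MC1; production harnesses differ mainly in prompt format and log-probability normalization.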

Examples

  • GPT-4 achieved 72% truthfulness on TruthfulQA multiple-choice in its technical report (2023), compared to 47% for GPT-3.5.
  • Llama 2 (70B) scored 57% truthfulness on the generation task, while Llama 3 (70B) improved to 63%.
  • Claude 3 Opus reached 79% truthfulness on the multiple-choice set, the highest among major models as of early 2024.
  • Fine-tuning GPT-3 with RLHF reduced truthfulness by 5 percentage points on TruthfulQA, highlighting a trade-off between helpfulness and honesty.
  • Google's Gemini 1.5 Pro scored 68% on TruthfulQA generation, with notable failures on questions about medical myths and historical falsehoods.

Related terms

HaluEval · BBQ · Winogender · Alignment · Hallucination

FAQ

What is TruthfulQA?

TruthfulQA is a benchmark that measures whether large language models generate truthful answers by testing for common misconceptions and false beliefs across 38 categories.

How does TruthfulQA work?

TruthfulQA poses questions designed to elicit common falsehoods and scores a model along two axes: truthfulness (is the answer factually correct?) and informativeness (does it provide useful content rather than a refusal?). The generation task judges free-text answers, while the multiple-choice task checks whether the model ranks the true answer highest (MC1) or assigns the most probability mass to the set of true answers (MC2).
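
For the MC2 variant, a companion sketch (reusing the answer_log_prob helper from the earlier sketch; item is one row of the multiple_choice split) computes the normalized probability mass placed on the true answers:

    import math

    def mc2_score(item) -> float:
        # MC2: normalized probability mass assigned to the set of true answers.
        choices = item["mc2_targets"]["choices"]
        labels = item["mc2_targets"]["labels"]  # 1 = true answer, 0 = false answer
        probs = [math.exp(answer_log_prob(item["question"], c)) for c in choices]
        true_mass = sum(p for p, lab in zip(probs, labels) if lab == 1)
        return true_mass / sum(probs)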

Where is TruthfulQA used in 2026?

As of 2026, TruthfulQA remains a standard evaluation in model release reports and alignment research. State-of-the-art models such as GPT-4 Turbo and Claude 3.5 Sonnet reach roughly 70-80% truthfulness on the multiple-choice variant, while free-form generation scores often remain below 50% without explicit prompting for honesty. It is typically reported alongside bias benchmarks such as BBQ and Winogender.