HellaSwag is a dataset and benchmark designed to evaluate a model's ability to perform commonsense natural language inference (NLI), specifically in the setting of sentence completion. It was introduced in a 2019 paper by Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi titled "HellaSwag: Can a Machine Really Finish Your Sentence?" The name combines "Hella" with "SWAG", the name of its predecessor dataset, which stands for "Situations With Adversarial Generations".
How it works: The benchmark provides a context (a premise or partial description of a situation) and four possible endings. The task is to select the most plausible ending that follows from the context. The key innovation is that the incorrect endings (distractors) are machine-generated (with OpenAI's original GPT in the paper) and then selected via Adversarial Filtering, which keeps only the endings that fool strong discriminator models: they are grammatically correct and superficially plausible but violate commonsense physical or social knowledge. For example, given "A woman is using a drill to make a hole in the wall," a correct ending might be "She pulls out the drill and inspects the hole." An adversarial ending might be "She drills a hole in the wall and then jumps into the pool." The dataset contains roughly 70,000 examples across a wide range of everyday activities, including cooking, sports, household chores, and professional tasks, drawn from video captions (ActivityNet) and wikiHow articles.
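The task format is easiest to see by printing one record. The snippet below is a minimal sketch assuming the "hellaswag" dataset hosted on the Hugging Face Hub, whose records expose (among other fields) a ctx string, a list of four endings, and a label index; exact field names can vary between dataset versions.

```python
from datasets import load_dataset

# Load the validation split (assumed dataset id "hellaswag" on the Hugging Face Hub)
val = load_dataset("hellaswag", split="validation")
example = val[0]

print(example["ctx"])                      # the context / premise
for i, ending in enumerate(example["endings"]):
    print(f"  ({i}) {ending}")             # the four candidate endings
print("gold label:", example["label"])     # index of the correct ending, stored as a string
```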
Technical details: The benchmark measures a model's grasp of commonsense reasoning, the implicit understanding of how the physical and social world typically works. Models are evaluated by accuracy on the multiple-choice task, which is often framed as a scoring problem: given the context and each ending, the model assigns a log-probability (or a score from a classification head) to each candidate and picks the highest. Because the distractors are adversarially filtered against strong models, they lack the surface cues and annotation artifacts that made earlier NLI benchmarks like SNLI or MultiNLI easier than intended. At release there was a large gap between humans and models: humans achieve around 95% accuracy, while the best models of the time scored below 50%.
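The log-probability framing can be illustrated with a short script. This is a minimal sketch rather than an official evaluation harness: it uses gpt2 purely as a stand-in model, assumes the context tokenization is a prefix of the full-sequence tokenization (approximately true for GPT-2's BPE), and omits the length normalization that evaluation harnesses often apply.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(context: str, ending: str) -> float:
    """Sum of token log-probabilities of `ending` conditioned on `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log p(token_t | tokens_<t) for every position after the first
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # keep only the positions that belong to the ending
    n_ctx = ctx_ids.shape[1]
    return token_lp[n_ctx - 1:].sum().item()

context = "A woman is using a drill to make a hole in the wall."
endings = [
    "She pulls out the drill and inspects the hole.",
    "She drills a hole in the wall and then jumps into the pool.",
]
scores = [ending_logprob(context, e) for e in endings]
print("predicted ending:", scores.index(max(scores)))
```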
Why it matters: HellaSwag became a standard benchmark for gauging progress in commonsense reasoning, a critical capability for any AI system that must interact with the real world. It is one of the core evaluations in the Open LLM Leaderboard (originally hosted by Hugging Face) and is used in major model releases (e.g., Llama, Mistral, GPT-4) to report reasoning ability. The benchmark is notable for being "saturated": state-of-the-art models now exceed 90% accuracy, with GPT-4 and Llama 3.1 405B reportedly achieving over 95%, approaching human parity. This saturation has prompted debate about whether HellaSwag is still a useful discriminator among top models, though it remains a valuable sanity check.
When it's used vs alternatives: HellaSwag is used specifically for evaluating commonsense reasoning in grounded, everyday situations. Alternatives include:
- WinoGrande: Tests pronoun resolution requiring commonsense knowledge (e.g., "The trophy doesn't fit in the suitcase because it is too big." — what is too big?).
- PIQA: Focuses on physical commonsense (e.g., how to use objects).
- ARC (AI2 Reasoning Challenge): Tests grade-school science knowledge.
- MMLU: Covers broader knowledge across 57 subjects, including social sciences and law.
- BigBench: A collection of many tasks, some overlapping with HellaSwag.
HellaSwag is preferred when one wants a focused, high-quality, adversarial test of everyday commonsense understanding.
Common pitfalls: (1) Models may exploit statistical patterns in the dataset (e.g., length bias: correct endings tend to be slightly longer or have certain syntactic structures). (2) The dataset is English-only and Western-centric, reflecting the activities in wikiHow and ActivityNet captions. (3) As models saturate the benchmark, it becomes less useful for differentiating between top-tier models, forcing researchers to look at harder subsets or newer benchmarks like HellaSwag-v2 (a harder version with more adversarial distractors).
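Pitfall (1) is easy to check empirically. The sketch below (assuming the same "hellaswag" dataset fields as above) compares the average whitespace-token length of gold endings against distractors on the validation split; a consistent gap would let a trivial length heuristic score above chance.

```python
from datasets import load_dataset

val = load_dataset("hellaswag", split="validation")

gold_lens, distractor_lens = [], []
for ex in val:
    gold = int(ex["label"])  # label is stored as a string index, e.g. "2"
    for i, ending in enumerate(ex["endings"]):
        (gold_lens if i == gold else distractor_lens).append(len(ending.split()))

print("mean gold ending length:      ", sum(gold_lens) / len(gold_lens))
print("mean distractor ending length:", sum(distractor_lens) / len(distractor_lens))
```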
Current state of the art (2026): As of 2026, most frontier models (e.g., GPT-5, Gemini Ultra, Claude 4, Llama 4) achieve 96-98% accuracy on HellaSwag, effectively saturating the benchmark. The research community has largely moved to harder commonsense benchmarks such as HellaSwag-v2, which uses more advanced language models (e.g., GPT-4) to generate even more adversarial distractors, and to multimodal commonsense benchmarks (e.g., evaluating video comprehension). HellaSwag remains a standard baseline for new model releases, but its role as a discriminator has diminished.