ARC-AGI: definition + examples

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark introduced by François Chollet in 2019 (as ARC, in the paper "On the Measure of Intelligence"; it was rebranded ARC-AGI with the 2024 ARC Prize) to evaluate a model's ability to acquire new skills from a few examples, rather than relying on extensive training data or domain-specific knowledge. Unlike typical AI benchmarks (e.g., ImageNet, GLUE, MMLU) that reward pattern matching over large datasets, ARC-AGI probes fluid intelligence: the capability to reason about novel problems, form abstractions, and solve tasks using only core-knowledge concepts such as objectness, counting, topology, and geometric transformations.

How it works: The public dataset consists of 800 puzzles (400 training, 400 evaluation), plus held-out semi-private and private sets (100 tasks each) used for competitions and verified scoring. Each puzzle presents a handful of input-output demonstration pairs (typically two to five); each grid is between 1×1 and 30×30 cells, and every cell takes one of ten colors. The solver must infer the underlying rule from these demonstrations and apply it to one or more unseen test inputs. Rules are never explicitly stated and can involve operations like object detection, symmetry, rotation, reflection, translation, scaling, color mapping, pattern completion, or hierarchical composition. Solutions are evaluated by exact match on the output grid; partial credit is not given. The benchmark assumes only a small set of "core knowledge" priors, so success requires general reasoning rather than memorization of a task distribution.
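
The public tasks ship as JSON, one file per puzzle, with "train" and "test" lists of input-output pairs; each grid is a list of rows of integers 0-9. Below is a minimal sketch of loading a task and applying the exact-match rule; the file path and the naive "copy the input" solver are illustrative only:

    import json

    Grid = list[list[int]]  # a grid is a list of rows; cells are colors 0-9

    def load_task(path: str) -> dict:
        """Read one puzzle in the public dataset's JSON schema:
        {"train": [{"input": Grid, "output": Grid}, ...], "test": [...]}"""
        with open(path) as f:
            return json.load(f)

    def score_task(task: dict, predictions: list[Grid]) -> float:
        """Exact match only: a predicted grid earns credit iff its shape and
        every cell match the gold output. There is no partial credit."""
        golds = [pair["output"] for pair in task["test"]]
        hits = sum(pred == gold for pred, gold in zip(predictions, golds))
        return hits / len(golds)

    # Illustrative usage (the path is hypothetical):
    # task = load_task("training/some_task.json")
    # preds = [pair["input"] for pair in task["test"]]  # naive "copy input" guess
    # print(score_task(task, preds))                    # almost always 0.0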

Why it matters: ARC-AGI is widely regarded as one of the hardest AI benchmarks, and no system has yet matched human-level generalization under the competition's compute constraints. The best verified score on the private test set from the 2024 ARC Prize is 53.5%, achieved by the team "the ARChitects" with a fine-tuned open-weights LLM and test-time training. Outside the compute-limited Kaggle track, the strongest published pipelines combine large language models (LLMs) with program synthesis: Ryan Greenblatt's widely cited 2024 approach is a neuro-symbolic pipeline in which GPT-4o proposes thousands of candidate transformation programs in Python, an executor tests them on the demonstration pairs, and a scoring function selects the best fit, reaching roughly 50% on the public evaluation set. Both figures sit well below the commonly cited human baseline of ~80-85%, and nearly every public task is solvable by at least one human. The benchmark has spurred research into few-shot reasoning, program induction, and compositional generalization.
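
A minimal sketch of that propose-execute-select loop, with the LLM call stubbed out by fixed geometric candidates (all function names here are illustrative, not taken from any published entry):

    from typing import Callable

    Grid = list[list[int]]
    Program = Callable[[Grid], Grid]

    def propose_programs(task: dict, n: int = 3) -> list[Program]:
        """Stand-in for the LLM step: real pipelines prompt a model with the
        demonstration pairs and parse the Python it returns. Stubbed here
        with three fixed geometric candidates."""
        return [
            lambda g: g,                                 # identity
            lambda g: [row[::-1] for row in g],          # mirror left-right
            lambda g: [list(r) for r in zip(*g)][::-1],  # rotate 90° CCW
        ][:n]

    def demos_passed(program: Program, task: dict) -> int:
        """Execute one candidate on every demonstration pair, counting exact
        matches; generated code that crashes simply scores zero."""
        passed = 0
        for pair in task["train"]:
            try:
                passed += program(pair["input"]) == pair["output"]
            except Exception:
                pass
        return passed

    def solve(task: dict) -> list[Grid]:
        """Select the best-fitting candidate and apply it to the test inputs."""
        best = max(propose_programs(task), key=lambda p: demos_passed(p, task))
        return [best(pair["input"]) for pair in task["test"]]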

When used vs alternatives: ARC-AGI is used when the goal is to measure general reasoning and abstraction, not domain-specific knowledge. Alternatives include:

  • MMLU / GPQA: measure knowledge recall and domain-specific reasoning.
  • BIG-Bench / HELM: broader task coverage, but performance still tracks knowledge absorbed during large-scale pretraining rather than on-the-spot skill acquisition.
  • SWE-bench: measures software engineering skill on real GitHub issues.
  • Mini-ARC / ConceptARC: smaller, easier variants of the original corpus, useful for rapid iteration.

Common pitfalls:

  • Overfitting to the public set: many solutions exploit biases in the 400 public puzzles that don't generalize to the private set.
  • Treating ARC-AGI as a pure LLM challenge: off-the-shelf LLMs prompted directly (without program synthesis or test-time training) have historically scored below 20%, because they struggle with precise grid manipulation.
  • Confusing ARC-AGI with AGI: high scores are necessary but not sufficient for AGI; the benchmark tests only one type of reasoning.
  • Ignoring the priors requirement: systems that memorize transformations from a large database fail on novel puzzles.
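
One cheap guard against the first and last pitfalls above is to validate a candidate rule by leave-one-out fitting over the demonstration pairs, so that a "solver" which merely memorizes input-to-output mappings is rejected. A hedged sketch, where fit stands in for whatever induction procedure a system actually uses:

    from typing import Callable

    Grid = list[list[int]]
    Program = Callable[[Grid], Grid]
    Fitter = Callable[[list[dict]], Program]  # fits a rule to demo pairs

    def generalizes(fit: Fitter, task: dict) -> bool:
        """Leave-one-out check: refit on all but one demonstration pair and
        require the induced program to predict the held-out pair exactly.
        A memorized lookup table fails as soon as a pair is withheld."""
        demos = task["train"]
        for i, held_out in enumerate(demos):
            program = fit(demos[:i] + demos[i + 1:])
            try:
                if program(held_out["input"]) != held_out["output"]:
                    return False
            except Exception:
                return False
        return True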

Current state of the art (2026): The ARC Prize competition (2024–2025) has driven progress from roughly 33% to the mid-50s on the private test set (MindsAI reported 55.5% but was ineligible after declining to open-source its solution; the top verified score was 53.5%). Notable approaches include:

  • Program synthesis, from DreamCoder-style DSL search to LLMs generating candidate Python programs.
  • Neuro-symbolic ensembles that combine multiple LLM proposals with execution-based verification.
  • Generative approaches that treat grids as images and in-paint the answer, which have so far lagged behind program search.
  • Test-time training and fine-tuning on synthetic puzzles generated by random rule composition (see the sketch after this list).
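
A minimal sketch of the synthetic-puzzle idea from the last bullet: compose randomly chosen primitive transformations into a hidden rule and emit tasks in the public dataset's JSON schema (the primitive library and size parameters are illustrative):

    import random

    Grid = list[list[int]]

    # A toy primitive library; real generators use far richer operations:
    # recoloring, object moves, tiling, cropping, and so on.
    PRIMITIVES = {
        "mirror":    lambda g: [row[::-1] for row in g],
        "flip":      lambda g: g[::-1],
        "transpose": lambda g: [list(r) for r in zip(*g)],
    }

    def random_rule(depth: int, rng: random.Random):
        """Compose `depth` randomly chosen primitives into one hidden rule."""
        names = [rng.choice(sorted(PRIMITIVES)) for _ in range(depth)]
        def rule(g: Grid) -> Grid:
            for name in names:
                g = PRIMITIVES[name](g)
            return g
        return rule

    def synth_task(rng: random.Random, n_demos: int = 3) -> dict:
        """Emit one puzzle in ARC's {"train": [...], "test": [...]} schema."""
        rule = random_rule(depth=rng.randint(1, 3), rng=rng)
        def pair() -> dict:
            h, w = rng.randint(2, 6), rng.randint(2, 6)
            grid = [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]
            return {"input": grid, "output": rule(grid)}
        return {"train": [pair() for _ in range(n_demos)], "test": [pair()]}

    # e.g., a fine-tuning corpus of 10,000 synthetic tasks:
    # tasks = [synth_task(random.Random(seed)) for seed in range(10_000)]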

Despite progress, no system has crossed the 60% threshold on the private set, and human-level generalization remains elusive. The benchmark continues to influence research in compositional generalization, causal reasoning, and foundation model evaluation.

Examples

  • The top verified 2024 ARC Prize entry, from the team "the ARChitects", scored 53.5% on the private test set using a fine-tuned open-weights LLM with test-time training; closed-model APIs such as GPT-4 cannot be used in the offline Kaggle environment.
  • OpenAI's o3 model (December 2024) scored 75.7% on the semi-private ARC-AGI test set in its low-compute configuration and 87.5% under high-compute conditions, though the result drew scrutiny over its very high per-task compute cost and the fact that o3 had been trained on the public ARC training set.
  • Ryan Greenblatt's GPT-4o pipeline, which samples thousands of candidate Python programs per puzzle and keeps those that reproduce the demonstrations, reached roughly 50% on the public evaluation set without any task-specific training.
  • Scaffolds around Anthropic's Claude 3.5 Sonnet that add a chain-of-thought reasoning loop with self-verification have been reported to reach accuracy in the high 30s (percent) on the public puzzles.
  • The benchmark has anchored multiple competitions, including the 2020 Kaggle ARC challenge, Lab42's ARCathon (2022–2023), and the Kaggle-hosted ARC Prize (2024–2025), with prize pools exceeding $1M.

Related terms

Abstraction and Reasoning Corpus · Few-Shot Learning · Program Synthesis · Neuro-Symbolic AI · Compositional Generalization

FAQ

What is ARC-AGI?

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark of 800 public visual-reasoning puzzles designed to measure general fluid intelligence in AI, requiring few-shot learning and compositional abstraction from minimal examples.

How does ARC-AGI work?

Each puzzle shows a few input-output grid pairs that demonstrate a hidden transformation rule. A solver must infer the rule from those demonstrations alone and apply it to one or more unseen test inputs; answers are scored by exact match on the output grid, with no partial credit. See "How it works" above for the full breakdown.

Where is ARC-AGI used in 2026?

As of 2026, ARC-AGI is used mainly in competition and evaluation settings: the Kaggle-hosted ARC Prize, verified scoring of frontier reasoning models such as OpenAI's o3, and academic research on program synthesis, test-time training, and compositional generalization.