A single tweet from Symbolica's Kimmo Kärkkäinen has ignited discussion in the AI benchmarking community. According to the post, Symbolica's Agentica SDK achieved a score of 36.08% on the newly released ARC-AGI-3 benchmark. The claim is that this result was achieved in a single day and at "a fraction of what brute-forcing it with a frontier model would cost."
What Happened
On May 31, 2025, Kimmo Kärkkäinen, co-founder of Symbolica, announced on X (formerly Twitter) that the company's Agentica SDK had scored 36.08% on the ARC-AGI-3 benchmark. The ARC-AGI-3 benchmark is a successor to the Abstraction and Reasoning Corpus (ARC) challenge, designed by François Chollet to test an AI system's ability to perform abstract reasoning on novel tasks, a key hurdle on the path to artificial general intelligence (AGI).
The core claim is not just the score, but the method and efficiency. Kärkkäinen states the result was achieved using an "agentic approach" that "dramatically outperformed throwing raw compute at the problem." This implies a contrast between Symbolica's structured, multi-step reasoning process and simply prompting a large language model (LLM) like GPT-4 or Claude 3 Opus with massive context and compute.
Context: The ARC-AGI Benchmark
The original ARC challenge, created in 2019, presents a set of unique visual reasoning puzzles. Each puzzle consists of a few input-output examples, and the system must infer the underlying transformation rule to produce the correct output for a new, unseen input. It is notoriously difficult for current AI systems, with top scores historically hovering in the 20-30% range for the public leaderboard, though recent proprietary models have claimed higher results.
ARC-AGI-3 is the latest iteration. A high score on this benchmark suggests a system can perform robust, few-shot abstract reasoning without relying on memorized patterns—a capability that remains a significant weakness for even the most advanced LLMs.
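The ARC task format described above can be sketched in a few lines. This is a toy illustration only: the grids, candidate rules, and function names are invented here for clarity, and real ARC transformations are far richer than the two rules shown.

```python
# Toy illustration of the ARC task format (not an actual ARC-AGI-3 task):
# given a few input -> output grid pairs, infer the transformation rule,
# then apply it to a new, unseen input.

def mirror(grid):
    """Reflect each row left-to-right."""
    return [row[::-1] for row in grid]

def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

# A real solver would search a vast space of rules; we hand-pick two.
CANDIDATE_RULES = {"mirror": mirror, "transpose": transpose}

def infer_rule(examples):
    """Return the first candidate rule consistent with every example pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name, rule
    return None, None

# Few-shot examples: the hidden rule here is "mirror".
examples = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0], [0, 5]], [[0, 5], [5, 0]]),
]
name, rule = infer_rule(examples)
print(name, rule([[7, 8], [9, 0]]))  # mirror [[8, 7], [0, 9]]
```

The hard part, and the reason ARC resists memorization, is that the space of possible rules is effectively unbounded, so a system must compose abstractions on the fly rather than look up a known pattern.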
The tweet concludes with Kärkkäinen stating he is "Really eagier [sic] to see it being verified," acknowledging the community's need for independent validation of such claims. The parenthetical "(p.s.: I saw the debate about harnessing)" likely refers to the ongoing debate over agent "harnesses": the scaffolding, tooling, and orchestration wrapped around a model, and how much of a benchmark score should be credited to the harness rather than the underlying model.
The Claimed Advantage: Agentic vs. Brute-Force
The most provocative part of the announcement is the claimed cost efficiency. "Brute-forcing" with a frontier model typically involves using a model like GPT-4 with extensive chain-of-thought prompting, search, and iteration—a process that consumes significant API costs and compute time. Symbolica asserts its Agentica SDK, presumably a framework for building and orchestrating specialized AI agents, can achieve a competitive result far more cheaply.
This aligns with a growing industry trend: moving from monolithic, expensive model calls to optimized, purpose-built agentic workflows that break complex problems into smaller, cheaper steps. If verified, it would be a strong argument for the economic viability of agentic architectures over pure scale.
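The cost contrast at stake can be made concrete with a deliberately simplified sketch. Everything here is hypothetical: the prices, token counts, step names, and model names are invented for illustration, and nothing is known publicly about how the Agentica SDK actually works.

```python
# Hypothetical cost sketch: one monolithic frontier-model call vs. an
# agentic pipeline of smaller, cheaper steps. All figures are invented.

from dataclasses import dataclass, field

@dataclass
class CostTracker:
    calls: list = field(default_factory=list)

    def call(self, model, tokens, cost_per_1k):
        """Record a model call and return its cost in dollars."""
        cost = tokens / 1000 * cost_per_1k
        self.calls.append((model, cost))
        return cost

    @property
    def total(self):
        return sum(cost for _, cost in self.calls)

# Brute force: one huge call with massive context on an expensive model.
brute = CostTracker()
brute.call("frontier-large", tokens=120_000, cost_per_1k=0.06)

# Agentic: decompose into small specialized steps on a cheaper model.
agentic = CostTracker()
for step in ("perceive_grid", "propose_rules", "verify_rule", "apply_rule"):
    agentic.call(f"small-model/{step}", tokens=3_000, cost_per_1k=0.001)

print(f"brute force: ${brute.total:.2f}, agentic: ${agentic.total:.3f}")
```

The economic argument is that orchestration overhead is small next to the savings from routing each sub-task to the cheapest model that can handle it; whether the quality holds up is exactly what independent verification would need to show.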
What We Don't Know Yet
Crucially, this is an announcement, not a published result. As of this writing:
- The ARC-AGI-3 benchmark details and full leaderboard are not publicly available for cross-checking.
- Symbolica has not released a technical paper, methodology, or cost analysis.
- The baseline "brute-force" cost and the specific "fraction" of cost claimed are undefined.
- The architecture of the Agentica SDK and the agents used is unspecified.
Independent verification, as mentioned by Kärkkäinen himself, is the essential next step to assess the true significance of this result.
Agentic.news Analysis
This announcement sits at the intersection of three critical trends we've been tracking. First, it's a direct challenge in the abstract reasoning arena, a domain where companies like Google DeepMind (with its Alpha series and Gemini) and Anthropic (with Claude 3) have invested heavily. A verified 36% score would be a notable competitive entry, suggesting Symbolica's niche architectural research is bearing fruit.
Second, it aggressively promotes the agentic paradigm. This follows a wave of investment and product launches focused on AI agents, from OpenAI's GPTs and Assistants API to startups like Cognition Labs (Devin) and Magic.dev. Symbolica's claim that its agentic system beats a brute-force frontier model on cost-performance is a potent marketing message aimed at developers tired of soaring API bills. It suggests the next phase of AI utility may be won by clever orchestration, not just larger models.
Third, the call for verification is apt. The AI field has a history of bold benchmark claims that later face scrutiny over evaluation methods or generalization. The credibility of ARC-AGI-3 itself will be tested by how it handles submissions like this. If Symbolica's result holds, it could validate ARC-AGI-3 as a meaningful new yardstick and force other players to disclose their agentic strategies and costs, not just their final scores.
Frequently Asked Questions
What is the ARC-AGI-3 benchmark?
ARC-AGI-3 is the latest version of a benchmark designed to test an AI system's ability for abstract reasoning and core knowledge generalization. It presents unique visual puzzle tasks where the system must infer a transformation rule from a few examples and apply it to a new input. High performance indicates a move beyond pattern recognition towards more human-like reasoning.
What is Symbolica's Agentica SDK?
Based on the announcement, the Agentica SDK appears to be Symbolica's framework for building and deploying AI agents. While technical details are not public, the claim suggests it uses an orchestrated, multi-step agentic approach to solve complex problems like ARC-AGI-3, as opposed to making a single, expensive call to a large frontier model.
How significant is a 36.08% score on ARC-AGI-3?
Without a public leaderboard or historical context for ARC-AGI-3, it's difficult to say precisely. However, given the extreme difficulty of the original ARC challenge, where scores above 30% were considered very strong, a 36% score on a more advanced version would be a highly competitive result. Its significance is amplified by the claim of achieving it cost-effectively in a single day.
Has this result been verified?
No. As explicitly noted in the source tweet, this is an initial announcement awaiting independent verification. The AI research community typically requires results to be reproduced on a public benchmark with a detailed methodology before accepting them as established fact. The next step is for Symbolica to publish its method and for others to test the Agentica SDK on the official ARC-AGI-3 evaluation.