A Miami startup claims it processed 12 million tokens through its LLM for $8. The same job costs $2,600 on Anthropic's Claude Opus 4.6, per the company.
Key facts
- Startup claims 12M tokens processed for $8 total
- Same job on Claude Opus 4.6 estimated at $2,600
- Claims 325x cost reduction over Anthropic
- Claude Opus 4.6 supports 200K-token context window
- No paper, code, or benchmarks released yet
A Miami-based startup, whose name has not been disclosed in the available reporting, claims it ran a 12-million-token inference job on its own large language model for $8. That would be a 325x cost reduction compared to the $2,600 it says the same input would cost on Anthropic's Claude Opus 4.6. According to Towards AI, the company says it solved the quadratic attention scaling problem that has constrained transformer context windows since Vaswani et al. (2017), enabling linear-time inference over arbitrarily long sequences.
The Cost Comparison
Anthropic's Claude Opus 4.6, the company's most capable model, supports a 200K-token context window and, per the figure cited here, costs $75 per million input tokens. [Anthropic] Scaling that to 12 million tokens, 60x the native context limit, would require chunking, retrieval-augmented generation, or multiple API calls, driving the cost to roughly $2,600 according to the startup's estimate. At the quoted rate, 12 million input tokens billed once would come to $900, so the $2,600 figure evidently assumes additional overhead from such workarounds; the startup has not published a breakdown. The startup claims its model processes the full 12 million tokens in a single forward pass for $8, implying a per-token cost roughly 0.3% of the $2,600 estimate.
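The arithmetic behind this comparison can be checked with a short Python sketch. It uses only the figures quoted in this article. The function and variable names are illustrative, and the flat per-million-token billing model is a simplifying assumption rather than a description of Anthropic's actual billing.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of sending `tokens` input tokens once at a flat per-million rate (assumed model)."""
    return tokens / 1_000_000 * usd_per_million


# Figures as quoted in the article; none are independently verified here.
JOB_TOKENS = 12_000_000          # size of the claimed inference job
CONTEXT_LIMIT = 200_000          # quoted Claude Opus 4.6 context window
QUOTED_RATE = 75.0               # quoted USD per million input tokens
STARTUP_COST = 8.0               # startup's claimed total cost
STARTUP_ESTIMATE = 2_600.0       # startup's estimate for the same job on Claude

# Cost if every input token were billed exactly once at the quoted rate.
single_pass_usd = input_cost_usd(JOB_TOKENS, QUOTED_RATE)

# Minimum number of calls needed if each call is filled to the context limit
# (ceiling division; ignores overlap between chunks and output tokens).
min_calls = -(-JOB_TOKENS // CONTEXT_LIMIT)

# Headline ratio and its inverse, as used in the article.
cost_ratio = STARTUP_ESTIMATE / STARTUP_COST
startup_share = STARTUP_COST / STARTUP_ESTIMATE

# How far the startup's estimate exceeds the single-pass figure.
overhead_factor = STARTUP_ESTIMATE / single_pass_usd

print(f"single pass at quoted rate: ${single_pass_usd:,.2f}")
print(f"minimum calls at context limit: {min_calls}")
print(f"claimed cost ratio: {cost_ratio:.0f}x")
print(f"startup cost as share of estimate: {startup_share:.2%}")
print(f"estimate vs. single pass: {overhead_factor:.2f}x")
```

The last value shows how much of the $2,600 estimate is not explained by one pass over the input at the quoted rate. Whatever accounts for the remainder (chunk overlap, repeated calls, output tokens) is not specified in the reporting.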
The Technical Claim
The startup says it cracked a decade-old limit on quadratic attention scaling. Standard transformer attention computes pairwise interactions between all tokens, yielding O(n²) memory and compute costs that make 12M-token contexts impractical on current hardware. The company claims a novel sparse attention mechanism reduces this to O(n), though it has not released a preprint, model weights, or benchmark results on standard long-context evaluations such as RULER or Needle-in-a-Haystack.
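To give a sense of the scaling gap, the sketch below counts attention scores for a 12M-token sequence under dense attention and under a fixed-window sparse pattern. The window size, the 2-byte score width, and the function names are hypothetical choices for illustration. The startup has not disclosed its mechanism, so this is not a model of its approach.

```python
def dense_attention_scores(n: int) -> int:
    """Pairwise scores when every token attends to every token: n * n."""
    return n * n


def windowed_attention_scores(n: int, window: int) -> int:
    """Scores when each token attends to at most `window` tokens: linear in n."""
    return n * min(window, n)


N_TOKENS = 12_000_000     # sequence length from the article
BYTES_PER_SCORE = 2       # assumption: 16-bit scores
WINDOW = 4_096            # hypothetical fixed attention window

dense = dense_attention_scores(N_TOKENS)
windowed = windowed_attention_scores(N_TOKENS, WINDOW)

# Storage for one head in one layer if the score matrix were materialized naively.
dense_terabytes = dense * BYTES_PER_SCORE / 1e12
windowed_gigabytes = windowed * BYTES_PER_SCORE / 1e9

print(f"dense scores:    {dense:,} (~{dense_terabytes:,.0f} TB per head per layer)")
print(f"windowed scores: {windowed:,} (~{windowed_gigabytes:,.1f} GB per head per layer)")
print(f"reduction:       {dense / windowed:,.1f}x")
```

Doubling the sequence length quadruples the dense count but only doubles the windowed one, which is the difference between O(n²) and O(n). Whether a sparse pattern preserves model quality at this length is a separate question that these counts do not address.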
Skepticism Warranted
Without an arXiv paper, open-source code, or third-party verification, the claim sits firmly in the "extraordinary claims require extraordinary evidence" category. Several research efforts have previously proposed sub-quadratic alternatives to standard attention, including S4 (Gu et al. 2021), Mamba (Gu and Dao 2023), and RWKV (Peng et al. 2023), but none has demonstrated competitive quality at 12M-token scale on standard benchmarks while maintaining claimed cost savings. The company did not disclose its model architecture, training data, parameter count, or inference hardware.
Context: The Long-Context Arms Race
The claim arrives as major labs race to extend context windows. Anthropic's Claude Opus 4.6 supports 200K tokens. Google's Gemini 1.5 Pro offers 1 million tokens in preview, priced at $10 per million input tokens. OpenAI's GPT-4o supports 128K tokens. A 12M-token context — roughly the length of 24,000 pages of text — would be an order of magnitude beyond any publicly available production model. If verified, the startup's approach could unlock use cases in legal document analysis, codebase-wide reasoning, and scientific literature review that current models cannot address economically.
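The size comparisons in this section reduce to simple ratios. The following sketch recomputes them from the context-window figures quoted above; the dictionary and variable names are illustrative only.

```python
CLAIMED_CONTEXT = 12_000_000   # tokens, per the startup's claim
QUOTED_PAGES = 24_000          # page-count equivalent quoted in the article

# Context windows as quoted in the article (tokens).
context_windows = {
    "Claude Opus 4.6": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
    "GPT-4o": 128_000,
}

# How many times larger the claimed context is than each quoted window.
multiples = {name: CLAIMED_CONTEXT / size for name, size in context_windows.items()}

# Tokens per page implied by the article's page-count equivalence.
tokens_per_page = CLAIMED_CONTEXT / QUOTED_PAGES

for name, multiple in multiples.items():
    print(f"{name}: {multiple:g}x smaller than the claimed context")
print(f"implied tokens per page: {tokens_per_page:g}")
```

Against the largest window listed, the claimed context is 12 times larger, which is what the "order of magnitude" description rests on. The page equivalence implies about 500 tokens per page.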
Key Takeaways
- Miami startup claims 12M-token LLM inference for $8 vs. $2,600 on Claude Opus 4.6.
- No paper or benchmarks released yet.
What to watch
Watch for the startup to release an arXiv preprint or open-source model weights. Without independent verification on RULER or Needle-in-a-Haystack, the claim remains unsubstantiated. If a third-party benchmark confirms 12M-token throughput at $8, expect immediate replication attempts from major labs.
Source: pub.towardsai.net









