A new paper from researchers extending the work of Leyva-Vázquez and Smarandache (2025) reveals a fundamental limitation in how we measure what large language models "know"—and proposes a surprisingly simple solution. The research, published on arXiv in March 2026, demonstrates that while neutrosophic logic evaluation (measuring independent Truth, Indeterminacy, and Falsity dimensions) reveals important patterns in LLM reasoning, it still collapses critical epistemic distinctions that can be recovered by asking models to describe what they cannot evaluate.
The Neutrosophic Baseline and Its Limits
Neutrosophic logic, introduced to AI evaluation by Leyva-Vázquez and Smarandache in 2025, breaks from traditional probabilistic frameworks where truth values must sum to 1. Instead, it treats Truth (T), Indeterminacy (I), and Falsity (F) as independent dimensions. This allows for the identification of "hyper-truth" cases where T+I+F > 1.0—situations where a model simultaneously assigns high truth, high indeterminacy, and even some falsity to the same proposition.
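The core departure from probability can be sketched in a few lines. This is an illustrative check, not code from the paper: it flags a judgment as hyper-truth when the three independent components exceed a sum of 1.0.

```python
# Minimal sketch of hyper-truth detection under the neutrosophic framing.
# Unlike probabilities, T, I, and F are independent and need not sum to 1.

def is_hyper_truth(t: float, i: float, f: float, eps: float = 1e-9) -> bool:
    """Flag a judgment as hyper-truth when T + I + F exceeds 1.0."""
    for v in (t, i, f):
        if not 0.0 <= v <= 1.0:
            raise ValueError("each neutrosophic component must lie in [0, 1]")
    return (t + i + f) > 1.0 + eps

print(is_hyper_truth(0.8, 0.6, 0.2))  # True: components sum to 1.6
print(is_hyper_truth(0.5, 0.3, 0.2))  # False: components sum to 1.0
```

The `eps` tolerance guards against floating-point noise at the T+I+F = 1.0 boundary, where the classical and neutrosophic regimes meet.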
The 2025 study found this hyper-truth phenomenon in 35% of complex epistemic cases evaluated by LLMs. The new research first replicates and extends this finding across five model families from major vendors:
| Model | Vendor | Hyper-truth rate |
|---|---|---|
| Claude 3.7 | Anthropic | 87% |
| Llama 3.3 | Meta | 82% |
| DeepSeek-R1 | DeepSeek | 85% |
| Qwen 2.5 | Alibaba | 81% |
| Mistral-Large | Mistral | 85% |

Average across all models: 84%
This confirms that the hyper-truth phenomenon is robust across vendors under consistent prompting protocols. But the researchers identified a more fundamental problem: scalar T/I/F outputs, even when independent, still collapse distinct epistemic situations into identical numerical representations.
The Absorption Problem: When All Zeros Look Alike
The critical limitation emerges in what the researchers term the "Absorption" position: when a model outputs T=0, I=1, F=0. This identical scalar triplet can represent fundamentally different epistemic situations:

- Paradox: "This statement is false" (logical contradiction)
- Ignorance: "The exact population of ancient Carthage in 200 BCE" (missing information)
- Contingency: "It will rain in Tokyo next Tuesday" (future uncertainty)
"Neutrosophic logic was designed to preserve distinctions that classical logic collapses," the authors note. "But we found that scalar T/I/F itself collapses distinctions it should preserve—paradox, ignorance, and contingency all map to the same (0,1,0) output."
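The collapse can be made concrete. The snippet below (an illustration using the article's three example statements, not the paper's dataset) shows that a scalar-only evaluator hands downstream consumers the same triplet for all three cases:

```python
# The three epistemic situations above can share the identical scalar
# triplet, which is the collapse the paper calls "Absorption".
# The case labels and statements are the article's examples.

ABSORPTION = (0.0, 1.0, 0.0)  # T=0, I=1, F=0

cases = {
    "paradox":     "This statement is false",
    "ignorance":   "The exact population of ancient Carthage in 200 BCE",
    "contingency": "It will rain in Tokyo next Tuesday",
}

# A scalar-only evaluator returns the same triplet for every case,
# so nothing downstream can tell paradox from ignorance from contingency.
scalar_outputs = {label: ABSORPTION for label in cases}
assert len(set(scalar_outputs.values())) == 1  # all three collapse to one point
```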
Declared Losses: From Scalars to Tensors
The solution proposed is elegantly simple: extend evaluation from scalar outputs to tensor-structured outputs that include both numerical scores and declared losses—structured descriptions of what the model cannot evaluate and why.
When models producing identical (0,1,0) scalars for paradox versus ignorance cases were asked to declare their losses, they produced nearly disjoint vocabularies:
- For paradoxes: Loss descriptions included "logical contradiction," "self-reference," "violates bivalence," with severity ratings indicating fundamental unresolvability.
- For ignorance: Loss descriptions included "historical records incomplete," "archeological evidence conflicting," "no authoritative source," with severity ratings indicating information gaps.
The quantitative separation was stark: Jaccard similarity between loss description keywords for paradox versus ignorance cases was < 0.10, indicating nearly disjoint vocabularies. Domain-specific terminology and severity ratings provided clear differentiation where scalars showed none.
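The separation metric itself is standard. A small sketch, using hypothetical keyword sets that echo the article's examples (the study computed this over loss-description keywords per case type):

```python
# Jaccard similarity between loss-description keyword sets.
# The keyword sets below are illustrative, not the paper's data.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

paradox_kw   = {"logical", "contradiction", "self-reference", "bivalence"}
ignorance_kw = {"records", "incomplete", "archeological", "conflicting", "source"}

print(jaccard(paradox_kw, ignorance_kw))  # 0.0: these vocabularies are disjoint
```

A score below 0.10, as reported, means the two vocabularies share almost no keywords relative to their combined size.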
Technical Implementation and Results
The researchers developed a prompting framework that asks models to output:
{
  "T": 0.0,
  "I": 1.0,
  "F": 0.0,
  "losses": [
    {
      "domain": "logical consistency",
      "description": "statement contains self-referential contradiction",
      "severity": "fundamental",
      "recoverable": false
    }
  ]
}
This tensor-structured output (scalars + structured losses) was evaluated across 500 complex epistemic cases spanning logical paradoxes, historical unknowns, scientific uncertainties, and ethical dilemmas. The declared losses not only differentiated cases that had identical scalar outputs but also provided actionable diagnostics for model limitations.
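Consuming such output in practice requires validation, since models can emit malformed JSON or drop fields. A minimal parser sketch, using the field names from the example above (the validation rules are our own assumptions, not the paper's):

```python
# Minimal validator for the tensor-structured output format shown above.
# Field names follow the JSON example; the checks are assumptions.
import json

REQUIRED_LOSS_KEYS = {"domain", "description", "severity", "recoverable"}

def parse_tensor_output(raw: str) -> dict:
    """Parse and sanity-check a tensor-structured evaluation."""
    out = json.loads(raw)
    for k in ("T", "I", "F"):
        v = out[k]
        if not (isinstance(v, (int, float)) and 0.0 <= v <= 1.0):
            raise ValueError(f"{k} must be a number in [0, 1]")
    for loss in out.get("losses", []):
        missing = REQUIRED_LOSS_KEYS - loss.keys()
        if missing:
            raise ValueError(f"loss entry missing fields: {sorted(missing)}")
    return out

sample = (
    '{"T": 0.0, "I": 1.0, "F": 0.0, "losses": [{"domain": "logical consistency",'
    ' "description": "statement contains self-referential contradiction",'
    ' "severity": "fundamental", "recoverable": false}]}'
)
parsed = parse_tensor_output(sample)
print(parsed["losses"][0]["severity"])  # fundamental
```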
Key finding: Models consistently produced more nuanced and accurate loss declarations for domains where they had stronger training data, suggesting declared losses could serve as a proxy for model confidence in specific knowledge domains.
Implications for LLM Evaluation and Deployment
The research suggests several practical implications:

- Benchmarking enhancement: Future LLM evaluations should include declared loss analysis alongside traditional accuracy metrics.
- Trust calibration: End-users could receive not just model answers but structured explanations of what the model cannot determine.
- Training signal: Declared losses could provide richer training signals for improving model epistemic honesty.
"Scalar T/I/F is necessary but insufficient," the authors conclude. "Tensor-structured output provides a more faithful model of LLM epistemic capabilities by preserving distinctions that matter for real-world deployment."
gentic.news Analysis
This research arrives at a critical juncture in LLM evaluation methodology. Following our October 2025 coverage of Leyva-Vázquez and Smarandache's original neutrosophic framework, which itself was a response to the limitations of probability-only evaluations we documented in our "Beyond Confidence Scores" series, this work represents the next logical evolution: from detecting hyper-truth to explaining its nature.
The finding that 84% of complex evaluations show hyper-truth across major model families—up from 35% in the 2025 study—suggests either that models have become more nuanced in their epistemic reasoning or that evaluation protocols have improved. Given the timeline (the original study evaluated 2024-era models while this uses late-2025/early-2026 models), both factors likely contribute.
This work connects to several trends we've tracked: Anthropic's constitutional AI efforts to make model limitations explicit, Meta's Llama Guard framework for structured output, and the broader industry movement toward uncertainty quantification in AI systems. The declared losses concept particularly aligns with DeepMind's recent work on model introspection protocols, though this paper takes a more structured, evaluation-focused approach.
Practically, this research provides a methodology that could bridge the gap between academic evaluation and production deployment. As enterprises increasingly demand auditable AI reasoning for regulatory compliance (especially under the EU AI Act's transparency requirements), tensor-structured outputs with declared losses offer a concrete implementation path. The near-disjoint loss vocabularies for different uncertainty types (<0.10 Jaccard similarity) suggest this isn't just theoretical—models genuinely differentiate these cases in describable ways.
Frequently Asked Questions
What is neutrosophic logic in AI evaluation?
Neutrosophic logic is a framework for evaluating AI systems that treats Truth (T), Indeterminacy (I), and Falsity (F) as independent dimensions not constrained to sum to 1.0. This allows detection of "hyper-truth" cases where T+I+F > 1.0, representing situations where models assign simultaneous truth, indeterminacy, and falsity values—a more nuanced representation than traditional probability scores.
How do "declared losses" differ from confidence scores?
Confidence scores are single numerical values (usually between 0 and 1) representing how sure a model is about its answer. Declared losses are structured descriptions explaining what the model cannot determine and why—including domain categorization, specific limitations, severity ratings, and recoverability assessments. Where confidence scores collapse different uncertainty types into one number, declared losses preserve distinctions between logical paradoxes, information gaps, and future contingencies.
Which LLMs were tested in this research?
The study evaluated five model families from major vendors: Anthropic's Claude 3.7, Meta's Llama 3.3, DeepSeek's DeepSeek-R1, Alibaba's Qwen 2.5, and Mistral's Mistral-Large. All showed hyper-truth rates between 81% and 87% on complex epistemic evaluations, with an average of 84% across all models.
Could declared losses be used to improve LLM training?
Yes, the researchers suggest declared losses could provide richer training signals than traditional right/wrong feedback. By training models to not only produce answers but also accurately declare what they cannot determine, developers could improve epistemic honesty—the model's ability to recognize and communicate its own limitations. This aligns with reinforcement learning from human feedback (RLHF) approaches that reward appropriate uncertainty expression.