A new paper from researchers extending the work of Leyva-Vázquez and Smarandache (2025) reveals a fundamental limitation in how we measure what large language models "know"—and proposes a surprisingly simple solution. The research, published on arXiv in March 2026, demonstrates that while neutrosophic logic evaluation (measuring independent Truth, Indeterminacy, and Falsity dimensions) reveals important patterns in LLM reasoning, it still collapses critical epistemic distinctions that can be recovered by asking models to describe what they cannot evaluate.
The Neutrosophic Baseline and Its Limits
Neutrosophic logic, introduced to AI evaluation by Leyva-Vázquez and Smarandache in 2025, breaks from traditional probabilistic frameworks where truth values must sum to 1. Instead, it treats Truth (T), Indeterminacy (I), and Falsity (F) as independent dimensions. This allows for the identification of "hyper-truth" cases where T+I+F > 1.0—situations where a model simultaneously assigns high truth, high indeterminacy, and even some falsity to the same proposition.
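The core departure from probability can be sketched in a few lines. This is an illustrative check, not code from the paper: it flags a judgment as hyper-truth when the three independent components exceed a sum of 1.0.

```python
# Minimal sketch of hyper-truth detection under the neutrosophic framing.
# Unlike probabilities, T, I, and F are independent and need not sum to 1.

def is_hyper_truth(t: float, i: float, f: float, eps: float = 1e-9) -> bool:
    """Flag a judgment as hyper-truth when T + I + F exceeds 1.0."""
    for v in (t, i, f):
        if not 0.0 <= v <= 1.0:
            raise ValueError("each neutrosophic component must lie in [0, 1]")
    return (t + i + f) > 1.0 + eps

print(is_hyper_truth(0.8, 0.6, 0.2))  # True: components sum to 1.6
print(is_hyper_truth(0.5, 0.3, 0.2))  # False: components sum to 1.0
```

The `eps` tolerance guards against floating-point noise at the T+I+F = 1.0 boundary, where the classical and neutrosophic regimes meet.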
The 2025 study found this hyper-truth phenomenon in 35% of complex epistemic cases evaluated by LLMs. The new research first replicates and extends this finding across five model families from major vendors:
| Model | Vendor | Hyper-truth rate |
|---|---|---|
| Claude 3.7 | Anthropic | 87% |
| Llama 3.3 | Meta | 82% |
| DeepSeek-R1 | DeepSeek | 85% |
| Qwen 2.5 | Alibaba | 81% |
| Mistral-Large | Mistral | 85% |

Average across all models: 84%
This confirms that the hyper-truth phenomenon is robust across vendors under consistent prompting protocols. But the researchers identified a more fundamental problem: scalar T/I/F outputs, even when independent, still collapse distinct epistemic situations into identical numerical representations.
The Absorption Problem: When All Zeros Look Alike
The critical limitation emerges in what the researchers term the "Absorption" position: when a model outputs T=0, I=1, F=0. This identical scalar triplet can represent fundamentally different epistemic situations:

- Paradox: "This statement is false" (logical contradiction)
- Ignorance: "The exact population of ancient Carthage in 200 BCE" (missing information)
- Contingency: "It will rain in Tokyo next Tuesday" (future uncertainty)
"Neutrosophic logic was designed to preserve distinctions that classical logic collapses," the authors note. "But we found that scalar T/I/F itself collapses distinctions it should preserve—paradox, ignorance, and contingency all map to the same (0,1,0) output."
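The collapse can be made concrete. The snippet below (an illustration using the article's three example statements, not the paper's dataset) shows that a scalar-only evaluator hands downstream consumers the same triplet for all three cases:

```python
# The three epistemic situations above can share the identical scalar
# triplet, which is the collapse the paper calls "Absorption".
# The case labels and statements are the article's examples.

ABSORPTION = (0.0, 1.0, 0.0)  # T=0, I=1, F=0

cases = {
    "paradox":     "This statement is false",
    "ignorance":   "The exact population of ancient Carthage in 200 BCE",
    "contingency": "It will rain in Tokyo next Tuesday",
}

# A scalar-only evaluator returns the same triplet for every case,
# so nothing downstream can tell paradox from ignorance from contingency.
scalar_outputs = {label: ABSORPTION for label in cases}
assert len(set(scalar_outputs.values())) == 1  # all three collapse to one point
```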
Declared Losses: From Scalars to Tensors
The solution proposed is elegantly simple: extend evaluation from scalar outputs to tensor-structured outputs that include both numerical scores and declared losses—structured descriptions of what the model cannot evaluate and why.
When models producing identical (0,1,0) scalars for paradox versus ignorance cases were asked to declare their losses, they produced nearly disjoint vocabularies:
- For paradoxes: Loss descriptions included "logical contradiction," "self-reference," "violates bivalence," with severity ratings indicating fundamental unresolvability.
- For ignorance: Loss descriptions included "historical records incomplete," "archeological evidence conflicting," "no authoritative source," with severity ratings indicating information gaps.
The quantitative separation was stark: Jaccard similarity between loss description keywords for paradox versus ignorance cases was < 0.10, indicating nearly disjoint vocabularies. Domain-specific terminology and severity ratings provided clear differentiation where scalars showed none.
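The separation metric itself is standard. A small sketch, using hypothetical keyword sets that echo the article's examples (the study computed this over loss-description keywords per case type):

```python
# Jaccard similarity between loss-description keyword sets.
# The keyword sets below are illustrative, not the paper's data.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

paradox_kw   = {"logical", "contradiction", "self-reference", "bivalence"}
ignorance_kw = {"records", "incomplete", "archeological", "conflicting", "source"}

print(jaccard(paradox_kw, ignorance_kw))  # 0.0: these vocabularies are disjoint
```

A score below 0.10, as reported, means the two vocabularies share almost no keywords relative to their combined size.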
Technical Implementation and Results
The researchers developed a prompting framework that asks models to output:
{
  "T": 0.0,
  "I": 1.0,
  "F": 0.0,
  "losses": [
    {
      "domain": "logical consistency",
      "description": "statement contains self-referential contradiction",
      "severity": "fundamental",
      "recoverable": false
    }
  ]
}
This tensor-structured output (scalars + structured losses) was evaluated across 500 complex epistemic cases spanning logical paradoxes, historical unknowns, scientific uncertainties, and ethical dilemmas. The declared losses not only differentiated cases that had identical scalar outputs but also provided actionable diagnostics for model limitations.
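Consuming such output in practice requires validation, since models can emit malformed JSON or drop fields. A minimal parser sketch, using the field names from the example above (the validation rules are our own assumptions, not the paper's):

```python
# Minimal validator for the tensor-structured output format shown above.
# Field names follow the JSON example; the checks are assumptions.
import json

REQUIRED_LOSS_KEYS = {"domain", "description", "severity", "recoverable"}

def parse_tensor_output(raw: str) -> dict:
    """Parse and sanity-check a tensor-structured evaluation."""
    out = json.loads(raw)
    for k in ("T", "I", "F"):
        v = out[k]
        if not (isinstance(v, (int, float)) and 0.0 <= v <= 1.0):
            raise ValueError(f"{k} must be a number in [0, 1]")
    for loss in out.get("losses", []):
        missing = REQUIRED_LOSS_KEYS - loss.keys()
        if missing:
            raise ValueError(f"loss entry missing fields: {sorted(missing)}")
    return out

sample = (
    '{"T": 0.0, "I": 1.0, "F": 0.0, "losses": [{"domain": "logical consistency",'
    ' "description": "statement contains self-referential contradiction",'
    ' "severity": "fundamental", "recoverable": false}]}'
)
parsed = parse_tensor_output(sample)
print(parsed["losses"][0]["severity"])  # fundamental
```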
Key finding: Models consistently produced more nuanced and accurate loss declarations for domains where they had stronger training data, suggesting declared losses could serve as a proxy for model confidence in specific knowledge domains.
Implications for LLM Evaluation and Deployment
The research suggests several practical implications:

- Benchmarking enhancement: Future LLM evaluations should include declared loss analysis alongside traditional accuracy metrics.
- Trust calibration: End-users could receive not just model answers but structured explanations of what the model cannot determine.
- Training signal: Declared losses could provide richer training signals for improving model epistemic honesty.
"Scalar T/I/F is necessary but insufficient," the authors conclude. "Tensor-structured output provides a more faithful model of LLM epistemic capabilities by preserving distinctions that matter for real-world deployment."
gentic.news Analysis
This research arrives at a critical juncture in LLM evaluation methodology. Following our October 2025 coverage of Leyva-Vázquez and Smarandache's original neutrosophic framework, which itself was a response to the limitations of probability-only evaluations we documented in our "Beyond Confidence Scores" series, this work represents the next logical evolution: from detecting hyper-truth to explaining its nature.
The finding that 84% of complex evaluations show hyper-truth across major model families—up from 35% in the 2025 study—suggests either that models have become more nuanced in their epistemic reasoning or that evaluation protocols have improved. Given the timeline (the original study evaluated 2024-era models while this uses late-2025/early-2026 models), both factors likely contribute.
This work connects to several trends we've tracked: Anthropic's constitutional AI efforts to make model limitations explicit, Meta's Llama Guard framework for structured output, and the broader industry movement toward uncertainty quantification in AI systems. The declared losses concept particularly aligns with DeepMind's recent work on model introspection protocols, though this paper takes a more structured, evaluation-focused approach.
Practically, this research provides a methodology that could bridge the gap between academic evaluation and production deployment. As enterprises increasingly demand auditable AI reasoning for regulatory compliance (especially under the EU AI Act's transparency requirements), tensor-structured outputs with declared losses offer a concrete implementation path. The near-disjoint loss vocabularies for different uncertainty types (<0.10 Jaccard similarity) suggest this isn't just theoretical—models genuinely differentiate these cases in describable ways.
Frequently Asked Questions
What is neutrosophic logic in AI evaluation?
Neutrosophic logic is a framework for evaluating AI systems that treats Truth (T), Indeterminacy (I), and Falsity (F) as independent dimensions not constrained to sum to 1.0. This allows detection of "hyper-truth" cases where T+I+F > 1.0, representing situations where models assign simultaneous truth, indeterminacy, and falsity values—a more nuanced representation than traditional probability scores.
How do "declared losses" differ from confidence scores?
Confidence scores are single numerical values (usually between 0 and 1) representing how sure a model is about its answer. Declared losses are structured descriptions explaining what the model cannot determine and why—including domain categorization, specific limitations, severity ratings, and recoverability assessments. Where confidence scores collapse different uncertainty types into one number, declared losses preserve distinctions between logical paradoxes, information gaps, and future contingencies.
Which LLMs were tested in this research?
The study evaluated five model families from major vendors: Anthropic's Claude 3.7, Meta's Llama 3.3, DeepSeek's DeepSeek-R1, Alibaba's Qwen 2.5, and Mistral's Mistral-Large. All showed hyper-truth rates between 81% and 87% on complex epistemic evaluations, with an average of 84% across all models.
Could declared losses be used to improve LLM training?
Yes, the researchers suggest declared losses could provide richer training signals than traditional right/wrong feedback. By training models to not only produce answers but also accurately declare what they cannot determine, developers could improve epistemic honesty—the model's ability to recognize and communicate its own limitations. This aligns with reinforcement learning from human feedback (RLHF) approaches that reward appropriate uncertainty expression.