classification
30 articles about classification in AI news
The Fine-Grained Vision Gap: Why VLMs Excel at Conversation But Fail at Classification
New research reveals vision-language models struggle with fine-grained visual classification despite excelling at complex reasoning tasks. The study identifies architectural and training factors creating this disconnect, with implications for AI development.
Meta-Harness from Stanford/MIT Shows System Code Creates 6x AI Performance Gap
Stanford and MIT researchers show AI performance depends as much on the surrounding system code (the 'harness') as the model itself. Their Meta-Harness framework automatically improves this code, yielding significant gains in reasoning and classification tasks.
Meta-Harness Framework Automates AI Agent Engineering, Shows 6x Performance Gap on Same Model
A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.
98× Faster LLM Routing Without a Dedicated GPU: Technical Breakthrough for vLLM Semantic Router
New research presents a three-stage optimization pipeline for the vLLM Semantic Router, achieving 98× speedup and enabling long-context classification on shared GPUs. This solves critical memory and latency bottlenecks for system-level LLM routing.
Beyond Simple Recognition: How DeepIntuit Teaches AI to 'Reason' About Videos
Researchers have developed DeepIntuit, a new AI framework that moves video classification from simple pattern imitation to intuitive reasoning. The system uses vision-language models and reinforcement learning to handle complex, real-world video variations where traditional models fail.
CoRe-BT: The Missing Piece for AI Brain Tumor Diagnosis
Researchers introduce CoRe-BT, a multimodal benchmark combining MRI, pathology images, and text reports for brain tumor typing. The dataset addresses real-world clinical challenges where diagnostic data is often incomplete, enabling more robust AI models for glioma classification.
Agent Harnessing: The Infrastructure That Makes AI Agents Work
A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.
How a Nursing Student Used Claude Haiku to Build a 660K-Page Drug Database Solo
Learn how Claude Haiku enabled a solo developer to classify thousands of medical conditions and build a production-grade pharmaceutical database.
The Developer's Guide to Finetuning LLMs
A developer-focused article outlines decision frameworks for LLM finetuning—covering when it's worth the cost, how to approach it, and key trade-offs. For retail leaders, this is a practical primer on customizing models for brand-specific tasks.
ECLASS-Augmented Semantic Product Search
Researchers systematically evaluated LLM-assisted dense retrieval for semantic product search on industrial electronic components. Augmenting embeddings with ECLASS hierarchical metadata created a crucial semantic bridge, achieving 94.3% Hit_Rate@5 versus 31.4% for BM25.
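The Hit_Rate@5 metric reported above, and the idea of prepending hierarchical metadata before embedding, can be sketched in a few lines. This is a minimal illustration, not the paper's method; the category path and product text are invented examples.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant item appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(retrieved, relevant) if gold in ranked[:k])
    return hits / len(relevant)

def augment_with_eclass(description: str, eclass_path: list[str]) -> str:
    """Prepend hierarchical category labels so the embedding model sees
    taxonomy context, not just raw product text (illustrative format)."""
    return " > ".join(eclass_path) + " | " + description

text = augment_with_eclass("ceramic capacitor 100nF 50V",
                           ["Electronics", "Passive components", "Capacitors"])
print(text)

# Two of three queries contain their relevant item in the top 5.
runs = [["p1", "p2", "p3"], ["p9", "p4", "p7"], ["p8", "p6", "p3"]]
gold = ["p2", "p4", "p5"]
print(hit_rate_at_k(runs, gold, k=5))
```

The augmented string is what gets embedded; the metric is computed over the ranked retrieval lists.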
Microsoft, Google Shift to Range-Based AI Capacity Planning at DC World 2026
At Data Center World 2026, Microsoft and Google revealed they've shifted from point forecasts to range-based planning for AI workloads, with weekly reviews and modular infrastructure to absorb demand volatility.
CGCMA Model Achieves +0.449 Sharpe Ratio in Asynchronous Crypto News Fusion
Researchers propose CGCMA, a model for fusing sporadic news with continuous market data. It achieved a +0.449 Sharpe ratio on a new crypto trading benchmark, showing gains not explained by simple heuristics.
DNL Method Finds 2 Bits That Crash ResNet-50, Qwen3-30B
Researchers introduced Deep Neural Lesion (DNL), a method to find critical parameters. Flipping just two sign bits reduced ResNet-50 accuracy by 99.8% and Qwen3-30B reasoning to 0%.
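What a single sign-bit flip does to a float32 weight can be shown directly; this is a generic IEEE-754 illustration, not the DNL search method for locating which parameters are critical.

```python
import struct

def flip_sign_bit(x: float) -> float:
    """Flip the IEEE-754 sign bit of a float32 value."""
    # Pack as float32, reinterpret as uint32, toggle the top bit.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    flipped = bits ^ 0x80000000
    return struct.unpack("<f", struct.pack("<I", flipped))[0]

# A large-magnitude weight changes sign entirely; in a critical
# parameter, that one-bit change can cascade through activations.
print(flip_sign_bit(3.5))    # -> -3.5
print(flip_sign_bit(-0.25))  # -> 0.25
```

The hard part, which DNL addresses, is finding the two parameters whose flipped signs are catastrophic among billions of candidates.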
Why Claude Code's 'Tool Calls' Aren't Hooks — And How to Design for Them
Understanding Claude's 8-step tool pipeline—from edge routing to result injection—is critical for structuring error handling, timeouts, and debugging in production applications.
Install token-ninja: The MCP Server That Saves Tokens on Common Shell Commands
A new MCP server, token-ninja, automatically runs simple shell commands locally instead of sending them to Claude, cutting token usage and speeding up your workflow.
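The routing idea behind token-ninja can be sketched as an allowlist check before dispatch. The command list and fallback behavior here are assumptions for illustration, not the server's actual implementation.

```python
import shlex
import subprocess

# Hypothetical allowlist of cheap, read-only commands worth running
# locally instead of spending model tokens on them.
LOCAL_COMMANDS = {"ls", "pwd", "wc", "cat", "grep", "echo"}

def route(command: str) -> str:
    """Run allowlisted commands locally; otherwise defer to the model."""
    argv = shlex.split(command)
    if argv and argv[0] in LOCAL_COMMANDS:
        result = subprocess.run(argv, capture_output=True, text=True)
        return result.stdout.strip()
    return "DEFER_TO_MODEL"

print(route("echo hello"))        # runs locally -> hello
print(route("refactor main.py"))  # unknown command -> DEFER_TO_MODEL
```

A real MCP server would expose this as a tool and return structured results, but the token savings come from exactly this short-circuit.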
OVRSISBenchV2: New 170K-Image Benchmark for Realistic Remote Sensing AI
A new benchmark, OVRSISBenchV2, with 170K images and 128 categories, sets a more realistic test for geospatial AI segmentation. The accompanying Pi-Seg model uses learnable semantic noise to broaden the feature space and improve transfer.
NSA Uses Anthropic's Claude Mythos Despite 'Supply Chain Risk' Label
The National Security Agency is using Anthropic's Claude Mythos Preview for its capabilities, despite having labeled Anthropic itself as a potential supply chain risk. This highlights the tension between security concerns and the operational need for cutting-edge AI.
ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run
Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on the MLE-Bench, achieving a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.
NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning
NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.
Claude Code's New Repo-Resolver Fixes Monorepo and Remote URL Headaches
Claude Code's runtime now uses a unified repo-resolver package, providing consistent project identification across all its services and correctly handling monorepos and various git remote URL formats.
BERT-as-a-Judge Matches LLM-as-a-Judge Performance at Fraction of Cost
Researchers propose 'BERT-as-a-Judge,' a lightweight evaluation method that matches the performance of costly LLM-as-a-Judge setups. This could drastically reduce the cost of automated LLM evaluation pipelines.
Paper Proposes 'Artificial Scientist' as New AGI Definition
A new paper defines AGI as an 'artificial scientist'—a system that adapts as generally as a human scientist under computational limits. This reframes the goal from passing benchmarks to autonomous planning, causal learning, and exploration.
WebAI's Open-Source Model Hits #1 on MTEB Retrieval Leaderboard
WebAI has open-sourced a document retrieval model that currently holds the #1 position on the Massive Text Embedding Benchmark (MTEB) leaderboard. This provides a high-performance, free alternative to closed-source embedding APIs used in Retrieval-Augmented Generation (RAG) pipelines.
The Hidden Cost of AI Translation Layers in Global Customer Support
An article argues that using a basic translation layer for multilingual AI customer support is a costly mistake. It fails to convey cultural context and appropriate tone, leading to higher churn and lower satisfaction in non-English markets. The solution requires treating multilingual support as a core operational capability, not just a technical add-on.
How to Build a Claude Code Fallback System with Hermes Agent and Qwen3.6
Set up Hermes Agent with open models as a cost-effective Claude Code alternative for routine tasks, reserving Claude for complex refactors.
Google Negotiates Pentagon AI Deal with OpenAI's 'All Lawful Uses' Terms
Google is in talks with the Pentagon to deploy Gemini under terms mirroring OpenAI's 'all lawful uses' contract, a reversal from its 2018 Project Maven withdrawal. Anthropic remains excluded for refusing to drop safeguards against autonomous weapons.
Anthropic & Nature Paper: LLMs Pass Traits via 'Subliminal Learning'
Anthropic co-authored a paper in Nature demonstrating that large language models can learn and pass on hidden 'subliminal' signals embedded in training data, such as preferences or misaligned objectives. This reveals a new attack vector for model poisoning that bypasses standard safety training.
OpenAI Shifts ChatGPT Ads to CPC, Targets $11B Revenue by 2027
OpenAI is restructuring ChatGPT advertising, moving from impression-based pricing to cost-per-click and conversion-driven models. This shift aims to compete directly with Google and Meta in intent-based advertising, targeting $2.4B revenue this year and $11B by 2027.
LLM Schema-Adaptive Method Enables Zero-Shot EHR Transfer
Researchers propose Schema-Adaptive Tabular Representation Learning, an LLM-driven method that transforms structured variables into semantic statements. It enables zero-shot alignment across unseen EHR schemas and outperforms clinical baselines, including neurologists, on dementia diagnosis tasks.
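The core move of converting structured variables into semantic statements can be sketched as template rendering over a row. The schema, templates, and field names below are invented for illustration, not the paper's actual format.

```python
# A hypothetical schema maps column names to human-readable templates,
# so rows from unseen EHR schemas become plain-language statements an
# LLM can align across datasets.
SCHEMA = {
    "age_yrs": "The patient is {} years old.",
    "mmse": "The MMSE cognitive score is {}.",
    "apoe4": "The patient carries {} APOE-e4 allele(s).",
}

def row_to_statements(row: dict) -> str:
    """Render known fields with templates; fall back to 'key is value'."""
    parts = []
    for key, value in row.items():
        template = SCHEMA.get(key, key.replace("_", " ") + " is {}.")
        parts.append(template.format(value))
    return " ".join(parts)

print(row_to_statements({"age_yrs": 72, "mmse": 24, "sex": "F"}))
```

Because the output is plain language rather than column positions, a model pretrained on text can score rows from schemas it never saw during training.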
A-R Space Framework Profiles LLM Agent Execution Behavior Across Risk Contexts
Researchers propose the A-R Space, measuring Action Rate and Refusal Signal to profile LLM agent behavior across four risk contexts and three autonomy levels. This provides a deployment-oriented framework for selecting agents based on organizational risk tolerance.
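The two axes named above, Action Rate and Refusal Signal, can be computed from tagged agent transcripts. The event labels and log format below are assumptions, not the paper's instrumentation.

```python
from collections import Counter

def ar_profile(events: list[str]) -> tuple[float, float]:
    """Compute (action_rate, refusal_rate) over a log of agent events.
    Each event is assumed to be tagged 'action', 'refusal', or 'other'."""
    counts = Counter(events)
    total = len(events)
    return counts["action"] / total, counts["refusal"] / total

# Three actions, one refusal, one other across five events.
log = ["action", "action", "refusal", "other", "action"]
print(ar_profile(log))  # -> (0.6, 0.2)
```

Plotting these two rates per risk context gives the deployment profile the framework uses to match agents to an organization's risk tolerance.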