classification

30 articles about classification in AI news

The Fine-Grained Vision Gap: Why VLMs Excel at Conversation But Fail at Classification

New research reveals vision-language models struggle with fine-grained visual classification despite excelling at complex reasoning tasks. The study identifies architectural and training factors creating this disconnect, with implications for AI development.

Feb 23, 2026 · 70% relevant

Meta-Harness from Stanford/MIT Shows System Code Creates 6x AI Performance Gap

Stanford and MIT researchers show AI performance depends as much on the surrounding system code (the 'harness') as the model itself. Their Meta-Harness framework automatically improves this code, yielding significant gains in reasoning and classification tasks.

Apr 6, 2026 · 95% relevant

Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model

A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.

Mar 30, 2026 · 91% relevant

98× Faster LLM Routing Without a Dedicated GPU: Technical Breakthrough for vLLM Semantic Router

New research presents a three-stage optimization pipeline for the vLLM Semantic Router, achieving 98× speedup and enabling long-context classification on shared GPUs. This solves critical memory and latency bottlenecks for system-level LLM routing.

Mar 16, 2026 · 80% relevant

Beyond Simple Recognition: How DeepIntuit Teaches AI to 'Reason' About Videos

Researchers have developed DeepIntuit, a new AI framework that moves video classification from simple pattern imitation to intuitive reasoning. The system uses vision-language models and reinforcement learning to handle complex, real-world video variations where traditional models fail.

Mar 12, 2026 · 84% relevant

CoRe-BT: The Missing Piece for AI Brain Tumor Diagnosis

Researchers introduce CoRe-BT, a multimodal benchmark combining MRI, pathology images, and text reports for brain tumor typing. The dataset addresses real-world clinical challenges where diagnostic data is often incomplete, enabling more robust AI models for glioma classification.

Mar 5, 2026 · 80% relevant

Agent Harnessing: The Infrastructure That Makes AI Agents Work

A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.

Apr 25, 2026 · 78% relevant

How a Nursing Student Used Claude Haiku to Build a 660K-Page Drug Database Solo

Learn how Claude Haiku enabled a solo developer to classify thousands of medical conditions and build a production-grade pharmaceutical database.

Apr 25, 2026 · 75% relevant

The Developer's Guide to Finetuning LLMs

A developer-focused article outlines decision frameworks for LLM finetuning—covering when it's worth the cost, how to approach it, and key trade-offs. For retail leaders, this is a practical primer on customizing models for brand-specific tasks.

Apr 24, 2026 · 78% relevant

ECLASS-Augmented Semantic Product Search

Researchers systematically evaluated LLM-assisted dense retrieval for semantic product search on industrial electronic components. Augmenting embeddings with ECLASS hierarchical metadata created a crucial semantic bridge, achieving 94.3% Hit_Rate@5 versus 31.4% for BM25.

Apr 22, 2026 · 78% relevant
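
The summary above reports Hit_Rate@5 without defining it. A minimal sketch of the metric as it is commonly computed, assuming a query counts as a hit when at least one relevant item appears in its top k results (the function name and input shapes are illustrative, not taken from the paper):

```python
from typing import Sequence, Set


def hit_rate_at_k(
    ranked: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant item in the top k results.

    Assumption: this is the usual definition of hit rate; the paper's exact
    formulation may differ.
    """
    if len(ranked) != len(relevant):
        raise ValueError("ranked and relevant must have one entry per query")
    if not ranked:
        return 0.0
    hits = sum(
        1
        for results, rel in zip(ranked, relevant)
        if any(item in rel for item in results[:k])
    )
    return hits / len(ranked)
```

Under this definition, a score of 94.3% would mean that roughly 94 of every 100 queries returned a relevant component somewhere in the first five results.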

Microsoft, Google Shift to Range-Based AI Capacity Planning at DC World 2026

At Data Center World 2026, Microsoft and Google revealed they've shifted from point forecasts to range-based planning for AI workloads, with weekly reviews and modular infrastructure to absorb demand volatility.

Apr 22, 2026 · 94% relevant

CGCMA Model Achieves +0.449 Sharpe Ratio in Asynchronous Crypto News Fusion

Researchers propose CGCMA, a model for fusing sporadic news with continuous market data. It achieved a +0.449 Sharpe ratio on a new crypto trading benchmark, showing gains not explained by simple heuristics.

Apr 21, 2026 · 85% relevant

DNL Method Finds 2 Bits That Crash ResNet-50, Qwen3-30B

Researchers introduced Deep Neural Lesion (DNL), a method to find critical parameters. Flipping just two sign bits reduced ResNet-50 accuracy by 99.8% and Qwen3-30B reasoning to 0%.

Apr 20, 2026 · 95% relevant
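
To make the "sign bit" in the summary above concrete, here is a small sketch of what flipping that bit does to a single stored weight. It assumes IEEE 754 float32 storage and shows only the bit flip itself, not how DNL selects which parameters to target:

```python
import struct


def flip_sign_bit(value: float) -> float:
    """Flip the sign bit (bit 31) of a value stored as IEEE 754 float32.

    Assumption: weights are float32; other formats keep the sign in the
    top bit too, but have a different width.
    """
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with the top bit toggles the sign and leaves the magnitude intact.
    bits ^= 0x80000000
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped
```

Flipping the bit negates the weight without changing its magnitude, so a large positive weight becomes an equally large negative one.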

Why Claude Code's 'Tool Calls' Aren't Hooks — And How to Design for Its

Understanding Claude's 8-step tool pipeline—from edge routing to result injection—is critical for structuring error handling, timeouts, and debugging in production applications.

Apr 20, 2026 · 100% relevant

Install token-ninja: The MCP Server That Saves Tokens on Common Shell Commands

A new MCP server, token-ninja, automatically runs simple shell commands locally instead of sending them to Claude, cutting token usage and speeding up your workflow.

Apr 20, 2026 · 100% relevant

OVRSISBenchV2: New 170K-Image Benchmark for Realistic Remote Sensing AI

A new benchmark, OVRSISBenchV2, with 170K images and 128 categories, sets a more realistic test for geospatial AI segmentation. The accompanying Pi-Seg model uses learnable semantic noise to broaden feature space and improve transfer.

Apr 20, 2026 · 88% relevant

NSA Uses Anthropic's Claude Mythos Despite 'Supply Chain Risk' Label

The National Security Agency is using Anthropic's Claude Mythos Preview for its capabilities, despite having labeled Anthropic itself as a potential supply chain risk. This highlights the tension between security concerns and the operational need for cutting-edge AI.

Apr 19, 2026 · 97% relevant

ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run

Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on MLE-Bench, achieving a 56.44% medal rate. The key advance is Hierarchical Cognitive Caching, a state-management technique rather than a reasoning improvement, which enables long-horizon scientific workflows.

Apr 19, 2026 · 87% relevant

NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning

NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.

Apr 19, 2026 · 95% relevant

Claude Code's New Repo-Resolver Fixes Monorepo and Remote URL Headaches

Claude Code's runtime now uses a unified repo-resolver package, providing consistent project identification across all its services and correctly handling monorepos and various git remote URL formats.

Apr 19, 2026 · 88% relevant

BERT-as-a-Judge Matches LLM-as-a-Judge Performance at Fraction of Cost

Researchers propose 'BERT-as-a-Judge,' a lightweight evaluation method that matches the performance of costly LLM-as-a-Judge setups. This could drastically reduce the cost of automated LLM evaluation pipelines.

Apr 19, 2026 · 85% relevant

Paper Proposes 'Artificial Scientist' as New AGI Definition

A new paper defines AGI as an 'artificial scientist'—a system that adapts as generally as a human scientist under computational limits. This reframes the goal from passing benchmarks to autonomous planning, causal learning, and exploration.

Apr 17, 2026 · 85% relevant

WebAI's Open-Source Model Hits #1 on MTEB Retrieval Leaderboard

WebAI has open-sourced a document retrieval model that currently holds the #1 position on the Massive Text Embedding Benchmark (MTEB) leaderboard. This provides a high-performance, free alternative to closed-source embedding APIs used in Retrieval-Augmented Generation (RAG) pipelines.

Apr 17, 2026 · 87% relevant

The Hidden Cost of AI Translation Layers in Global Customer Support

An article argues that using a basic translation layer for multilingual AI customer support is a costly mistake. It fails to convey cultural context and appropriate tone, leading to higher churn and lower satisfaction in non-English markets. The solution requires treating multilingual support as a core operational capability, not just a technical add-on.

Apr 16, 2026 · 94% relevant

How to Build a Claude Code Fallback System with Hermes Agent and Qwen3.6

Set up Hermes Agent with open models as a cost-effective Claude Code alternative for routine tasks, reserving Claude for complex refactors.

Apr 16, 2026 · 100% relevant

Google Negotiates Pentagon AI Deal with OpenAI's 'All Lawful Uses' Terms

Google is in talks with the Pentagon to deploy Gemini under terms mirroring OpenAI's 'all lawful uses' contract, a reversal from its 2018 Project Maven withdrawal. Anthropic remains excluded for refusing to drop safeguards against autonomous weapons.

Apr 16, 2026 · 97% relevant

Anthropic & Nature Paper: LLMs Pass Traits via 'Subliminal Learning'

Anthropic co-authored a paper in Nature demonstrating that large language models can learn and pass on hidden 'subliminal' signals embedded in training data, such as preferences or misaligned objectives. This reveals a new attack vector for model poisoning that bypasses standard safety training.

Apr 15, 2026 · 95% relevant

OpenAI Shifts ChatGPT Ads to CPC, Targets $11B Revenue by 2027

OpenAI is restructuring ChatGPT advertising, moving from impression-based pricing to cost-per-click and conversion-driven models. This shift aims to compete directly with Google and Meta in intent-based advertising, targeting $2.4B revenue this year and $11B by 2027.

Apr 15, 2026 · 95% relevant

LLM Schema-Adaptive Method Enables Zero-Shot EHR Transfer

Researchers propose Schema-Adaptive Tabular Representation Learning, an LLM-driven method that transforms structured variables into semantic statements. It enables zero-shot alignment across unseen EHR schemas and outperforms clinical baselines, including neurologists, on dementia diagnosis tasks.

Apr 15, 2026 · 99% relevant

A-R Space Framework Profiles LLM Agent Execution Behavior Across Risk Contexts

Researchers propose the A-R Space, measuring Action Rate and Refusal Signal to profile LLM agent behavior across four risk contexts and three autonomy levels. This provides a deployment-oriented framework for selecting agents based on organizational risk tolerance.

Apr 15, 2026 · 96% relevant
