curation
30 articles about curation in AI news
New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability
A new paper coins the term 'curation debt,' arguing that benchmarks like MMLU measure data leakage rather than capability. It proposes adversarial dynamic benchmarks as a remedy.
X Launches Custom Timelines, AI-Powered Feed Curation Tool
X has launched 'Custom Timelines,' a feature that uses AI to let users create and follow personalized feeds based on curated lists of accounts, moving beyond the main algorithmic 'For You' feed.
MeiGen Revolutionizes AI Art Creation with Automated Prompt Curation
MeiGen, a new open-source tool, automatically scrapes and curates trending AI image prompts from social media, solving the problem of prompt discovery and organization for digital artists. The free platform aggregates weekly collections without requiring manual bookmarking or searching.
Pioneer Agent: A Closed-Loop System for Automating Small Language Model
Researchers present Pioneer Agent, a system that automates the adaptation of small language models to specific tasks. It handles data curation, failure diagnosis, and iterative training, showing significant performance gains in benchmarks and production-style deployments. This addresses a major engineering bottleneck for deploying efficient, specialized AI.
AI emerges as a strategic priority for luxury as accelerating consumer use
A Bain & Company and Comité Colbert report declares AI a strategic priority for luxury brands, driven by accelerating consumer use that challenges the industry to reinvent customer discovery and experience. This matters as luxury houses face pressure to integrate AI without diluting brand exclusivity.
CLI-Universe: Qwen3-32B fine-tuned on 6K trajectories beats models 10x larger on Terminal-Bench 2.0
CLI-Universe synthesizes terminal-agent tasks; Qwen3-32B fine-tuned on 6K trajectories hits 33.4% on Terminal-Bench 2.0, beating models 10x larger.
Cursor Trains GPT-Size Model with 10-20x Compute
Cursor trained a GPT-size model from scratch with 10-20x more compute, announced at Compile. The move shifts from fine-tuning to pretraining for code generation.
Pareto LoRA Boosts Image Quality 44.9% vs Vanilla LoRA on Emu2
Pareto LoRA reformulates multimodal instruction tuning as bi-objective optimization, achieving up to 44.9% image quality gains on Emu2 while maintaining text performance.
Estonian Institute: Claude Tops Russian Propaganda Benchmark, Mistral Trails
An Estonian Language Institute benchmark tests 60 AI models against Russian propaganda. Claude ranks first, while Mistral trails with a 36.67% misinformation rate.
MA-ProofBench: GPT-5.5 Hits 16% on Math Analysis, Most Models Near 0%
MA-ProofBench, a new theorem-proving benchmark for mathematical analysis, shows GPT-5.5 achieving 16% on undergraduate problems and 5% on PhD-level, with most models near 0% on the harder set.
UniSound U2 Cuts Token Use 25%, Joins Top Chinese LLM Tier
UniSound's U2 foundation model cuts token consumption by 25% while matching top Chinese LLM performance, entering the top tier with an efficiency-first design.
Meesho Integrates AI-Powered Product Recommendation System
Meesho integrates an AI-powered recommendation system to personalize shopping. This matters as it shows how value e-commerce platforms adopt AI to compete with giants like Amazon and Google.
NanoGPT-Bench: A New Eval for Coding Agents Doing AI Research
IntologyAI released NanoGPT-Bench, an internal eval for coding agents on an AI R&D problem. No results or task specifics have been disclosed.
Hermes Agent's Three-Tier Memory Cuts Context Bloat, Keeps 2,200-Char Core
Hermes agent's three-tier memory uses two tiny markdown files (2,200 chars), SQLite FTS5 search (10ms over 10K docs), and 8 pluggable providers. The composition solves the always-on vs. deep recall trade-off.
VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time
Frontier MLLMs achieve only 26.5% accuracy on VAB, far below the 68.9% achieved by humans. Fine-tuning bridges the gap.
Almanac: Open-Source Wiki Auto-Updates From Claude Code Chats
Almanac auto-generates a markdown wiki from Claude Code chats and repo history, addressing the agent context gap. The tool is free, open source, and macOS-only.
Anthropic Ships Claude Opus 4.7: 2.1-Point SWE-Bench Gain Over 4.6
Anthropic released Claude Opus 4.7 with a 2.1-point SWE-Bench gain to 82.9, the smallest jump between Opus versions yet, signaling diminishing returns.
Ctx2Skill: Self-Play Framework Lets LMs Discover Skills Without Labels
Ctx2Skill discovers skills from context via multi-agent self-play without labels. Outputs plug into any LM, targeting manual prompt engineering bottlenecks.
Matt Pocock Open-Sources Claude Code Skill Pack for AI Agents
Matt Pocock open-sourced a Claude Code skill pack to improve AI agent behavior. The pack provides curated prompts and configurations for Anthropic's terminal-based coding tool.
GPT-5.5 Pro Leapfrogs on Epoch Benchmark; Base Model Beats Prior Pro
According to a tweet from @kimmonismus, GPT-5.5 Pro shows significant Epoch benchmark gains, and the non-Pro GPT-5.5 surpasses GPT-5.4 Pro, suggesting major efficiency improvements at OpenAI.
K-CARE: A New Framework Grounds LLMs in External Knowledge to Fix
K-CARE combines Symmetrical Contextual Anchoring (behavior data) and Analogical Prototype Reasoning (expert examples) to resolve e-commerce search relevance issues that pure LLM reasoning can't fix. Proven in offline and online A/B tests on a leading platform.
Alec Radford's 'Talk to the Past' AI Lets You Chat with History
A new AI project by Alec Radford and David Duvenaud lets you chat with simulated historical figures.
Hinton Rebrands AI Hallucinations as 'Confabulations'
Geoffrey Hinton redefines AI hallucinations as 'confabulations,' arguing that intelligence reconstructs reality into plausible stories rather than storing facts like a database.
San Francisco Shop Run Entirely by AI Agent
A shop in San Francisco is fully operated by an AI agent, replacing human cashiers and assistants. The concept points toward fully autonomous retail experiences, though details on the technology stack remain thin.
Meta's Sapiens2: 1B Human Image ViTs for Pose, Segmentation, Normals
Meta open-sourced Sapiens2 on Hugging Face, a family of vision transformers pretrained on 1 billion human images for pose estimation, segmentation, normal estimation, and point maps. The models target high-resolution human-centric perception.
ItemRAG: A New RAG Approach for LLM-Based Recommendation That Retrieves
ItemRAG shifts RAG for LLM-based recommenders from user-history retrieval to fine-grained item-level retrieval, using co-purchase and semantic data to prioritize informative items. Experiments show consistent outperformance over existing methods, especially for cold-start items.
From Checkout to Trust Layer: How Merchants Can Prepare for Agentic Commerce
The article discusses the evolution of e-commerce from simple checkout processes to a future where AI shopping agents act on behalf of consumers. It argues that success in this 'agentic commerce' era depends on merchants building a robust trust layer with data security, transparency, and reliability at its core.
CAST: A New Framework for Semantic-Level Complementary Recommendations
Researchers propose CAST, a sequential recommendation framework that models transitions between discrete item semantic codes (e.g., specifications) and injects LLM-verified complementary knowledge. It achieves significant performance gains by moving beyond simplistic co-purchase statistics to capture genuine complementarity.
VoteGCL: A Novel LLM-Augmented Framework to Combat Data Sparsity in
A new paper introduces VoteGCL, a framework that uses few-shot LLM prompting and majority voting to create high-confidence synthetic data for graph-based recommendation systems. It integrates this data via graph contrastive learning to improve accuracy and mitigate bias, outperforming existing baselines.
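The majority-voting step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the LLM calls are replaced by hard-coded votes, and the `like`/`dislike` labels, the `min_agreement` threshold, and both function names are assumptions made for illustration. The graph contrastive learning stage is not shown.

```python
from collections import Counter


def majority_vote(votes, min_agreement=0.6):
    """Return (label, agreement) when the most common label reaches the
    agreement threshold, otherwise None. The threshold is an assumed value."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label, agreement) if agreement >= min_agreement else None


def filter_synthetic_edges(candidate_votes, min_agreement=0.6):
    """Keep only (user, item) pairs voted 'like' with enough agreement."""
    kept = []
    for pair, votes in candidate_votes.items():
        result = majority_vote(votes, min_agreement)
        if result is not None and result[0] == "like":
            kept.append(pair)
    return kept


# Hypothetical votes standing in for repeated few-shot LLM prompts.
VOTES = {
    ("u1", "i7"): ["like", "like", "like", "dislike", "like"],
    ("u1", "i9"): ["like", "dislike", "like", "dislike", "dislike"],
    ("u2", "i3"): ["like", "dislike"],
}

if __name__ == "__main__":
    print(filter_synthetic_edges(VOTES))
```

Only pairs with a confident positive vote survive, which is one simple way to keep low-confidence synthetic interactions out of the training graph.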
Layers on Layers — How You Can Improve Your Recommendation Systems
An IBM article critiques monolithic recommendation engines for trying to do too much with one score. It proposes a layered architecture—candidate generation, ranking, and business logic—to improve performance and adaptability. This is a direct, practical framework for engineering teams.
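A minimal Python sketch of the three layers named above (candidate generation, ranking, business logic) might look as follows. The `Item` fields, the scoring formula, the in-stock rule, and all function names are hypothetical choices made for illustration and are not taken from the IBM article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    category: str
    popularity: float
    in_stock: bool


def generate_candidates(catalog, user_categories, limit=50):
    """Layer 1: cheap recall of items in categories the user engages with."""
    matches = [item for item in catalog if item.category in user_categories]
    return matches[:limit]


def rank(candidates, category_affinity):
    """Layer 2: order candidates by an assumed relevance score."""
    def score(item):
        return category_affinity.get(item.category, 0.0) * item.popularity
    return sorted(candidates, key=score, reverse=True)


def apply_business_rules(ranked, top_k=3):
    """Layer 3: drop out-of-stock items, then truncate to top_k."""
    return [item for item in ranked if item.in_stock][:top_k]


def recommend(catalog, category_affinity, top_k=3):
    """Compose the three layers; each can be swapped independently."""
    candidates = generate_candidates(catalog, set(category_affinity))
    return apply_business_rules(rank(candidates, category_affinity), top_k)


# Hypothetical catalog used only to exercise the pipeline.
CATALOG = [
    Item("a", "shoes", 0.9, True),
    Item("b", "shoes", 0.5, False),
    Item("c", "bags", 0.8, True),
    Item("d", "hats", 0.99, True),
]

if __name__ == "__main__":
    picks = recommend(CATALOG, {"shoes": 1.0, "bags": 0.5})
    print([item.item_id for item in picks])
```

Because each layer has a single responsibility, a stock rule or a new ranking model can change without touching the other two stages.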