cmu
18 articles about cmu in AI news
CMU's Gym-Anything Turns Any Software Into Agent Training Ground
CMU's Gym-Anything automates agent environment creation, producing CUA-World with 10,000+ tasks. Even strong models fail most long tasks, showing real computer-use work is unsolved.
CMU Benchmark: Claude Mythos Hits 9.9/16 on V8 Exploits, GPT-5.5 Trails at 5.5
CMU's ExploitBench shows Claude Mythos scores 9.9/16 on V8 exploits vs GPT-5.5's 5.5, but costs $36,428 per run — 12x more. The cost-performance tradeoff is the real story.
MIT/Oxford/CMU Paper: AI Can Boost Then Harm Human Performance
A collaborative paper from MIT, Oxford, and Carnegie Mellon reports AI assistance can improve human performance initially, but may lead to degradation over time due to over-reliance. This challenges the assumption that AI augmentation yields monotonic benefits.
CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning
Carnegie Mellon researchers tested 14 leading LLMs on simple contradiction tasks; all failed consistently, revealing fundamental reasoning gaps despite advanced benchmarks. (199 chars)
CMU Research Identifies 'Biggest Unlock' for Coding Agents: Strategic Test Execution
New research from Carnegie Mellon University suggests the key advancement for AI coding agents lies not in raw code generation, but in developing strategies for how to run and interpret tests. This shifts focus from LLM capability to agentic reasoning.
Stanford/CMU Study: AI Agent Benchmarks Focus on 7.6% of Jobs, Ignoring Management, Legal, and Interpersonal Work
Researchers analyzed 43 AI benchmarks against 72,000+ real job tasks and found they overwhelmingly test programming/math skills, which represent only 7.6% of actual economic work. Management, legal, and interpersonal tasks—which dominate the labor market—are almost entirely absent from evaluation.
MIT/Oxford Study: GPT-5 Help Boosts Scores Now, Hurts Independent Problem-Solving Later
A new paper from MIT, Oxford, and CMU finds that using GPT-5 for direct answers improves short-term scores but reduces persistence and independent performance after assistance ends. The effect is linked to outsourcing mental effort, not AI exposure itself.
Open-Source Multi-Agent LLM System for Complex Software Engineering Tasks Released by Academic Consortium
A consortium of researchers from Stony Brook, CMU, Yale, UBC, and Fudan University has open-sourced a multi-agent LLM system specifically architected for complex software engineering. The release aims to provide a collaborative, modular framework for tackling tasks beyond single-agent capabilities.
Kering Deploys AI-Powered Sustainable Sourcing Assistant on Google Cloud
Kering launched a Sustainable Sourcing Assistant on Google Cloud's Vertex AI. The tool helps luxury brands like Gucci and Saint Laurent evaluate materials for environmental and social impact, advancing sustainability in procurement.
Shopify Details Generative AI Use Cases for Ecommerce (2026)
Shopify's 2026 guide details generative AI use cases for ecommerce, including conversational AI for sales and product catalog management via the Storefront API. This matters as retailers seek practical AI integrations to enhance operations and customer engagement.
Nvidia Unveils Physical AI Agent Skills, 32B VLA Model at CVPR
Nvidia launched physical AI agent skills and a 32B VLA model at CVPR to automate AV and robotics workflows, addressing the fragmented tooling bottleneck.
ByteDance Open-Sources BAGEL: 7B Multimodal Model for Image Gen, Editing, Understanding
ByteDance open-sourced BAGEL, a 7B multimodal model for image gen, editing, style transfer, and understanding under Apache 2.0.
IPCCF: A New Graph-Based Approach to Disentangle User Intent for Better
A new research paper introduces Intent Propagation Contrastive Collaborative Filtering (IPCCF), a method designed to improve recommendation systems by more accurately disentangling the underlying intents behind user-item interactions. It addresses limitations in existing methods by incorporating broader graph structure and using contrastive learning for direct supervision, showing superior performance in experiments.
Research Shows AI Models Can 'Infect' Others with Hidden Bias
A study reveals AI models can transfer hidden biases to other models via training data, even without direct instruction. This creates a risk of bias propagation across AI ecosystems.
NVIDIA Advances AI Robotics with Simulation-First Training, Isaac & Jetson
NVIDIA showcased AI robotics advances using foundation models and synthetic environments for training, enabling scalable deployment in real-world sectors like agriculture and solar. Key platforms are the Isaac simulator and Jetson edge AI hardware.
Study: 10 Minutes with ChatGPT Cuts Problem-Solving Rate from 73% to 57%
Researchers from Carnegie Mellon, Oxford, MIT, and UCLA found that just 10 minutes of ChatGPT use reduced participants' independent problem-solving success from 73% to 57%. The effect was strongest in users who sought direct answers, whose performance fell below their original baseline.
PhAIL: Open Benchmark for Robot AI on Real Hardware Shows Best Model at 5% of Human Throughput
Researchers have launched PhAIL (phail.ai), an open benchmark for evaluating robot AI systems on real hardware using the DROID platform, with the best-performing model achieving only 5% of human throughput and requiring intervention every 4 minutes.
CARLA-Air Unifies CARLA and AirSim Simulators in Single Unreal Engine Process for Embodied AI
CARLA-Air merges the CARLA autonomous driving and AirSim drone simulators into one Unreal Engine process, enabling zero-latency air-ground sensor synchronization with 18 sensor types for embodied AI training.