dataset

30 articles about dataset in AI news

Donate Claude Code Traces to Hugging Face's Open Dataset in One Command

Trace Commons lets Claude Code users donate anonymized session traces to an open CC-BY-4.0 dataset on Hugging Face. Run `/donate-trace` after open-source work to share how you solved problems — without exposing secrets or paths.

Jun 22, 202670% relevant

AllenAI's MolmoAct2: 720-Hour Bimanual Dataset, Beats GPT-5 on Robotics

AllenAI released MolmoAct2, an open robotics model with a 720-hour bimanual dataset, beating GPT-5 and Gemini Robotics on success rate (89.4% vs 82.1%) with 40% lower latency.

May 5, 202695% relevant

Apple Releases DFNDR-12M Dataset, Claims 5x CLIP Training Efficiency

Apple has open-sourced DFNDR-12M, a multimodal dataset of 12.8 million image-text pairs with synthetic captions and pre-computed embeddings. The company claims it enables up to 5x training efficiency over standard CLIP datasets.

Apr 22, 202685% relevant

Tencent Releases MegaStyle: 1.4M AI-Generated Image Style Dataset

Tencent has open-sourced MegaStyle, a 1.4 million image dataset for style transfer and text-to-image fine-tuning. It was generated by systematically pairing 170,000 style prompts with 400,000 content prompts using the Qwen-Image model.

Apr 21, 202685% relevant

FashionStylist: New Expert-Annotated Dataset Aims to Unify Multimodal

A new arXiv preprint introduces FashionStylist, a dataset with professional fashion annotations for item grounding, outfit completion, and outfit evaluation. It aims to address the fragmentation in existing fashion AI benchmarks by providing expert-level reasoning data.

Apr 13, 202686% relevant

Massive Video Reasoning Dataset Released, Reportedly 1000x Larger Than Predecessors

An unverified report claims the release of a video reasoning dataset roughly 1000x larger than existing benchmarks. If true, it would be a significant resource for training next-generation video understanding models.

Apr 8, 202699% relevant

Tencent Launches 2025 Ad Algorithm Challenge with Massive All-Modality Recommendation Datasets

Tencent has launched an open competition and released two industrial-scale datasets (TencentGR-1M and TencentGR-10M) to advance generative recommender systems. This has spurred related research into debiasing techniques and novel reranking frameworks, moving the field toward more holistic, multi-modal user modeling.

Apr 8, 202687% relevant

QUMPHY Project's D4 Report Establishes Six Benchmark Problems and Datasets for ML on PPG Signals

A new report from the EU-funded QUMPHY project establishes six benchmark problems and associated datasets for evaluating machine and deep learning methods on photoplethysmography (PPG) signals. This standardization effort is a foundational step for quantifying uncertainty in medical AI applications.

Apr 3, 202689% relevant

Unitree Robotics Releases UnifoLM-WBT-Dataset: A Large-Scale, Real-World Robotics Dataset for Embodied AI

Chinese robotics firm Unitree Robotics has open-sourced the UnifoLM-WBT-Dataset, a high-quality dataset derived from real-world robot operations. The release aims to accelerate training for embodied AI and large language models applied to physical systems.

Mar 28, 202685% relevant

DIET: A New Framework for Continually Distilling Streaming Datasets in Recommender Systems

Researchers propose DIET, a framework for streaming dataset distillation in recommender systems. It maintains a compact, evolving dataset (1-2% of original size) that preserves training-critical signals, reducing model iteration costs by up to 60x while maintaining performance trends.

Mar 27, 202688% relevant

Niantic's Pokémon GO Dataset of 30B Images Now Powers Centimeter-Precise Robotics Vision

Niantic's Lightship VPS, trained on 30 billion images from Pokémon GO players, now enables delivery robots to navigate with centimeter precision. The dataset represents the largest real-world visual positioning system ever created.

Mar 16, 202687% relevant

Massive Open-Source Dataset of Computer Screen Recordings Released to Train AI Agents

Researchers have released the world's largest open-source dataset of computer-use recordings on Hugging Face. The collection contains 48,478 screen recording videos totaling approximately 12,300 hours of professional software usage, licensed under CC-BY-4.0 for AI training and evaluation.

Mar 14, 202697% relevant

OpenAI's IH-Challenge Dataset: Teaching AI to Distinguish Trusted from Untrusted Instructions

OpenAI has released IH-Challenge, a novel training dataset designed to teach AI models to prioritize trusted instructions over untrusted ones. Early results indicate significant improvements in security and defenses against prompt injection attacks, marking a step toward more reliable and controllable AI systems.

Mar 11, 202697% relevant

HumanMCP Dataset Closes Critical Gap in AI Tool Evaluation

Researchers introduce HumanMCP, the first large-scale dataset featuring realistic, human-like queries for evaluating how AI systems retrieve and use tools from MCP servers. This addresses a critical limitation in current benchmarks that fail to represent real-world user interactions.

Mar 2, 202675% relevant

DeepVision-103K: The Math Dataset That Could Revolutionize AI's Visual Reasoning

Researchers have introduced DeepVision-103K, a comprehensive mathematical dataset with 103,000 verifiable visual instances designed to train multimodal AI models. Covering K-12 topics from geometry to statistics, this dataset addresses critical gaps in AI's visual reasoning capabilities.

Mar 1, 202685% relevant

DeepVision-103K: The Math Dataset That Could Revolutionize How AI 'Sees' and Reasons

Researchers have introduced DeepVision-103K, a massive dataset designed to train AI models to solve math problems by understanding both text and images. This approach could significantly improve how AI systems reason about the visual world.

Feb 20, 202678% relevant

A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

A new research paper presents a reference architecture for 'agentic hybrid retrieval' that orchestrates BM25, dense embeddings, and LLM agents to handle underspecified queries against sparse metadata. It introduces offline metadata augmentation and analyzes two architectural styles for quality attributes like governance and performance.

Apr 21, 202684% relevant

FedAgain: Dual-Trust Federated Learning Boosts Kidney Stone ID Accuracy to 94.7% on MyStone Dataset

Researchers propose FedAgain, a trust-based federated learning framework that dynamically weights client contributions using benchmark reliability and model divergence. It achieves 94.7% accuracy on kidney stone identification while maintaining robustness against corrupted data from multiple hospitals.

Mar 23, 202679% relevant

Function-Aware Fill-in-the-Middle Boosts SWE-Bench by +5.4 on 14B Models

Function-aware FIM mid-training boosts SWE-Bench by +2.8 to +5.4 on 7B-14B models, preserving general abilities. Six checkpoints and 400K dataset open-sourced.

Jul 16, 202685% relevant

Rich Sutton Launches Oak Lab to Build Self-Learning AI Agents

Rich Sutton founded Oak Lab to build self-learning AI agents. He rejects static datasets for real-time reinforcement learning with a trillion-parameter goal at 20W.

Jul 13, 2026100% relevant

AWS launches MCP server for its open-data registry

AWS launched an MCP server for its Registry of Open Data, giving AI agents natural-language access to 170+ public datasets.

Jul 7, 202675% relevant

SingGuard: Runtime Guardrails for Multimodal AI Treat Safety as Input

SingGuard treats safety rules as runtime inputs for multimodal AI, achieving SOTA across 6 families and 35 datasets via fast/slow reasoning.

Jun 30, 202685% relevant

How Simon Willison Ported a 0.2B Image Model to the Browser with Claude

Simon Willison used Claude Code to port a 0.2B image inpainting model to WebGPU, running it as a parallel side project while his main agent worked on Datasette. The technique? Research with Claude.ai, then hand off to Claude Code with research.md.

Jun 22, 202670% relevant

LOCUS-v1: 2.2M US Laws Hit HuggingFace via AI Pipeline

LOCUS-v1, a dataset of 2.2M US laws built via AI pipeline, released on HuggingFace. First comprehensive legal database of its kind, but quality and validation metrics remain undisclosed.

Jun 21, 202689% relevant

How to Build Claude Code Tools That Ask Users Questions Mid-Execution

Datasette Agent 0.2a0's `context.ask_user()` lets tools pause for user input mid-execution. Claude Code users can adopt this pattern for safer, more interactive tool workflows.

Jun 10, 202685% relevant

SpatialBench: New Benchmark Tests Foundation Models on 3D Tasks

SpatialBench, a new benchmark from ropedia_ai, evaluates spatial foundation models across 7 tasks and 5 datasets, testing depth estimation, surface normal prediction, and 3D object detection.

May 27, 202691% relevant

Federated Fine-Tuning Benchmark Shows QLoRA Nears Centralized Accuracy on

Sherpa.ai's arXiv benchmark shows federated fine-tuning with QLoRA matches centralized accuracy on four healthcare and finance datasets, outperforming isolated single-institution learning under non-IID conditions.

May 15, 202688% relevant

LASAR Cuts Latent Reasoning Steps in Half for GenRec at 20x Speedup Over CoT

LASAR nearly halves latent reasoning steps and achieves 20x speedup over explicit CoT in generative recommendation, outperforming baselines on three datasets.

May 12, 202680% relevant

Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks

A no-training graph heuristic beats generative recommenders on 10 of 14 benchmarks, exposing shortcut-solvable datasets. Relative NDCG@10 gains hit 44% on Amazon CDs.

May 11, 2026100% relevant

OpenAI Privacy Filter Gets 6x More PII Labels via Nvidia Data

OpenAI has retrained its privacy filter using Nvidia's Nemotron-PII dataset, expanding PII detection from 8 to over 50 label types, targeting healthcare and enterprise use cases with better accuracy.

Apr 28, 202685% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety