dataset
30 articles about datasets in AI news
QUMPHY Project's D4 Report Establishes Six Benchmark Problems and Datasets for ML on PPG Signals
A new report from the EU-funded QUMPHY project establishes six benchmark problems and associated datasets for evaluating machine and deep learning methods on photoplethysmography (PPG) signals. This standardization effort is a foundational step for quantifying uncertainty in medical AI applications.
Unitree Robotics Releases UnifoLM-WBT-Dataset: A Large-Scale, Real-World Robotics Dataset for Embodied AI
Chinese robotics firm Unitree Robotics has open-sourced the UnifoLM-WBT-Dataset, a high-quality dataset derived from real-world robot operations. The release aims to accelerate training for embodied AI and large language models applied to physical systems.
DIET: A New Framework for Continually Distilling Streaming Datasets in Recommender Systems
Researchers propose DIET, a framework for streaming dataset distillation in recommender systems. It maintains a compact, evolving dataset (1-2% of original size) that preserves training-critical signals, reducing model iteration costs by up to 60x while maintaining performance trends.
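The core idea of keeping a compact, continually updated subset of a stream can be sketched with a fixed-size buffer that evicts the least training-critical samples. This is an illustrative simplification, not DIET's actual distillation objective: here "importance" is just a caller-supplied score (e.g., a loss or gradient-norm proxy).

```python
import heapq
import random

class DistilledBuffer:
    """Keep a fixed-size, continually refreshed subset of a data stream.

    Illustrative only: DIET's distillation objective is more sophisticated;
    this sketch retains the top-scoring samples seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []      # min-heap of (score, tiebreak, sample)
        self._counter = 0    # tiebreaker so samples are never compared

    def offer(self, sample, score):
        """Keep the sample if it outranks the current least important one."""
        self._counter += 1
        item = (score, self._counter, sample)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)

    def samples(self):
        return [s for _, _, s in self._heap]

# Simulate a stream of 10,000 interactions, retaining ~1.5%.
random.seed(0)
buf = DistilledBuffer(capacity=150)
for i in range(10_000):
    buf.offer(f"interaction-{i}", score=random.random())
print(len(buf.samples()))  # 150 retained out of 10,000
```

The min-heap makes each insertion O(log capacity), which matters when the distilled set must be maintained online as the stream evolves.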
Unsloth Studio: Open-Source Web App Cuts VRAM Usage for Local LLM Training and Dataset Creation
Unsloth has launched Unsloth Studio, an open-source web application that enables users to run, train, compare, and export hundreds of LLMs locally with significantly reduced VRAM consumption. It also converts PDF, CSV, and DOCX files into training datasets.
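The file-to-dataset step is conceptually a conversion from tabular or document content into chat-style training records. A minimal sketch of that idea for CSV input follows; the column names and record schema here are assumptions for illustration, not Unsloth's actual format.

```python
import csv
import io
import json

def csv_to_jsonl(csv_text, prompt_col, response_col):
    """Convert a prompt/response CSV into chat-style JSONL training records.

    Illustrative of the kind of file-to-dataset conversion a tool like
    Unsloth Studio automates; the schema below is an assumption."""
    rows = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in rows:
        record = {"messages": [
            {"role": "user", "content": row[prompt_col]},
            {"role": "assistant", "content": row[response_col]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

raw = "question,answer\nWhat is PPG?,Photoplethysmography\n2+2?,4\n"
print(csv_to_jsonl(raw, "question", "answer"))
```

One JSON object per line (JSONL) is the de facto format most fine-tuning pipelines consume, which is why converters typically target it.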
Niantic's Pokémon GO Dataset of 30B Images Now Powers Centimeter-Precise Robotics Vision
Niantic's Lightship VPS, trained on 30 billion images from Pokémon GO players, now enables delivery robots to navigate with centimeter precision. The dataset represents the largest real-world visual positioning system ever created.
Massive Open-Source Dataset of Computer Screen Recordings Released to Train AI Agents
Researchers have released the world's largest open-source dataset of computer-use recordings on Hugging Face. The collection contains 48,478 screen recording videos totaling approximately 12,300 hours of professional software usage, licensed under CC-BY-4.0 for AI training and evaluation.
OpenAI's IH-Challenge Dataset: Teaching AI to Distinguish Trusted from Untrusted Instructions
OpenAI has released IH-Challenge, a novel training dataset designed to teach AI models to prioritize trusted instructions over untrusted ones. Early results indicate significant improvements in security and defenses against prompt injection attacks, marking a step toward more reliable and controllable AI systems.
HumanMCP Dataset Closes Critical Gap in AI Tool Evaluation
Researchers introduce HumanMCP, the first large-scale dataset featuring realistic, human-like queries for evaluating how AI systems retrieve and use tools from MCP servers. This addresses a critical limitation in current benchmarks that fail to represent real-world user interactions.
DeepVision-103K: The Math Dataset That Could Revolutionize AI's Visual Reasoning
Researchers have introduced DeepVision-103K, a comprehensive mathematical dataset with 103,000 verifiable visual instances designed to train multimodal AI models. Covering K-12 topics from geometry to statistics, this dataset addresses critical gaps in AI's visual reasoning capabilities.
DeepVision-103K: The Math Dataset That Could Revolutionize How AI 'Sees' and Reasons
Researchers have introduced DeepVision-103K, a massive dataset designed to train AI models to solve math problems by understanding both text and images. This approach could significantly improve how AI systems reason about the visual world.
FedAgain: Dual-Trust Federated Learning Boosts Kidney Stone ID Accuracy to 94.7% on MyStone Dataset
Researchers propose FedAgain, a trust-based federated learning framework that dynamically weights client contributions using benchmark reliability and model divergence. It achieves 94.7% accuracy on kidney stone identification while maintaining robustness against corrupted data from multiple hospitals.
Google's RT-X Project Establishes New Robot Learning Standard
Google's RT-X project has established a new standard for robot learning by creating a unified dataset of detailed human demonstrations across 22 institutions and 30+ robot types. This enables large-scale cross-robot training previously impossible with fragmented data.
Generative World Renderer: 4M+ RGB/G-Buffer Frames from Cyberpunk 2077 & Black Myth: Wukong Released for Inverse Graphics
A new framework extracts over 4 million synchronized RGB and G-buffer frames from Cyberpunk 2077 and Black Myth: Wukong, released as a dataset that enables AI models to learn inverse material decomposition and controllable game environment editing.
Neural Movie Recommenders: A Technical Tutorial on Building with MovieLens Data
This Medium article provides a hands-on tutorial for implementing neural recommendation systems using the MovieLens dataset. It covers practical implementation details for both dataset sizes, serving as an educational resource for engineers building similar systems.
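Tutorials like this typically start from the simplest learned-embedding recommender: a user vector, an item vector, and their dot product trained against observed ratings. A pure-Python sketch on toy data (not the article's code, and far smaller than MovieLens) shows the core loop:

```python
import random

def train_mf(ratings, n_users, n_items, dim=8, lr=0.05, reg=0.02, epochs=300):
    """Matrix-factorization recommender trained with SGD.

    Toy illustration of the embedding dot-product building block that
    neural recommenders extend; not the article's implementation."""
    random.seed(42)
    P = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][k] * Q[i][k] for k in range(dim))
            err = r - pred
            for k in range(dim):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pk * qk for pk, qk in zip(P[u], Q[i]))

# Tiny synthetic log of (user, item, rating in [1, 5]).
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
           (1, 2, 2.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = train_mf(ratings, n_users=3, n_items=3)
print(predict(P, Q, 0, 0))
```

Neural variants replace the dot product with an MLP over concatenated embeddings, but the data pipeline (user and item IDs mapped to learned vectors) is the same.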
Google Open-Sources TimesFM: A 100B-Point Time Series Foundation Model for Zero-Shot Forecasting
Google has open-sourced TimesFM, a foundation model for time series forecasting trained on 100 billion real-world time points. It requires no dataset-specific training and can generate predictions instantly for domains like traffic, weather, and demand.
New Research Proposes a Training-Free Method to Estimate Accuracy Limits for Sequential Recommenders
Researchers propose an entropy-based, model-agnostic estimator to quantify the intrinsic accuracy ceiling of sequential recommendation tasks. This allows teams to assess dataset difficulty and potential model headroom before development, and can guide data-centric decisions like user stratification.
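A simpler, related quantity conveys the intuition behind such ceilings (the paper's exact estimator is not reproduced here): with only the previous item as context, no predictor can beat the count-weighted average of the modal next-item probability.

```python
from collections import Counter, defaultdict

def top1_ceiling(sequences):
    """Empirical ceiling on next-item hit@1 using first-order context.

    A simplified, related proxy for the entropy-based estimator in the
    paper: any model seeing only the previous item can at best predict
    the modal next item, so the ceiling is the weighted average of
    max_i P(next = i | prev)."""
    nxt = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            nxt[prev][cur] += 1
    total = sum(sum(c.values()) for c in nxt.values())
    best = sum(max(c.values()) for c in nxt.values())
    return best / total

# Toy logs: after 'a' the next item is ambiguous (b or c);
# after 'b' and 'c' it is deterministic.
logs = [["a", "b", "d"], ["a", "c", "d"], ["a", "b", "d"]]
print(top1_ceiling(logs))  # → 0.8333333333333334
```

If this ceiling is low, no amount of model capacity will push hit@1 past it under that context definition, which is exactly the kind of pre-development signal the paper argues for.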
KitchenTwin: VLM-Guided Scale Recovery Fuses Global Point Clouds with Object Meshes for Metric Digital Twins
Researchers propose KitchenTwin, a scale-aware 3D fusion framework that registers object meshes with transformer-predicted global point clouds using VLM-guided geometric anchors. The method resolves fundamental coordinate mismatches to build metrically consistent digital twins for embodied AI, and releases an open-source dataset.
How This Developer Built a Production-Ready RAG System with Claude Code in One Weekend
A developer used Claude Code to create a structured JSON-to-PDF knowledge base with 105 quotes, demonstrating a fast, repeatable workflow for building RAG-ready datasets.
OpenCSF: A 1.5TB Free Computer Science Library Emerges from Unstructured Web Data
A new open-source dataset called OpenCSF has been compiled, containing 1.5TB of computer science materials scraped from public web sources. It provides a massive, free corpus for AI training and research in software engineering and CS education.
ReBOL: A New AI Retrieval Method Combines Bayesian Optimization with LLMs to Improve Search
Researchers propose ReBOL, a retrieval method that combines Bayesian optimization with LLM relevance scoring. It outperforms standard LLM rerankers on recall, achieving 46.5% versus 35.0% recall@100 on one dataset with comparable latency.
AgenticGEO: Self-Evolving AI Framework for Generative Search Engine Optimization Outperforms 14 Baselines
Researchers propose AgenticGEO, an AI framework that evolves content strategies to maximize inclusion in generative search engine outputs. It uses MAP-Elites and a Co-Evolving Critic to reduce costly API calls, achieving state-of-the-art performance across 3 datasets.
MIPO: A Novel Self-Improvement Method for LLMs That Enhances Personalization Without New Data
Researchers propose Mutual Information Preference Optimization (MIPO), a contrastive data augmentation technique that improves LLM personalization by 3-40% on real-user datasets without requiring additional labeled data or human supervision.
Visual Product Search Benchmark: A Rigorous Evaluation of Embedding Models for Industrial and Retail Applications
A new benchmark evaluates modern visual embedding models for exact product identification from images. It tests models on realistic industrial and retail datasets, providing crucial insights for deploying reliable visual search systems where errors are costly.
HuggingFace Launches Daily Papers SKILL.md for AI Agents to Read, Search, and Fetch Research Papers
HuggingFace released Daily Papers SKILL.md, a tool enabling AI agents to read paper content as markdown, search papers, find linked models/datasets, and fetch papers via API.
ReFORM: A New LLM Framework for Multi-Factor Recommendation from User Reviews
Researchers propose ReFORM, a novel recommendation framework that uses LLMs to generate factor-specific user and item profiles from reviews, then applies multi-factor attention to personalize suggestions. It outperforms state-of-the-art baselines on restaurant datasets, offering a more nuanced approach to personalization.
A Counterfactual Approach for Addressing Individual User Unfairness in Collaborative Recommender Systems
A new arXiv paper proposes a dual-step method to identify and mitigate individual user unfairness in collaborative filtering systems. It uses counterfactual perturbations to improve embeddings for underserved users, validated on retail datasets like Amazon Beauty.
New Research Identifies Data Quality as Key Bottleneck in Multimodal Forecasting
A new arXiv paper introduces CAF-7M, a 7-million-sample dataset for context-aided forecasting. The research shows that poor context quality, not model architecture, has limited multimodal forecasting performance. This has implications for retail demand prediction that combines numerical data with text or image context.
AI Learns Like Humans: New System Trains Language Models Through Everyday Conversations
Researchers have developed a breakthrough system that enables language models to learn continuously from everyday conversations rather than static datasets. This approach mimics human learning patterns and could revolutionize how AI systems acquire and update knowledge.
Anthropic's Pricing Revolution: Million-Token Context Now Standard for Claude AI
Anthropic has eliminated the 5x surcharge for million-token contexts in Claude 3 Opus and Claude 3.5 Sonnet, making long-context AI dramatically more affordable. This pricing overhaul removes barriers for developers analyzing large documents, codebases, and datasets.
Google's Groundsource: Using AI to Mine Historical Disaster Data from Global News
Google AI Research has unveiled Groundsource, a novel methodology using the Gemini model to transform unstructured global news reports into structured historical datasets. The system addresses critical data gaps in disaster management, starting with 2.6 million urban flash flood events.