data collection

30 articles about data collection in AI news

The Proxy-Free Web Scraping Revolution: How AI APIs Are Changing Data Collection

A new generation of web scraping APIs eliminates the need for manual proxy management, handling thousands of pages automatically while avoiding blocks. This represents a major shift toward AI-driven data collection infrastructure.

85% relevant

Indian Factory Workers Wear Head Cams to Gather Embodied AI Training Data

To overcome the high cost of robot fleet data collection, companies are deploying head cameras on human factory workers. This first-person video captures the sequencing, posture, and micro-adjustments of real work, serving as a proxy for expensive robotic action data.

95% relevant

Mercor Data Breach Exposes Expert Human Annotation Pipeline Used by Frontier AI Labs

Hackers have reportedly accessed Mercor's expert human data collection systems, which are used by leading AI labs to build foundation models. This breach could expose proprietary training methodologies and sensitive model development data.

91% relevant

UniScale: A Co-Design Framework for Data and Model Scaling in E-commerce Search Ranking

Researchers propose UniScale, a framework that jointly optimizes data collection and model architecture for search ranking rather than scaling model parameters alone. It addresses the diminishing returns of parameter scaling by pairing higher-quality data with specialized modeling. The approach, validated on a large-scale e-commerce platform, shows significant gains in key business metrics.

95% relevant

Study: Samsung, LG Smart TVs Capture Screenshots Every 15-60 Seconds

A study from UC Davis, UCL, and UC3M found Samsung TVs capture screenshots every minute and LG TVs every 15 seconds, even when used as monitors. This automated data collection feeds into AI-driven content recommendation and advertising systems.

97% relevant

Win11Debloat Script Disables Copilot, Recall, Removes Windows AI Bloat

The Win11Debloat script removes Microsoft Copilot, disables the Recall screenshot AI, and strips telemetry and ads from Windows. It highlights user pushback against Microsoft's aggressive AI and data collection integration.

85% relevant

Privacy-First Computer Vision: Transforming Luxury Retail Analytics from Showroom to Boutique

Privacy-first computer vision platforms enable luxury retailers to analyze in-store customer behavior, optimize merchandising, and enhance clienteling without compromising personal data. This transforms physical retail intelligence with ethical data collection.

85% relevant

Ladybird Robot Demonstrates Solar-Powered, Multi-Sensor Microclimate Monitoring for Precision Agriculture

A solar-powered 'Ladybird' robot autonomously performs precision microclimate monitoring, tracking wind, rainfall, and leaf moisture with onboard sensors. This showcases a practical application of robotics and AI for granular, real-time agricultural data collection.

85% relevant

Talisman Collection: A Case Study in AI-Driven Luxury Jewelry Design

The Talisman jewelry collection represents a direct application of AI in luxury, using algorithms to generate unique designs that blend historical motifs with modern aesthetics. This is a tangible product launch, not just a concept.

88% relevant

DAIMANTÉ Launches 'Talisman,' an AI-Designed Luxury Jewelry Collection

New brand DAIMANTÉ debuts its AI-driven Talisman jewelry collection, merging algorithmically abstracted ancient symbols with traditional goldsmithing and lab-grown diamonds. This marks a direct entry of an 'AI-led' brand into the luxury arena.

75% relevant

MeiGen Emerges as the 'Ultimate Prompt Collection' for AI Image Generation

A new tool called MeiGen has surfaced, described as the 'ultimate prompt collection' for AI image creators. It scrapes high-quality prompts from top AI artists and organizes them for easy access, potentially democratizing advanced image generation techniques.

85% relevant

Massive Open-Source Dataset of Computer Screen Recordings Released to Train AI Agents

Researchers have released the world's largest open-source dataset of computer-use recordings on Hugging Face. The collection contains 48,478 screen recording videos totaling approximately 12,300 hours of professional software usage, licensed under CC-BY-4.0 for AI training and evaluation.

97% relevant

HexaCercle Demonstrates Multi-Robot Hand Control System with 3ms Latency, 0.001° Precision

HexaCercle has demonstrated a wireless system enabling one operator to control multiple dexterous robotic hands with 1:1 movement replication. The system achieves 3ms data transmission latency and 0.001° collection precision.

87% relevant

Build-Your-Own-X: The GitHub Repository Revolutionizing Deep Technical Learning in the AI Era

A GitHub repository compiling 'build it from scratch' tutorials has become the most-starred project in platform history with 466,000 stars. The collection teaches developers to recreate technologies from databases to neural networks without libraries, emphasizing fundamental understanding over tool usage.

85% relevant

OpenClaw Skills: The GitHub Repository That's Supercharging AI Agents with 1,700+ Ready-to-Use Capabilities

A new GitHub repository called 'awesome-openclaw-skills' has emerged, offering over 1,715 production-ready AI agent skills that can be installed with a single CLI command. This collection promises to dramatically accelerate AI agent development by providing pre-built capabilities ranging from browser automation to complex data processing.

85% relevant

Claude AI Prompts Claim to Build Hedge Fund-Level Trading Strategies

A prompt collection claims to enable Claude to build and backtest hedge fund-level trading strategies. The prompts aim to automate quantitative analysis tasks typically performed by high-paid analysts.

87% relevant

OpenAI Internal Model Reportedly Solves Three New Erdős Problems, Marking AI Advance in Pure Mathematics

An internal AI model at OpenAI has reportedly solved three previously unsolved mathematical problems from the Erdős collection. This development signals a potential leap in AI's capacity for abstract reasoning and formal theorem proving.

85% relevant

Requestly Launches Git-Synced API Client to Replace Scattered Postman Setups

Requestly has launched an AI-powered API client that automatically syncs team collections through Git, eliminating stale docs and configuration drift. The tool directly targets the collaboration pain points of Postman and Insomnia users.

85% relevant

FDMTL Fall/Winter 2026: A Case Study in Handcrafted Luxury vs. Generative AI

Japanese denim brand FDMTL presents its Fall/Winter 2026 collection, framing handcrafted artistry as a deliberate counterpoint to generative AI. This highlights a strategic luxury narrative valuing human imperfection in an automated age.

72% relevant

RF-Mem: A Dual-Path Memory Retrieval System for Personalized LLMs

Researchers propose RF-Mem, a memory retrieval system for LLMs that mimics human cognitive processes. It adaptively switches between fast 'familiarity' and deep 'recollection' paths to personalize responses efficiently, outperforming existing methods under constrained budgets.

77% relevant

The Multimodal Retrieval Gap: New Benchmark Exposes Critical Weakness in AI Systems

Researchers introduce MultiHaystack, a benchmark revealing that multimodal AI models struggle significantly when required to retrieve evidence from large, mixed-media collections before reasoning. While models perform well when given correct evidence, their accuracy plummets when they must first locate it across 46,000+ documents, images, and videos.

80% relevant

MeiGen Revolutionizes AI Art Creation with Automated Prompt Curation

MeiGen, a new open-source tool, automatically scrapes and curates trending AI image prompts from social media, solving the problem of prompt discovery and organization for digital artists. The free platform aggregates weekly collections without requiring manual bookmarking or searching.

85% relevant

GitHub Repository Unleashes 1,715+ Production-Ready AI Agent Skills

A new GitHub repository has surfaced containing over 1,715 production-ready AI agent skills that developers can install and deploy in seconds. This collection represents a significant leap in accessible AI tooling, potentially accelerating agent-based application development across industries.

85% relevant

FashionStylist: New Expert-Annotated Dataset Aims to Unify Multimodal

A new arXiv preprint introduces FashionStylist, a dataset with professional fashion annotations for item grounding, outfit completion, and outfit evaluation. It aims to address the fragmentation in existing fashion AI benchmarks by providing expert-level reasoning data.

86% relevant

India's Human Motion Farms Train Humanoid Robots with First-Person Hand Data

Labs in India are capturing detailed human motion data, focusing on grip, force, and error recovery, to train AI models for humanoid robots. This addresses the critical bottleneck of acquiring physical intelligence data for robotics.

89% relevant

Figure CEO: Data Scarcity is the 'Only Thing' Holding Back General Robots

Figure CEO Brett Adcock asserts that solving general robotics is contingent on acquiring a 'pile of data' for training, highlighting the extreme cost and difficulty of collecting real-world robotic interaction data.

85% relevant

Android Phones Send Data to Google Every 4.5 Minutes, Study Finds

Research from Trinity College Dublin found Android phones send data to Google servers approximately every 270 seconds, regardless of user activity. This persistent telemetry fuels the AI training and advertising ecosystems that underpin Google's services.

87% relevant

FedUTR: A New Federated Recommendation Method Using Text to Combat Data Sparsity

Researchers propose FedUTR, a federated recommendation system that augments sparse user interaction data with universal textual item representations. It achieves up to 59% performance improvements over state-of-the-art methods, offering a path to better privacy-preserving personalization where user data is limited.

78% relevant

New arXiv Study Finds No Saturation Point for Data in Traditional Recommender Systems

A new arXiv preprint systematically tests how recommendation model performance scales with training data size. Using 10 algorithm variants across 11 large datasets, the research finds that normalized performance (NDCG@10) generally keeps improving up to 100 million interactions, with no clear saturation point for typical models.

90% relevant
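
For readers unfamiliar with the NDCG@10 metric named in the recommender scaling summary above, here is a minimal sketch of how it is commonly computed. It is not taken from the preprint. The function names are hypothetical, the gain is linear (equivalent to the exponential form for binary relevance), and the ideal ranking is built only from the relevance labels passed in, which is an assumption a real evaluation may not share.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions.

    relevances[i] is the relevance label of the item ranked at position i
    (0-based). Each gain is discounted by log2(position + 2), so the first
    position has a discount of log2(2) = 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG@k divided by the DCG@k of the best possible ordering.

    Assumption: the ideal ordering is obtained by sorting the same labels
    in descending order. Returns 0.0 when there is no relevant item.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Hypothetical ranked list: 1 = the user interacted with the item, 0 = did not.
    ranked_labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    print(f"NDCG@10 = {ndcg_at_k(ranked_labels):.4f}")
```

A score of 1.0 means the relevant items already sit at the top of the list. Scores fall as relevant items are pushed further down, which is why the metric is a common yardstick when comparing models trained on different amounts of data.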

Tencent Launches 2025 Ad Algorithm Challenge with Massive All-Modality Recommendation Datasets

Tencent has launched an open competition and released two industrial-scale datasets (TencentGR-1M and TencentGR-10M) to advance generative recommender systems. This has spurred related research into debiasing techniques and novel reranking frameworks, moving the field toward more holistic, multi-modal user modeling.

87% relevant