data collection
30 articles about data collection in AI news
The Proxy-Free Web Scraping Revolution: How AI APIs Are Changing Data Collection
A new generation of web scraping APIs eliminates the need for manual proxy management, fetching thousands of pages automatically while avoiding blocks. The trend marks a shift toward AI-driven data collection infrastructure.
Indian Factory Workers Wear Head Cams to Gather Embodied AI Training Data
To overcome the high cost of robot fleet data collection, companies are deploying head cameras on human factory workers. This first-person video captures the sequencing, posture, and micro-adjustments of real work, serving as a proxy for expensive robotic action data.
Mercor Data Breach Exposes Expert Human Annotation Pipeline Used by Frontier AI Labs
Hackers have reportedly accessed Mercor's expert human data collection systems, which are used by leading AI labs to build foundation models. This breach could expose proprietary training methodologies and sensitive model development data.
UniScale: A Co-Design Framework for Data and Model Scaling in E-commerce Search Ranking
Researchers propose UniScale, a framework that jointly optimizes data collection and model architecture for search ranking rather than only scaling model parameters. It addresses the diminishing returns of parameter scaling by co-designing high-quality data with specialized modeling. The approach, validated on a large-scale e-commerce platform, shows significant gains in key business metrics.
Study: Samsung, LG Smart TVs Capture Screenshots Every 15-60 Seconds
A study from UC Davis, UCL, and UC3M found Samsung TVs capture screenshots every minute and LG TVs every 15 seconds, even when used as monitors. This automated data collection feeds into AI-driven content recommendation and advertising systems.
Win11Debloat Script Disables Copilot, Recall, Removes Windows AI Bloat
The Win11Debloat script removes Microsoft Copilot, disables the Recall screenshot AI, and strips telemetry and ads from Windows. It highlights user pushback against Microsoft's aggressive AI and data collection integration.
Privacy-First Computer Vision: Transforming Luxury Retail Analytics from Showroom to Boutique
Privacy-first computer vision platforms enable luxury retailers to analyze in-store customer behavior, optimize merchandising, and enhance clienteling without compromising personal data. This transforms physical retail intelligence with ethical data collection.
Ladybird Robot Demonstrates Solar-Powered, Multi-Sensor Microclimate Monitoring for Precision Agriculture
A solar-powered 'Ladybird' robot autonomously performs precision microclimate monitoring, tracking wind, rainfall, and leaf moisture with onboard sensors. This showcases a practical application of robotics and AI for granular, real-time agricultural data collection.
Talisman Collection: A Case Study in AI-Driven Luxury Jewelry Design
The Talisman jewelry collection represents a direct application of AI in luxury, using algorithms to generate unique designs that blend historical motifs with modern aesthetics. This is a tangible product launch, not just a concept.
DAIMANTÉ Launches 'Talisman,' an AI-Designed Luxury Jewelry Collection
New brand DAIMANTÉ debuts its AI-driven Talisman jewelry collection, merging algorithmically abstracted ancient symbols with traditional goldsmithing and lab-grown diamonds. This marks a direct entry of an 'AI-led' brand into the luxury arena.
MeiGen Emerges as the 'Ultimate Prompt Collection' for AI Image Generation
A new tool called MeiGen has surfaced, described as the 'ultimate prompt collection' for AI image creators. It scrapes high-quality prompts from top AI artists and organizes them for easy access, potentially democratizing advanced image generation techniques.
Massive Open-Source Dataset of Computer Screen Recordings Released to Train AI Agents
Researchers have released what is described as the world's largest open-source dataset of computer-use recordings on Hugging Face. The collection contains 48,478 screen recording videos totaling approximately 12,300 hours of professional software usage, licensed under CC-BY-4.0 for AI training and evaluation.
HexaCercle Demonstrates Multi-Robot Hand Control System with 3ms Latency, 0.001° Precision
HexaCercle has demonstrated a wireless system enabling one operator to control multiple dexterous robotic hands with 1:1 movement replication. The system achieves 3ms data transmission latency and 0.001° collection precision.
Build-Your-Own-X: The GitHub Repository Revolutionizing Deep Technical Learning in the AI Era
A GitHub repository compiling 'build it from scratch' tutorials has reportedly become the most-starred project in platform history, with 466,000 stars. The collection teaches developers to recreate technologies from databases to neural networks without relying on libraries, emphasizing fundamental understanding over tool usage.
OpenClaw Skills: The GitHub Repository That's Supercharging AI Agents with 1,700+ Ready-to-Use Capabilities
A new GitHub repository called 'awesome-openclaw-skills' has emerged, offering over 1,715 production-ready AI agent skills that can be installed with a single CLI command. This collection promises to dramatically accelerate AI agent development by providing pre-built capabilities ranging from browser automation to complex data processing.
Claude AI Prompts Claim to Build Hedge Fund-Level Trading Strategies
A prompt collection claims to enable Claude to build and backtest hedge fund-level trading strategies. The prompts aim to automate quantitative analysis tasks typically performed by high-paid analysts.
OpenAI Internal Model Reportedly Solves Three New Erdős Problems, Marking AI Advance in Pure Mathematics
An internal AI model at OpenAI has reportedly solved three previously unsolved mathematical problems from the Erdős collection. This development signals a potential leap in AI's capacity for abstract reasoning and formal theorem proving.
Requestly Launches Git-Synced API Client to Replace Scattered Postman Setups
Requestly has launched an AI-powered API client that automatically syncs team collections through Git, eliminating stale docs and configuration drift. The tool directly targets the collaboration pain points of Postman and Insomnia users.
FDMTL Fall/Winter 2026: A Case Study in Handcrafted Luxury vs. Generative AI
Japanese denim brand FDMTL presents its Fall/Winter 2026 collection, framing handcrafted artistry as a deliberate counterpoint to generative AI. This highlights a strategic luxury narrative valuing human imperfection in an automated age.
RF-Mem: A Dual-Path Memory Retrieval System for Personalized LLMs
Researchers propose RF-Mem, a memory retrieval system for LLMs that mimics human cognitive processes. It adaptively switches between fast 'familiarity' and deep 'recollection' paths to personalize responses efficiently, outperforming existing methods under constrained budgets.
The Multimodal Retrieval Gap: New Benchmark Exposes Critical Weakness in AI Systems
Researchers introduce MultiHaystack, a benchmark revealing that multimodal AI models struggle significantly when required to retrieve evidence from large, mixed-media collections before reasoning. While models perform well when given correct evidence, their accuracy plummets when they must first locate it across 46,000+ documents, images, and videos.
MeiGen Revolutionizes AI Art Creation with Automated Prompt Curation
MeiGen, a new open-source tool, automatically scrapes and curates trending AI image prompts from social media, solving the problem of prompt discovery and organization for digital artists. The free platform aggregates weekly collections without requiring manual bookmarking or searching.
GitHub Repository Unleashes 1,715+ Production-Ready AI Agent Skills
A new GitHub repository has surfaced containing over 1,715 production-ready AI agent skills that developers can install and deploy in seconds. This collection represents a significant leap in accessible AI tooling, potentially accelerating agent-based application development across industries.
FashionStylist: New Expert-Annotated Dataset Aims to Unify Multimodal
A new arXiv preprint introduces FashionStylist, a dataset with professional fashion annotations for item grounding, outfit completion, and outfit evaluation. It aims to address the fragmentation in existing fashion AI benchmarks by providing expert-level reasoning data.
India's Human Motion Farms Train Humanoid Robots with First-Person Hand Data
Labs in India are capturing detailed human motion data, with a focus on grip, force, and error recovery, to train AI models for humanoid robots. This addresses the critical bottleneck of acquiring physical intelligence data for robotics.
Figure CEO: Data Scarcity is the 'Only Thing' Holding Back General Robots
Figure CEO Brett Adcock asserts that solving general robotics is contingent on acquiring a 'pile of data' for training, highlighting the extreme cost and difficulty of collecting real-world robotic interaction data.
Android Phones Send Data to Google Every 4.5 Minutes, Study Finds
Research from Trinity College Dublin found Android phones send data to Google servers approximately every 270 seconds, regardless of user activity. This persistent telemetry fuels the AI training and advertising ecosystems that underpin Google's services.
FedUTR: A New Federated Recommendation Method Using Text to Combat Data Sparsity
Researchers propose FedUTR, a federated recommendation system that augments sparse user interaction data with universal textual item representations. It reports performance improvements of up to 59% over state-of-the-art methods, offering a path to better privacy-preserving personalization where user data is limited.
New arXiv Study Finds No Saturation Point for Data in Traditional Recommender Systems
A new arXiv preprint systematically tests how recommendation model performance scales with training data size. Using 10 algorithm variants across 11 large datasets, the research finds that normalized performance (NDCG@10) generally keeps improving up to 100 million interactions, with no clear saturation point for typical models.
Tencent Launches 2025 Ad Algorithm Challenge with Massive All-Modality Recommendation Datasets
Tencent has launched an open competition and released two industrial-scale datasets (TencentGR-1M and TencentGR-10M) to advance generative recommender systems. This has spurred related research into debiasing techniques and novel reranking frameworks, moving the field toward more holistic, multi-modal user modeling.
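For readers unfamiliar with the NDCG@10 metric cited in the recommender scaling study above, the following is a minimal sketch of one common formulation (linear gain with a log2 position discount). It is not taken from the preprint: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the ideal ranking is computed from the same list of relevance scores, which assumes every relevant item appears somewhere in that list.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions.

    Uses linear gain and a log2(position + 1) discount, where
    position is 1-based (one common variant; others use 2**rel - 1).
    """
    return sum(
        rel / math.log2(index + 2)  # index is 0-based, so +2
        for index, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the given ranking divided by the DCG of the
    ideal ranking (the same scores sorted in descending order).

    Returns 0.0 when no item has positive relevance.
    """
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


if __name__ == "__main__":
    # Relevance of recommended items, in the order the model ranked them.
    ranked = [0, 1, 0, 1, 0]
    print(f"NDCG@10 = {ndcg_at_k(ranked, k=10):.4f}")
```

A score of 1.0 means the relevant items are ranked as high as they could be; pushing relevant items further down the list lowers the score.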