gentic.news — AI News Intelligence Platform

data representation

30 articles about data representation in AI news

Research Challenges Assumption That Fair Model Representations Guarantee Fair Recommendations

A new arXiv study finds that optimizing recommender systems for fair representations—where demographic data is obscured in model embeddings—does improve recommendation parity. However, it warns that evaluating fairness at the representation level is a poor proxy for measuring actual recommendation fairness when comparing models.

80% relevant

BetterScene Bridges the Gap: How Aligning AI Representations Unlocks Photorealistic 3D Synthesis

Researchers introduce BetterScene, a novel AI method that dramatically improves 3D scene generation from just a handful of photos. By aligning the internal representations of a powerful video diffusion model, it produces consistent, artifact-free novel views, pushing the boundary of what's possible in computational photography and virtual world creation.

78% relevant

FedUTR: A New Federated Recommendation Method Using Text to Combat Data Sparsity

Researchers propose FedUTR, a federated recommendation system that augments sparse user interaction data with universal textual item representations. It achieves up to 59% performance improvements over state-of-the-art methods, offering a path to better privacy-preserving personalization where user data is limited.

78% relevant

Utonia AI Breakthrough: A Single Transformer Model Unifies All 3D Point Cloud Data

Researchers have developed Utonia, a single self-supervised transformer that learns unified 3D representations across diverse point cloud data types including LiDAR, CAD models, indoor scans, and video-lifted data. This breakthrough enables unprecedented cross-domain transfer and emergent behaviors in 3D AI.

85% relevant

AFMRL: Using MLLMs to Generate Attributes for Better Product Retrieval in E-Commerce

AFMRL uses MLLMs to generate product attributes, then uses those attributes to train better multimodal representations for e-commerce retrieval. It achieves state-of-the-art results on large-scale datasets.

84% relevant

Interluxe Group Launches Optima AI Index to Shape Luxury Discovery in Generative AI

The Interluxe Group has introduced the Optima AI Index, a new data standard aimed at enhancing the accuracy and visibility of luxury brand information within generative AI platforms. This initiative seeks to address the challenge of inconsistent brand discovery in AI-driven search, providing a structured foundation for brand representation.

96% relevant

SSL: Structured Skill Language Boosts Skill Discovery MRR to 0.707

Researchers propose SSL, a three-layer typed JSON representation for AI agent skills, replacing unstructured SKILL.md prose. Using an LLM normalizer, SSL improves Skill Discovery MRR from 0.573 to 0.707 and Risk Assessment macro F1 from 0.744 to 0.787 on a newly released 6,184-skill corpus.

82% relevant
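The MRR gains reported above (0.573 to 0.707) use the standard Mean Reciprocal Rank metric; a minimal sketch of how that number is computed, with made-up skill names and rankings for illustration:

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average over queries of 1/rank of the first relevant hit (0 if absent)."""
    total = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item == gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Hypothetical skill-discovery rankings: gold skill at rank 1, rank 2, and missing.
rankings = [["resize-image", "crop"], ["crop", "resize-image"], ["blur", "crop"]]
gold = ["resize-image", "resize-image", "resize-image"]
print(mean_reciprocal_rank(rankings, gold))  # → 0.5
```

Reciprocal ranks here are 1.0, 0.5, and 0, averaging to 0.5; the paper's jump from 0.573 to 0.707 means the gold skill moves noticeably closer to the top of the ranking on average.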

LLM Agents Will Reshape Personalization

Researchers propose that LLM-based assistants are reconfiguring how user representations are produced and exposed, requiring a shift toward inspectable, portable, and revisable user models across services. They identify five research fronts for the future of recommender systems.

84% relevant

LeWorldModel Solves JEPA Collapse with 15M Params, Trains on Single GPU

Researchers published LeWorldModel, solving the representation collapse problem in Yann LeCun's JEPA architecture. The 15M-parameter model trains on a single GPU and demonstrates intrinsic physics understanding.

95% relevant

Polarization by Default: New Study Audits Recommendation Bias in LLM-Based Content Selection

A controlled study of 540,000 LLM-based content selections reveals robust biases across providers. All models amplified polarization, showed negative sentiment preferences, and exhibited distinct trade-offs in toxicity handling and demographic representation, with political leaning bias being particularly persistent.

84% relevant

LLM Schema-Adaptive Method Enables Zero-Shot EHR Transfer

Researchers propose Schema-Adaptive Tabular Representation Learning, an LLM-driven method that transforms structured variables into semantic statements. It enables zero-shot alignment across unseen EHR schemas and outperforms clinical baselines, including neurologists, on dementia diagnosis tasks.

99% relevant

Embedding Matching Distills Genomic Models 200x, Matches mRNA-Bench Performance

A new distillation framework transfers mRNA representations from a large genomic foundation model to a specialized model 200x smaller. It uses embedding-level distillation, outperforming logit-based methods and competing with larger models on mRNA-bench.

86% relevant
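The "embedding-level distillation" described above can be contrasted with logit matching in a few lines: the student is trained to reproduce the teacher's representation space directly. A minimal NumPy sketch, with all dimensions and the projection head chosen for illustration (not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a large frozen teacher vs. a much smaller student encoder.
teacher_dim, student_dim, batch = 512, 64, 8
teacher_emb = rng.normal(size=(batch, teacher_dim))    # frozen teacher mRNA embeddings
student_emb = rng.normal(size=(batch, student_dim))    # student encoder outputs
proj = rng.normal(size=(student_dim, teacher_dim)) * 0.1  # learned projection head

def embedding_distill_loss(student, teacher, W):
    """Embedding-level distillation: MSE between the projected student
    representation and the teacher's embedding, rather than matching logits."""
    pred = student @ W
    return float(np.mean((pred - teacher) ** 2))

print(embedding_distill_loss(student_emb, teacher_emb, proj))
```

In a real training loop the projection and student weights would be updated by gradient descent; the point of the sketch is only that the supervision signal lives in embedding space.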

Meituan Proposes MBGR: A Generative Recommendation Framework for Multi-Business Platforms

Researchers from Meituan have published a paper on MBGR, a novel generative recommendation framework tailored for multi-business scenarios. It addresses the 'seesaw phenomenon' and 'representation confusion' that plague current methods, and has been successfully deployed on their food delivery platform.

92% relevant

Anthropic Paper: 'Emotion Concepts and their Function in LLMs' Published

Anthropic has released a new research paper titled 'Emotion Concepts and their Function in LLMs.' The work investigates the role and representation of emotional concepts within large language model architectures.

95% relevant

LeCun's Team Publishes LeWorldModel: A 15M-Parameter World Model That Mathematically Prevents Training Collapse

Yann LeCun's team has open-sourced LeWorldModel, a 15M-parameter world model that uses a novel SIGReg regularizer to make representation collapse mathematically impossible. It trains on a single GPU in hours and enables efficient physical prediction for robotics and autonomous systems.

95% relevant

CoRe Framework Integrates Equivariant Contrastive Learning for Medical Image Registration, Surpassing Baseline Methods

Researchers propose CoRe, a medical image registration framework that jointly optimizes an equivariant contrastive learning objective with the registration task. The method learns deformation-invariant feature representations, improving performance on abdominal and thoracic registration tasks.

75% relevant

Google Launches Gemini Embedding 2: A New Multimodal Foundation for AI Applications

Google has released Gemini Embedding 2, a second-generation multimodal embedding model designed to process text, images, and audio simultaneously. This technical advancement creates more unified AI representations, potentially improving search, recommendation, and personalization systems.

77% relevant

New Research: ADC-SID Framework Improves Semantic ID Generation by Denoising Collaborative Signals

A new arXiv paper proposes ADC-SID, a framework that adaptively denoises collaborative information to create more robust Semantic IDs for recommender systems. It specifically addresses the corruption of long-tail item representations, a critical problem for large retail catalogs.

90% relevant

When AI Gets Stumped: Study Reveals Language Models' 'Brain Activity' Collapses Under Pressure

New research shows that when large language models encounter difficult questions, their internal representations dramatically shrink and simplify. This 'activity collapse' reveals fundamental limitations in how current AI processes complex reasoning tasks.

85% relevant

Uber Eats Details Production System for Multilingual Semantic Search Across Stores, Dishes, and Items

Uber Eats engineers published a paper detailing their production semantic retrieval system that unifies search across stores, dishes, and grocery items using a fine-tuned Qwen2 model. The system leverages Matryoshka Representation Learning to serve multiple embedding sizes and shows substantial recall gains across six markets.

93% relevant
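Matryoshka Representation Learning, mentioned above, trains embeddings whose leading coordinates are themselves usable embeddings, so serving multiple sizes reduces to truncation plus renormalization of one stored vector. A minimal sketch (the dimensions are illustrative, not Uber Eats' actual configuration):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Matryoshka-style serving: keep the first `dim` coordinates and
    L2-renormalize, so one stored vector yields several embedding sizes."""
    small = vec[:dim]
    return small / np.linalg.norm(small)

full = np.random.default_rng(1).normal(size=768)
full /= np.linalg.norm(full)

for d in (64, 256, 768):  # hypothetical nesting sizes
    v = truncate_embedding(full, d)
    print(d, round(float(np.linalg.norm(v)), 6))  # every slice is unit-norm
```

The training objective is what makes the prefixes meaningful; at serving time the cheap operation above lets latency-sensitive stages use short vectors and reranking stages use the full one.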

REPO: The New Frontier in AI Safety That Actually Removes Toxic Knowledge from LLMs

Researchers have developed REPO, a novel method that detoxifies large language models by erasing harmful representations at the neural level. Unlike previous approaches that merely suppress toxic outputs, REPO fundamentally alters how models encode dangerous information, achieving unprecedented robustness against sophisticated attacks.

75% relevant

Cross-View AI System Masters Object Matching Without Supervision

A novel CVPR 2026 framework achieves robust object correspondence between first-person and third-person views using cycle-consistent mask prediction, eliminating the need for costly manual annotations while learning view-invariant representations.

85% relevant

Google DeepMind's Unified Latents Framework: Solving Generative AI's Core Trade-Off

Google DeepMind introduces Unified Latents (UL), a novel framework that jointly trains diffusion priors and decoders to optimize latent space representation. This approach addresses the fundamental trade-off between reconstruction quality and learnability in generative AI models.

75% relevant

DeepMind's Diffusion Breakthrough: Training Better Latents for Superior AI Generation

Google DeepMind researchers have developed new techniques for training latent representations in diffusion models, potentially leading to more efficient, higher-quality AI-generated content across images, audio, and video domains.

85% relevant

Bridging Human Language and Machine Logic: New AI Framework Achieves Near-Perfect Translation Accuracy

Researchers have developed NL2LOGIC, an AI framework that translates natural language into formal logic with 99% syntactic accuracy. By using abstract syntax trees as an intermediate representation, the system dramatically improves semantic correctness and downstream reasoning performance.

70% relevant

FalkorDB: Graph Database for Multi-Hop AI Queries in Milliseconds

FalkorDB, an open-source graph database, stores connections as a sparse matrix to accelerate multi-hop queries by 100x. Combined with built-in vector search, it enables GraphRAG systems that answer complex relational questions without pre-built indexes.

77% relevant

VoteGCL: A Novel LLM-Augmented Framework to Combat Data Sparsity in Graph-Based Recommendation

A new paper introduces VoteGCL, a framework that uses few-shot LLM prompting and majority voting to create high-confidence synthetic data for graph-based recommendation systems. It integrates this data via graph contrastive learning to improve accuracy and mitigate bias, outperforming existing baselines.

90% relevant
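The majority-voting step described above can be sketched in a few lines: sample the LLM several times and keep a synthetic label only when enough samples agree. A minimal illustration (the agreement threshold and item names are made up, not VoteGCL's actual settings):

```python
from collections import Counter

def majority_vote(samples, min_agreement=0.6):
    """Keep an LLM-generated label only when a sufficient fraction of
    few-shot samples agree, yielding high-confidence synthetic data."""
    label, count = Counter(samples).most_common(1)[0]
    return label if count / len(samples) >= min_agreement else None

# Hypothetical recommendations from 5 prompted LLM runs for one user.
print(majority_vote(["item_a", "item_a", "item_b", "item_a", "item_a"]))  # → item_a
print(majority_vote(["item_a", "item_b", "item_c", "item_a", "item_b"]))  # → None
```

Filtering out the low-agreement case is what keeps noisy LLM outputs from polluting the graph before the contrastive-learning stage.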

Columbia Prof: LLMs Can't Generate New Science, Only Map Known Data

Columbia CS Professor Vishal Misra argues LLMs cannot generate new scientific ideas because they learn structured maps of known data and fail outside those boundaries. True discovery requires creating new conceptual maps, a capability current architectures lack.

87% relevant

Nature Paper: AI Misalignment Transfers Through Numeric Data, Bypassing Filters

A Nature paper shows an AI's misaligned goals can transfer to another AI through sequences of numbers, even after filtering out harmful symbols. This challenges the safety of training on AI-generated data.

95% relevant

FashionStylist: New Expert-Annotated Dataset Aims to Unify Multimodal Fashion AI Benchmarks

A new arXiv preprint introduces FashionStylist, a dataset with professional fashion annotations for item grounding, outfit completion, and outfit evaluation. It aims to address the fragmentation in existing fashion AI benchmarks by providing expert-level reasoning data.

86% relevant